* Re: [PATCH v6 05/16] Documentation/rv: Add documentation about hybrid automata
From: Gabriele Monaco @ 2026-03-02 14:23 UTC
To: Juri Lelli
Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Jonathan Corbet, linux-trace-kernel, linux-doc, Tomas Glozar,
Clark Williams, John Kacur
In-Reply-To: <aaWXmBVIvTlVtiRp@jlelli-thinkpadt14gen4.remote.csb>
Hello,
On Mon, 2026-03-02 at 14:58 +0100, Juri Lelli wrote:
> Considering the spec above, does the 'event' need to be 'enqueue'
> instead of 'sched_wakeup' (or the other way around)? Or maybe it's
> equivalent?
Good catch. In fact enqueue/dequeue don't work well for this model (the actual
stall monitor uses wakeup), but since the example is already simplified, I
should keep it consistent.
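For illustration only, a consistent variant of that snippet could use the event
names from the spec. The sketch below is a standalone userspace model, not the
documentation's code: the model clock (`now`, `clk_start`), the helpers
`get_clk()`/`reset_clk()` and the threshold value are assumptions standing in
for get_env(clk)/reset_env(clk), and the next_state argument is dropped for
brevity.

```c
#include <stdbool.h>
#include <stdint.h>

enum states { dequeued, enqueued, running };
enum events { enqueue, dequeue, switch_in };

/* Model clock (assumption): 'now' is advanced by whoever drives the model. */
static uint64_t now;
static uint64_t clk_start;			/* time of the last reset(clk) */
static const uint64_t threshold = 100;		/* arbitrary value for the sketch */

static uint64_t get_clk(void)
{
	return now - clk_start;
}

static void reset_clk(void)
{
	clk_start = now;
}

/* Same shape as the quoted helper, with event names matching the spec. */
static bool verify_constraint(enum states curr_state, enum events event)
{
	bool res = true;

	/* Guard: leaving 'enqueued' is only valid while clk < threshold. */
	if (curr_state == enqueued && event == switch_in)
		res = get_clk() < threshold;
	/* Reset: the clock restarts whenever the task gets enqueued. */
	else if (curr_state == dequeued && event == enqueue)
		reset_clk();

	return res;
}
```

In this model, an enqueue at time 10 followed by a switch_in at time 50 passes
the guard, while a switch_in 190 time units after the enqueue fails it.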
Thanks,
Gabriele
* Re: [PATCH v4 5/5] mm: add tracepoints for zone lock
From: Vlastimil Babka (SUSE) @ 2026-03-02 14:16 UTC
To: Dmitry Ilvokhin, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt
Cc: linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
"linux-cxl
In-Reply-To: <ae145fe890f028409f727b4921904b547346fa0b.1772206930.git.d@ilvokhin.com>
On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> Add tracepoint instrumentation to zone lock acquire/release operations
> via the previously introduced wrappers.
>
> The implementation follows the mmap_lock tracepoint pattern: a
> lightweight inline helper checks whether the tracepoint is enabled and
> calls into an out-of-line helper when tracing is active. When
> CONFIG_TRACING is disabled, helpers compile to empty inline stubs.
>
> The fast path is unaffected when tracing is disabled.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Agree with Steven; otherwise
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
* Re: [PATCH v6 12/16] sched: Add deadline tracepoints
From: Juri Lelli @ 2026-03-02 14:15 UTC
To: Gabriele Monaco
Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Masami Hiramatsu, Ingo Molnar, Peter Zijlstra, linux-trace-kernel,
Phil Auld, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260225095122.80683-13-gmonaco@redhat.com>
Hello,
On 25/02/26 10:51, Gabriele Monaco wrote:
> Add the following tracepoints:
>
> * sched_dl_throttle(dl_se, cpu, type):
> Called when a deadline entity is throttled
> * sched_dl_replenish(dl_se, cpu, type):
> Called when a deadline entity's runtime is replenished
> * sched_dl_update(dl_se, cpu, type):
> Called when a deadline entity updates without throttle or replenish
> * sched_dl_server_start(dl_se, cpu, type):
> Called when a deadline server is started
> * sched_dl_server_stop(dl_se, cpu, type):
> Called when a deadline server is stopped
>
> Those tracepoints can be useful to validate the deadline scheduler with
> RV and are not exported to tracefs.
>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>
> Notes:
> V6:
> * Add dl_se type to differentiate between fair and ext servers
> * Add event to track dl_update_curr not firing other events
> V3:
> * Rename dl argument to dl_se in tracepoints
>
> include/trace/events/sched.h | 26 ++++++++++++++++++++++++++
> kernel/sched/core.c | 4 ++++
> kernel/sched/deadline.c | 25 ++++++++++++++++++++++++-
> 3 files changed, 54 insertions(+), 1 deletion(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 5844147ec5fd..944d65750a64 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -904,6 +904,32 @@ DECLARE_TRACE(sched_dequeue,
> TP_PROTO(struct task_struct *tsk, int cpu),
> TP_ARGS(tsk, cpu));
>
> +#define DL_OTHER 0
> +#define DL_TASK 1
> +#define DL_SERVER_FAIR 2
> +#define DL_SERVER_EXT 3
> +
> +DECLARE_TRACE(sched_dl_throttle,
> + TP_PROTO(struct sched_dl_entity *dl_se, int cpu, uint8_t type),
> + TP_ARGS(dl_se, cpu, type));
> +
> +DECLARE_TRACE(sched_dl_replenish,
> + TP_PROTO(struct sched_dl_entity *dl_se, int cpu, uint8_t type),
> + TP_ARGS(dl_se, cpu, type));
> +
> +/* Call to update_curr_dl_se not involving throttle or replenish */
> +DECLARE_TRACE(sched_dl_update,
> + TP_PROTO(struct sched_dl_entity *dl_se, int cpu, uint8_t type),
> + TP_ARGS(dl_se, cpu, type));
> +
> +DECLARE_TRACE(sched_dl_server_start,
> + TP_PROTO(struct sched_dl_entity *dl_se, int cpu, uint8_t type),
> + TP_ARGS(dl_se, cpu, type));
> +
> +DECLARE_TRACE(sched_dl_server_stop,
> + TP_PROTO(struct sched_dl_entity *dl_se, int cpu, uint8_t type),
> + TP_ARGS(dl_se, cpu, type));
> +
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4ca79ff58fca..b5bb2eb112bf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -124,6 +124,10 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_enqueue_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dequeue_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_throttle_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_replenish_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_start_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_stop_tp);
Don't we need to export sched_dl_update_tp as well?
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
...
> @@ -1532,7 +1551,8 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>
> if (!is_leftmost(dl_se, &rq->dl))
> resched_curr(rq);
> - }
> + } else
> + trace_sched_dl_update_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
This wants braces even if it's a single statement.
>
> /*
> * The dl_server does not account for real-time workload because it
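For reference on the braces comment above: Documentation/process/coding-style.rst
asks for braces on both branches when only one branch of a conditional is a
single statement. A standalone illustration with made-up names (not the patch's
code) could look like this:

```c
/* Hypothetical counters, only here to give both branches something to do. */
static int nr_throttle, nr_update;

/* Clamp the remaining runtime at zero and count which path was taken. */
static int account(int runtime_left)
{
	if (runtime_left <= 0) {
		nr_throttle++;
		runtime_left = 0;
	} else {
		/* Single statement, but braced because the other branch is. */
		nr_update++;
	}

	return runtime_left;
}
```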
Thanks,
Juri
* Re: [PATCH v4 4/5] mm: rename zone->lock to zone->_lock
From: Vlastimil Babka (SUSE) @ 2026-03-02 14:10 UTC
To: Dmitry Ilvokhin, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt
Cc: linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
"linux-cxl, SeongJae Park
In-Reply-To: <d61500c5784c64e971f4d328c57639303c475f81.1772206930.git.d@ilvokhin.com>
On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> This intentionally breaks direct users of zone->lock at compile time so
> all call sites are converted to the zone lock wrappers. Without the
> rename, present and future out-of-tree code could continue using
> spin_lock(&zone->lock) and bypass the wrappers and tracing
> infrastructure.
>
> No functional change intended.
>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: SeongJae Park <sj@kernel.org>
I see some more instances of 'zone->lock' in comments in
include/linux/mmzone.h and under Documentation/ but otherwise LGTM.
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
* Re: [PATCH v4 3/5] mm: convert compaction to zone lock wrappers
From: Vlastimil Babka (SUSE) @ 2026-03-02 14:02 UTC
To: Dmitry Ilvokhin, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt
Cc: linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
"linux-cxl
In-Reply-To: <3a09e46f52cf9f709b0725bc2b648cc5212843b2.1772206930.git.d@ilvokhin.com>
On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> Compaction uses compact_lock_irqsave(), which currently operates
> on a raw spinlock_t pointer so it can be used for both zone->lock
> and lruvec->lru_lock. Since zone lock operations are now wrapped,
> compact_lock_irqsave() can no longer directly operate on a
> spinlock_t when the lock belongs to a zone.
>
> Split the helper into compact_zone_lock_irqsave() and
> compact_lruvec_lock_irqsave(), duplicating the small amount of
> shared logic. As there are only two call sites and both statically
> know the lock type, this avoids introducing additional abstraction
> or runtime dispatch in the compaction path.
>
> No functional change intended.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
* Re: [PATCH v6 05/16] Documentation/rv: Add documentation about hybrid automata
From: Juri Lelli @ 2026-03-02 13:58 UTC
To: Gabriele Monaco
Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
Jonathan Corbet, linux-trace-kernel, linux-doc, Tomas Glozar,
Clark Williams, John Kacur
In-Reply-To: <20260225095122.80683-6-gmonaco@redhat.com>
Hello,
On 25/02/26 10:51, Gabriele Monaco wrote:
> Describe theory and implementation of hybrid automata in the dedicated
> page hybrid_automata.rst
> Include a section on how to integrate a hybrid automaton in
> monitor_synthesis.rst
> Also remove a hanging $ in deterministic_automata.rst
>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
...
> diff --git a/Documentation/trace/rv/hybrid_automata.rst b/Documentation/trace/rv/hybrid_automata.rst
> new file mode 100644
> index 000000000000..39c037a71b89
> --- /dev/null
> +++ b/Documentation/trace/rv/hybrid_automata.rst
> @@ -0,0 +1,341 @@
> +Hybrid Automata
> +===============
...
> +Stall model with invariants (iteration 2)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The first iteration isn't exactly what was intended, we can change the model as:
> +
> +- *X* = { ``dequeued``, ``enqueued``, ``running``}
> +- *E* = { ``enqueue``, ``dequeue``, ``switch_in``}
> +- *V* = { ``clk`` }
> +- x\ :subscript:`0` = ``dequeue``
> +- X\ :subscript:`m` = {``dequeue``}
> +- *f* =
> + - *f*\ (``enqueued``, ``switch_in``) = ``running``
> + - *f*\ (``running``, ``dequeue``) = ``dequeued``
> + - *f*\ (``dequeued``, ``enqueue``, ``reset(clk)``) = ``enqueued``
^^^
> +- *i* =
> + - *i*\ (``enqueued``) = ``clk < threshold``
> +
> +Graphically::
> +
> + |
> + |
> + v
> + #=========================#
> + H dequeued H <+
> + #=========================# |
> + | |
> + | enqueue; reset(clk) |
> + v |
> + +-------------------------+ |
> + | enqueued | |
> + | clk < threshold | | dequeue
> + +-------------------------+ |
> + | |
> + | switch_in |
> + v |
> + +-------------------------+ |
> + | running | -+
> + +-------------------------+
...
> + static bool verify_constraint(enum states curr_state, enum events event,
> + enum states next_state)
> + {
> + bool res = true;
> +
> + /* Validate guards as part of f */
> + if (curr_state == enqueued && event == sched_switch_in)
> + res = get_env(clk) < threshold;
> + else if (curr_state == dequeued && event == sched_wakeup)
> + reset_env(clk);
Considering the spec above, does the 'event' need to be 'enqueue'
instead of 'sched_wakeup' (or the other way around)? Or maybe it's
equivalent?
Thanks,
Juri
* Re: [PATCH v4 2/5] mm: convert zone lock users to wrappers
From: Vlastimil Babka (SUSE) @ 2026-03-02 13:42 UTC
To: Dmitry Ilvokhin, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt
Cc: linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
"linux-cxl, SeongJae Park
In-Reply-To: <d26a43ebed2f0f1edb9cfe4fbed16dd31c7a069c.1772206930.git.d@ilvokhin.com>
On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> Replace direct zone lock acquire/release operations with the
> newly introduced wrappers.
>
> The changes are purely mechanical substitutions. No functional change
> intended. Locking semantics and ordering remain unchanged.
>
> The compaction path is left unchanged for now and will be
> handled separately in the following patch due to additional
> non-trivial modifications.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
* Re: [BUG] RCU stall / hung rcu_gp: process_srcu blocked in synchronize_rcu_normal triggered by perf trace teardown on 7.0.0-rc1
From: Sasha Levin @ 2026-03-02 13:36 UTC
To: Zw Tang, paulmck, peterz, mhiramat
Cc: Sasha Levin, jiangshanlai, mingo, acme, namhyung, rcu,
linux-perf-users, linux-trace-kernel, linux-kernel, rostedt,
mathieu.desnoyers, josh, bigeasy, ast, boqun.feng, mark.rutland
In-Reply-To: <CAPHJ_VLUpgBO7VfF4ih2oy2HDCxvxkHRkryFUjHAm8QTNdF6Sg@mail.gmail.com>
This response was AI-generated by bug-bot. The analysis may contain errors — please verify independently.
## Bug Summary
This is an RCU stall and hung task deadlock on 7.0.0-rc1, triggered by perf trace teardown under perf interrupt storm conditions. The perf subsystem's tracepoint unregistration path now blocks on SRCU (tracepoint_srcu), which in turn blocks on RCU grace period completion, creating a cascading stall when RCU progress is delayed by perf NMI interrupt storms. Severity: system hang (multiple tasks blocked >143s, eventual complete stall).
## Stack Trace Analysis
The bug involves three interacting blocked entities. Here are the decoded stack traces:
**1. repro2 (pid 4086) - blocked in perf trace teardown (close()):**
```
__x64_sys_close
fput_close_sync
__fput
perf_release
perf_event_release_kernel
put_event
__free_event
perf_trace_destroy
perf_trace_event_unreg [kernel/trace/trace_event_perf.c:154]
tracepoint_synchronize_unregister [include/linux/tracepoint.h:116]
synchronize_srcu(&tracepoint_srcu)
__synchronize_srcu
wait_for_completion ← BLOCKED
```
**2. kworker/0:0 (pid 9) and kworker/0:1 (pid 11) - SRCU grace period workers:**
```
Workqueue: rcu_gp process_srcu
process_srcu [kernel/rcu/srcutree.c:1304]
srcu_advance_state [kernel/rcu/srcutree.c:1161]
try_check_zero [kernel/rcu/srcutree.c:1171]
srcu_readers_active_idx_check [kernel/rcu/srcutree.c:544]
synchronize_rcu() ← SRCU-fast path, line 569
synchronize_rcu_normal
wait_for_completion ← BLOCKED
```
**3. repro2 (pid 4093) - RCU stall source:**
```
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P4093
task:repro2 state:R running task
(running in futex_wake syscall, interrupted by timer IRQ)
asm_sysvec_apic_timer_interrupt
irqentry_exit → preempt_schedule_irq → __schedule
finish_task_switch
```
The trace shows process context for the hung tasks and interrupt context (timer IRQ) for the RCU stall detection. The kworkers are in D (uninterruptible sleep) state, blocked in wait_for_completion() within the SRCU grace period state machine.
## Root Cause Analysis
This is a regression introduced by commit a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"), which switched tracepoint read-side protection from preempt_disable()+RCU to SRCU-fast via DEFINE_SRCU_FAST(tracepoint_srcu).
The root cause is a new coupling between SRCU grace period processing and RCU grace period completion that did not exist before. The deadlock chain is:
1. The reproducer creates perf events using tracepoints, then closes them while generating heavy perf interrupt load. The perf NMI interrupt storms ("perf: interrupt took too long" messages escalating from 69ms to 336ms) consume most CPU time, starving RCU quiescent state detection.
2. When the perf fd is closed, perf_trace_event_unreg() (kernel/trace/trace_event_perf.c:154) calls tracepoint_synchronize_unregister() (include/linux/tracepoint.h:116), which now calls synchronize_srcu(&tracepoint_srcu) instead of synchronize_rcu().
3. The SRCU grace period for tracepoint_srcu is processed by process_srcu() running in the rcu_gp workqueue. Because tracepoint_srcu is DEFINE_SRCU_FAST, its srcu_reader_flavor includes SRCU_READ_FLAVOR_FAST, which is part of SRCU_READ_FLAVOR_SLOWGP.
4. In srcu_readers_active_idx_check() (kernel/rcu/srcutree.c:544), when SRCU_READ_FLAVOR_SLOWGP is detected, the function calls synchronize_rcu() (line 569) instead of smp_mb() (line 301 in non-fast path). This is the key design tradeoff of SRCU-fast: faster readers (no smp_mb() on read side) at the cost of slower grace periods (synchronize_rcu() on update side).
5. synchronize_rcu() → synchronize_rcu_normal() → wait_for_completion(), waiting for an RCU grace period to complete. But the RCU grace period is stalled because the perf interrupt storms are preventing CPUs from passing through quiescent states quickly enough.
6. Since process_srcu is blocked waiting for synchronize_rcu(), the tracepoint_srcu SRCU grace period cannot advance, so synchronize_srcu(&tracepoint_srcu) in the perf teardown path also blocks indefinitely.
The pre-existing condition (perf NMI storms causing RCU stalls) was previously tolerable because the perf teardown path used synchronize_rcu() directly (via the old tracepoint_synchronize_unregister()), which would eventually complete once the RCU stall resolved. Now, with SRCU-fast, there is an additional layer of indirection: perf teardown waits on SRCU, SRCU processing waits on RCU, and both the SRCU workqueue threads and the perf teardown task are stuck.
## Affected Versions
This is a regression in v7.0-rc1. The bug was introduced by commit a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"), which was merged via the trace-v7.0 merge (3c6e577d5ae70). The underlying SRCU-fast infrastructure was added by commit c4020620528e4 ("srcu: Add SRCU-fast readers") and 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP to flag need for synchronize_rcu()"), but the regression became triggerable only when a46023d5616ed applied SRCU-fast to the tracepoint_srcu used in the perf event teardown path.
Kernels before v7.0-rc1 (i.e., v6.x and earlier) are not affected, as they used preempt_disable()+RCU for tracepoint protection, and tracepoint_synchronize_unregister() called synchronize_rcu() directly without SRCU involvement.
## Relevant Commits and Fixes
Key commits in the causal chain:
- a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") - the commit that introduced the regression by switching tracepoints to SRCU-fast
- a77cb6a867667 ("srcu: Fix warning to permit SRCU-fast readers in NMI handlers") - immediate predecessor fix
- c4020620528e4 ("srcu: Add SRCU-fast readers") - added the SRCU-fast reader API
- 4d86b1e7e1e98 ("srcu: Add SRCU_READ_FLAVOR_SLOWGP to flag need for synchronize_rcu()") - added the synchronize_rcu()-instead-of-smp_mb() logic in SRCU grace period processing
- 16718274ee75d ("tracing: perf: Have perf tracepoint callbacks always disable preemption") - preparatory commit for the SRCU-fast switch
No fix for this specific issue was found in mainline or in any -next branches as of today.
## Prior Discussions
No prior reports of this specific RCU stall / SRCU deadlock triggered via perf trace teardown with SRCU-fast were found on lore.kernel.org. The original SRCU-fast tracepoint series was posted at https://lore.kernel.org/all/20260126231256.499701982@kernel.org/ (linked from the commit message), motivated by enabling preemptible BPF on tracepoints for RT systems (https://lore.kernel.org/all/20250613152218.1924093-1-bigeasy@linutronix.de/). No discussion of the synchronize_rcu()-from-workqueue stall scenario appears to have taken place in those threads.
## Suggested Actions
1. Confirm the regression by testing with the parent commit a77cb6a867667 (immediately before a46023d5616ed). If the issue disappears, this confirms the SRCU-fast tracepoint switch as the cause.
2. As a quick workaround, reverting a46023d5616ed (and its preparatory commits a77cb6a867667, f7d327654b886, 16718274ee75d if needed) should eliminate the deadlock, at the cost of losing preemptible BPF tracepoint support.
3. The fundamental issue is that process_srcu() for SRCU-fast structures calls synchronize_rcu() synchronously from workqueue context. Possible fixes include:
- Using an asynchronous mechanism (e.g., call_rcu() with a callback to resume SRCU GP processing) instead of blocking synchronize_rcu() within the SRCU state machine.
- Having srcu_readers_active_idx_check() use poll_state_synchronize_rcu() and defer retrying instead of blocking.
- Bounding the perf interrupt rate escalation to prevent the RCU stall in the first place (though this would only mask the underlying SRCU↔RCU coupling issue).
4. If you can reproduce reliably, adding the following debug options would provide more information: CONFIG_RCU_TRACE=y, CONFIG_PROVE_RCU=y, and booting with rcutree.rcu_kick_kthreads=1 to see if kicking the RCU threads helps break the stall.
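The second option in item 3 (poll and defer instead of block) can be modelled in a few lines of userspace C. This is a toy sketch, not kernel code: `gp_completed`, `gp_get_cookie()`, `gp_poll_cookie()` and `check_step()` are hypothetical names that only loosely mirror the cookie-based get_state_synchronize_rcu()/poll_state_synchronize_rcu() pattern, and real SRCU grace-period processing has more state than shown here.

```c
#include <stdbool.h>

/* Toy grace-period counter standing in for RCU (hypothetical). */
static unsigned long gp_completed;

/* Cookie that is satisfied once one more grace period has completed. */
static unsigned long gp_get_cookie(void)
{
	return gp_completed + 1;
}

static bool gp_poll_cookie(unsigned long cookie)
{
	return gp_completed >= cookie;
}

struct check_state {
	bool waiting;		/* a cookie is outstanding */
	unsigned long cookie;
};

/*
 * One non-blocking step of the reader check. Returns true when the
 * grace period has elapsed and the caller may go on; false means
 * "requeue the work and try again later" instead of sleeping here.
 */
static bool check_step(struct check_state *cs)
{
	if (!cs->waiting) {
		cs->cookie = gp_get_cookie();
		cs->waiting = true;
		return false;
	}
	if (!gp_poll_cookie(cs->cookie))
		return false;

	cs->waiting = false;
	return true;
}
```

In this model the worker never sleeps waiting for the grace period: while it is stalled, every step returns false and the work item would simply be rescheduled.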
* Re: [PATCH v4 1/5] mm: introduce zone lock wrappers
From: Vlastimil Babka (SUSE) @ 2026-03-02 13:34 UTC
To: Dmitry Ilvokhin, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt
Cc: linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
"linux-cxl
In-Reply-To: <849dee9c47df1e6fba97c9933af0d5a08b8e15d3.1772206930.git.d@ilvokhin.com>
On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> Add thin wrappers around zone lock acquire/release operations. This
> prepares the code for future tracepoint instrumentation without
> modifying individual call sites.
>
> Centralizing zone lock operations behind wrappers allows future
> instrumentation or debugging hooks to be added without touching
> all users.
>
> No functional change intended. The wrappers are introduced in
> preparation for subsequent patches and are not yet used.
>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
*checks patch 2 diffstat*
I think we could do it as mm/zone_lock.h even and not pollute include/linux/
Even kernel/power/snapshot.c could include it in a somewhat ugly way.
However we should also later look at moving that particular code somewhere
under mm/ really...
Anyway,
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> MAINTAINERS | 1 +
> include/linux/mmzone_lock.h | 38 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 39 insertions(+)
> create mode 100644 include/linux/mmzone_lock.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 55af015174a5..947298ecb111 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16672,6 +16672,7 @@ F: include/linux/memory.h
> F: include/linux/mm.h
> F: include/linux/mm_*.h
> F: include/linux/mmzone.h
> +F: include/linux/mmzone_lock.h
> F: include/linux/mmdebug.h
> F: include/linux/mmu_notifier.h
> F: include/linux/pagewalk.h
> diff --git a/include/linux/mmzone_lock.h b/include/linux/mmzone_lock.h
> new file mode 100644
> index 000000000000..a1cfba8408d6
> --- /dev/null
> +++ b/include/linux/mmzone_lock.h
> @@ -0,0 +1,38 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MMZONE_LOCK_H
> +#define _LINUX_MMZONE_LOCK_H
> +
> +#include <linux/mmzone.h>
> +#include <linux/spinlock.h>
> +
> +static inline void zone_lock_init(struct zone *zone)
> +{
> + spin_lock_init(&zone->lock);
> +}
> +
> +#define zone_lock_irqsave(zone, flags) \
> +do { \
> + spin_lock_irqsave(&(zone)->lock, flags); \
> +} while (0)
> +
> +#define zone_trylock_irqsave(zone, flags) \
> +({ \
> + spin_trylock_irqsave(&(zone)->lock, flags); \
> +})
> +
> +static inline void zone_unlock_irqrestore(struct zone *zone, unsigned long flags)
> +{
> + spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static inline void zone_lock_irq(struct zone *zone)
> +{
> + spin_lock_irq(&zone->lock);
> +}
> +
> +static inline void zone_unlock_irq(struct zone *zone)
> +{
> + spin_unlock_irq(&zone->lock);
> +}
> +
> +#endif /* _LINUX_MMZONE_LOCK_H */
* Re: [PATCH net-next v2 07/10] devlink: allow devlink instance allocation without a backing device
From: Jiri Pirko @ 2026-03-02 13:15 UTC
To: Jakub Kicinski
Cc: netdev, davem, edumazet, pabeni, horms, donald.hunter, corbet,
skhan, saeedm, leon, tariqt, mbloch, przemyslaw.kitszel, mschmidt,
andrew+netdev, rostedt, mhiramat, mathieu.desnoyers, chuck.lever,
matttbe, cjubran, daniel.zahka, linux-doc, linux-rdma,
linux-trace-kernel
In-Reply-To: <20260228150138.14e35ee7@kernel.org>
Sun, Mar 01, 2026 at 12:01:38AM +0100, kuba@kernel.org wrote:
>On Wed, 25 Feb 2026 14:34:19 +0100 Jiri Pirko wrote:
>> - dev_warn(port->devlink->dev, "Type was not set for devlink port.");
>> + if (port->devlink->dev)
>> + dev_warn(port->devlink->dev,
>> + "Type was not set for devlink port.");
>
>since I'm already nit-picking - maybe we should have a helper for this
>case and pr_warn() the message if dev is NULL?
Okay
* [PATCH v2] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar @ 2026-03-02 13:13 UTC
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, John Kacur, Luis Goncalves, Crystal Wood,
Costa Shulyupin, Wander Lairson Costa, LKML, linux-trace-kernel,
Tomas Glozar
Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
corresponding setting osnoise/timerlat_align_us.
This option sets the alignment of wakeup times between different
timerlat threads, similarly to cyclictest's -A/--aligned option. If
TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
records its first wake-up time. Each following thread sets its first
wake-up time to a fixed offset from the recorded time, and increments
it by the same offset.
Example:
osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
set to 20. There are four threads, on CPUs 1 to 4.
- CPU 4 enters first cycle first. The current time is 20000us, so
the wake-up of the first cycle is set to 21000us. This time is recorded.
- CPU 2 enters first cycle next. It reads the recorded time, increments
it to 21020us, and uses this value as its own wake-up time for the first
cycle.
- CPU 3 enters first cycle next. It reads the recorded time, increments
it to 21040us, and uses the value as its own wake-up time.
- CPU 1 proceeds analogously, ending up with 21060us.
In each next cycle, the wake-up time (called "absolute period" in
timerlat code) is incremented by the (relative) period of 1000us. Thus,
the wake-ups in the following cycles (provided the times are reached and
not in the past) will be as follows:
CPU 1      CPU 2      CPU 3      CPU 4
21060us    21020us    21040us    21000us
22060us    22020us    22040us    22000us
...        ...        ...        ...
Even if any cycle is skipped due to e.g. the first cycle calculation
happening later, the alignment stays in place.
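The first-cycle arithmetic of the example can be reproduced in userspace with
C11 atomics. This is only a sketch of the scheme described above, not the
kernel code: first_wakeup_ns() and its parameter names are made up for
illustration, and atomic_compare_exchange_strong()/atomic_fetch_add() stand in
for the kernel's atomic64 helpers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared slot; 0 means "no thread has recorded a wake-up time yet". */
static _Atomic uint64_t align_next;

/*
 * Return the first wake-up time (ns) for a thread whose unaligned first
 * wake-up would be own_abs_ns. The first caller records its own time;
 * every later caller advances the recorded time by align_us and uses
 * the result. (Hypothetical helper, userspace model only.)
 */
static uint64_t first_wakeup_ns(uint64_t own_abs_ns, uint64_t align_us)
{
	uint64_t expected = 0;
	uint64_t step_ns = align_us * 1000;

	/* Succeeds only for the first thread to get here. */
	if (atomic_compare_exchange_strong(&align_next, &expected, own_abs_ns))
		return own_abs_ns;

	/* atomic_fetch_add() returns the old value, so add the step again. */
	return atomic_fetch_add(&align_next, step_ns) + step_ns;
}
```

With an alignment of 20us and a recorded first wake-up of 21000us, four
successive calls yield 21000us, 21020us, 21040us and 21060us, regardless of
the unaligned time each later caller passes in.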
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
v1 + discussion: https://lore.kernel.org/linux-trace-kernel/20260227150420.319528-1-tglozar@redhat.com/T/#u
v2:
- Make align_next global and reset it to 0 in osnoise_workload_start()
so that it gets set by the first thread of each measurement and is not stuck
on what is set by the first measurement until reboot.
- Use atomic64_add_return_relaxed() in place of atomic64_fetch_add_relaxed()
to make the code shorter and easier to read.
- Add more detailed comments to the alignment synchronization logic.
- Fix two typos in the commit message: 50 -> 20 in the example introduction,
and incremenets -> increments.
I tested v2 with the same command I used for v1. I also added debug printk()
calls and verified that the logic is implemented correctly:
[ 120.273901] timerlat: thread 2 setting align_next to 119896370619
[ 120.273977] timerlat: aligning thread 1 to 119896370619
[ 120.274385] timerlat: aligning thread 3 to 119896370619
[ 120.274476] timerlat: aligning thread 4 to 119896370619
[ 142.457440] timerlat: thread 1 setting align_next to 142080851122
[ 142.457529] timerlat: aligning thread 2 to 142080871122
[ 142.457629] timerlat: aligning thread 3 to 142080891122
[ 142.458033] timerlat: aligning thread 4 to 142080911122
kernel/trace/trace_osnoise.c | 52 +++++++++++++++++++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index dee610e465b9..1cde1da57f97 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -58,6 +58,7 @@ enum osnoise_options_index {
OSN_PANIC_ON_STOP,
OSN_PREEMPT_DISABLE,
OSN_IRQ_DISABLE,
+ OSN_TIMERLAT_ALIGN,
OSN_MAX
};
@@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
"OSNOISE_WORKLOAD",
"PANIC_ON_STOP",
"OSNOISE_PREEMPT_DISABLE",
- "OSNOISE_IRQ_DISABLE" };
+ "OSNOISE_IRQ_DISABLE",
+ "TIMERLAT_ALIGN" };
#define OSN_DEFAULT_OPTIONS 0x2
static unsigned long osnoise_options = OSN_DEFAULT_OPTIONS;
@@ -326,6 +328,7 @@ static struct osnoise_data {
u64 stop_tracing_total; /* stop trace in the final operation (report/thread) */
#ifdef CONFIG_TIMERLAT_TRACER
u64 timerlat_period; /* timerlat period */
+ u64 timerlat_align_us; /* timerlat alignment */
u64 print_stack; /* print IRQ stack if total > */
int timerlat_tracer; /* timerlat tracer */
#endif
@@ -338,6 +341,7 @@ static struct osnoise_data {
#ifdef CONFIG_TIMERLAT_TRACER
.print_stack = 0,
.timerlat_period = DEFAULT_TIMERLAT_PERIOD,
+ .timerlat_align_us = 0,
.timerlat_tracer = 0,
#endif
};
@@ -1813,6 +1817,11 @@ static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
+/*
+ * timerlat wake-up offset for next thread with TIMERLAT_ALIGN set.
+ */
+static atomic64_t align_next;
+
/*
* wait_next_period - Wait for the next period for timerlat
*/
@@ -1829,6 +1838,26 @@ static int wait_next_period(struct timerlat_variables *tlat)
*/
tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
+ /*
+ * Align thread in the first cycle on each CPU to the set alignment
+ * if TIMERLAT_ALIGN is set.
+ *
+ * This is done by using an atomic64_t to store the next absolute period.
+ * The first thread that wakes up will set the atomic64_t to its
+ * absolute period, and the other threads will increment it by
+ * the alignment value.
+ */
+ if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
+ && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
+ /*
+ * A thread has already set align_next, use it and increment it
+ * to be used by the next thread that wakes up after this one.
+ */
+ tlat->abs_period = atomic64_add_return_relaxed(
+ osnoise_data.timerlat_align_us * 1000, &align_next);
+ next_abs_period = ns_to_ktime(tlat->abs_period);
+ }
+
/*
* If the new abs_period is in the past, skip the activation.
*/
@@ -2650,6 +2679,17 @@ static struct trace_min_max_param timerlat_period = {
.min = &timerlat_min_period,
};
+/*
+ * osnoise/timerlat_align_us: align the first wakeup of all timerlat
+ * threads to a common boundary (in us). 0 means disabled.
+ */
+static struct trace_min_max_param timerlat_align_us = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.timerlat_align_us,
+ .max = NULL,
+ .min = NULL,
+};
+
static const struct file_operations timerlat_fd_fops = {
.open = timerlat_fd_open,
.read = timerlat_fd_read,
@@ -2746,6 +2786,11 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
if (!tmp)
return -ENOMEM;
+ tmp = tracefs_create_file("timerlat_align_us", TRACE_MODE_WRITE, top_dir,
+ &timerlat_align_us, &trace_min_max_fops);
+ if (!tmp)
+ return -ENOMEM;
+
retval = osnoise_create_cpu_timerlat_fd(top_dir);
if (retval)
return retval;
@@ -2877,6 +2922,11 @@ static int osnoise_workload_start(void)
return 0;
osn_var_reset_all();
+ /*
+ * Reset also align_next, to be filled by a new offset by the first timerlat
+ * thread that wakes up, if TIMERLAT_ALIGN is set.
+ */
+ atomic64_set(&align_next, 0);
retval = osnoise_hook_events();
if (retval)
--
2.53.0
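
For readers following the alignment logic in wait_next_period(), here is a
minimal user-space sketch using C11 atomics in place of the kernel's
atomic64_t helpers. The names first_wakeup_ns() and align_reset() are
hypothetical and exist only for illustration; the sketch assumes the same
semantics as the hunk above (the first thread publishes its absolute period,
later threads take the stored value plus the alignment).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared offset; 0 means "no thread has published a period yet". */
static _Atomic uint64_t align_next;

/* Hypothetical helper: mirrors the reset done when the workload starts. */
static void align_reset(void)
{
	atomic_store_explicit(&align_next, 0, memory_order_relaxed);
}

/*
 * Hypothetical helper: returns the absolute period (in ns) a thread should
 * use for its first wakeup, given its own computed period and the alignment
 * in microseconds.
 */
static uint64_t first_wakeup_ns(uint64_t abs_period_ns, uint64_t align_us)
{
	uint64_t expected = 0;
	uint64_t step_ns = align_us * 1000;

	/* First thread: publish our own period and keep it unchanged. */
	if (atomic_compare_exchange_strong_explicit(&align_next, &expected,
						    abs_period_ns,
						    memory_order_relaxed,
						    memory_order_relaxed))
		return abs_period_ns;

	/* Later threads: advance the shared value and use the new one. */
	return atomic_fetch_add_explicit(&align_next, step_ns,
					 memory_order_relaxed) + step_ns;
}
```

With a 20 us alignment, the first caller keeps its own period and each later
caller lands 20000 ns after the previous one, regardless of the period it
computed itself.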
^ permalink raw reply related
* Re: [RFC PATCH bpf-next v3 0/3] Optimize kprobe.session attachment for exact function names
From: Jakub Sitnicki @ 2026-03-02 12:47 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: bpf, linux-open-source, ast, daniel, andrii, jolsa, rostedt,
linux-trace-kernel, kernel-team
In-Reply-To: <20260227204052.725813-1-andrey.grodzovsky@crowdstrike.com>
On Fri, Feb 27, 2026 at 03:40 PM -05, Andrey Grodzovsky wrote:
> - Patch 1: libbpf detects exact function names (no wildcards) in
> bpf_program__attach_kprobe_multi_opts() and bypasses kallsyms parsing,
> passing the symbol directly to the kernel via syms[] array.
> ESRCH is normalized to ENOENT for API consistency.
FWIW, Ivan was also trying to make it faster from the kernel side:
https://lore.kernel.org/bpf/20260129-ivan-bpf-ksym-cache-v1-1-ca503070dcc0@cloudflare.com/
But, IIUC, with this change we're not hitting get_ksymbol_bpf at all.
^ permalink raw reply
* Re: [PATCH] tracing: Fix WARN_ON in tracing_buffers_mmap_close
From: Lorenzo Stoakes @ 2026-03-02 12:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Vincent Donnefort, Qing Wang, Masami Hiramatsu, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel, syzbot+3b5dd2030fe08afdf65d,
linux-mm, Andrew Morton, Vlastimil Babka, David Hildenbrand
In-Reply-To: <20260227155601.18ebd3ca@gandalf.local.home>
+cc David.
On Fri, Feb 27, 2026 at 03:56:01PM -0500, Steven Rostedt wrote:
> On Fri, 27 Feb 2026 10:20:38 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Fri, 27 Feb 2026 11:22:22 +0000
> > Vincent Donnefort <vdonnefort@google.com> wrote:
> >
> > > > Ah right, Syzkaller is using madvise(MADV_DOFORK) which resets VM_DONTCOPY.
> > >
> > > As we are applying restrictive rules for this mapping, I believe setting VM_IO
> > > might be a better fix.
> >
> > Agreed.
> >
>
> Adding MM folks so we do this right.
>
> Dear MM folks,
>
> Here's the issue. When the ftrace ring buffer is memory mapped to user
> space, we do not want anything "special" done to it. One of those things we
> did not want done was to have it copied on fork. To do that, we added
> VM_DONTCOPY, but we didn't know that an madvise() could disable that. It
> looks like VM_IO will prevent that from happening.
>
> But looking at the various flags, I see there's a VM_SPECIAL. I'm wondering
> if that is what we should use?
VM_SPECIAL is not a VMA flag, it's a bitmask of all the flags which cause us not
to permit things like splitting/merging of VMAs (because we can't safely do
them), i.e. that are one or more of:
VM_IO - Memory-mapped I/O range.
VM_PFNMAP - A mapping without struct folio's/page's backing them, e.g. perhaps a
raw kernel mapping.
VM_MIXEDMAP - A combination of page/folio-backed memory and/or PFN-backed memory.
VM_DONTEXPAND - Disallow expansion of memory in mremap().
You already set VM_DONTEXPAND so you get these semantics already.
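
As a rough illustration of the point above, the mask test can be sketched in
user-space C. The bit values below are illustrative assumptions (the
authoritative definitions live in include/linux/mm.h), and
sketch_vma_is_special() is a hypothetical name, not a kernel function.

```c
#include <stdbool.h>

/* Illustrative bit values only; see include/linux/mm.h for the real ones. */
#define VM_PFNMAP	0x00000400UL
#define VM_IO		0x00004000UL
#define VM_DONTCOPY	0x00020000UL
#define VM_DONTEXPAND	0x00040000UL
#define VM_MIXEDMAP	0x10000000UL

/* A mask of flags, not a flag of its own. */
#define VM_SPECIAL	(VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

/* Hypothetical helper: true if any of the "special" flags is set. */
static bool sketch_vma_is_special(unsigned long vm_flags)
{
	return (vm_flags & VM_SPECIAL) != 0;
}
```

A mapping with VM_DONTEXPAND | VM_DONTCOPY already matches the mask, while
VM_DONTCOPY alone does not.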
Setting VM_IO just to trigger a failure case in madvise() feels like a hack? I
guess it'd do the trick though, but you're not going to be able to reclaim that
memory, and you might get some unexpected behaviour in code paths that assume
VM_IO means it's memory-mapped I/O... (for instance GUP will stop working, if
you need that).
I'd take a step back and wonder why you are wanting to not allow copying on
fork? Is this kernel-allocated memory? In which case you should set VM_MIXEDMAP
or VM_PFNMAP as appropriate... If not and it has a folio etc. then it seems like
strange semantics.
Are you really bothered also by users doing strange things? Maybe the solution
is to tolerate a fork-copy even if it's broken? I presume something straight up
breaks right now?
Without more context, which I don't really have much time to acquire, it's hard
to know what to advise.
>
> The affected code is here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/ring_buffer.c#n7172
>
> What are your thoughts?
>
> Thanks,
>
> -- Steve
Cheers, Lorenzo
^ permalink raw reply
* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: Lorenzo Stoakes @ 2026-03-02 11:48 UTC (permalink / raw)
To: Andre Ramos
Cc: akpm, hannes, linux-mm, linux-kernel, linux-trace-kernel, david,
rostedt
In-Reply-To: <CALXtAv3u1hgLkBEbEgR3=r_iz3=KrnHB8B-=tg8Q3CEOWAPFiA@mail.gmail.com>
NAK.
This is a super questionable conceptual idea that isn't appropriate to submit as
a non-RFC patch.
I'm with David on this.
On Mon, Mar 02, 2026 at 12:45:33AM -0300, Andre Ramos wrote:
> Introduce /dev/ampress, a bidirectional fd-based interface for
> cooperative memory reclaim between the kernel and userspace.
There's just absolutely no way we'd expose anything like this as a character
device.
>
> Userspace processes open /dev/ampress and block on read() to receive
> struct ampress_event notifications carrying a graduated urgency level
> (LOW/MEDIUM/HIGH/FATAL), the NUMA node of the pressure source, and a
> suggested reclaim target in KiB. After freeing memory the process
> issues AMPRESS_IOC_ACK to close the feedback loop.
This is really not how we want to expose kernel interfaces. This seems like a
hack you'd implement internally rather than something we'd consider having in
mainline.
You're also inserting some new lock acquisitions and a linked list waking up
some unlimited number of threads on a core reclaim path - no.
>
> The feature hooks into balance_pgdat() in mm/vmscan.c, mapping the
> kswapd scan priority to urgency bands:
> priority 10-12 -> LOW
> priority 7-9 -> MEDIUM
> priority 4-6 -> HIGH
> priority 1-3 -> FATAL
>
> ampress_notify() is IRQ-safe (read_lock_irqsave + spin_lock_irqsave,
> no allocations) so it can be called from any reclaim context.
> Per-subscriber events overwrite without queuing to prevent unbounded
> backlog. A debugfs trigger at /sys/kernel/debug/ampress/inject allows
> testing without real memory pressure.
This is far too little description, especially given you're submitting
everything as one patch (which is not how kernel development is done).
The patch doesn't deal with MGLRU, and feels like a 'let's hook into one
specific part of mm and just dump out information to a random place'.
You could reasonably obtain the same information from BPF, no?
>
> New files:
> include/uapi/linux/ampress.h - UAPI structs and ioctl definitions
> include/linux/ampress.h - internal header and ampress_notify()
> include/trace/events/ampress.h - tracepoints for notify and ack
> mm/ampress.c - miscdevice driver and core logic
> mm/ampress_test.c - KUnit tests (3/3 passing)
> tools/testing/ampress/ - userspace integration and stress tests
This doesn't belong in a commit message.
>
> Signed-off-by: André Castro Ramos <acastroramos1987@gmail.com>
> ---
> MAINTAINERS | 11 +
> include/linux/ampress.h | 34 +++
> include/trace/events/ampress.h | 70 ++++++
> include/uapi/linux/ampress.h | 40 ++++
> mm/Kconfig | 26 ++
> mm/Makefile | 2 +
> mm/ampress.c | 320 +++++++++++++++++++++++++
> mm/ampress_test.c | 124 ++++++++++
> mm/vmscan.c | 27 +++
> tools/testing/ampress/.gitignore | 2 +
> tools/testing/ampress/Makefile | 21 ++
> tools/testing/ampress/ampress_stress.c | 199 +++++++++++++++
> tools/testing/ampress/ampress_test.c | 212 ++++++++++++++++
> 13 files changed, 1088 insertions(+)
This is not how you submit patches; this needed to be broken up into a series.
Submitting a single patch changing 13 files and adding 1,088 lines isn't how
kernel development works.
> create mode 100644 include/linux/ampress.h
> create mode 100644 include/trace/events/ampress.h
> create mode 100644 include/uapi/linux/ampress.h
> create mode 100644 mm/ampress.c
> create mode 100644 mm/ampress_test.c
> create mode 100644 tools/testing/ampress/.gitignore
> create mode 100644 tools/testing/ampress/Makefile
> create mode 100644 tools/testing/ampress/ampress_stress.c
> create mode 100644 tools/testing/ampress/ampress_test.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 61bf550fd37..ea4d7861ff9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16629,6 +16629,17 @@ F: mm/memremap.c
> F: mm/memory_hotplug.c
> F: tools/testing/selftests/memory-hotplug/
>
> +ADAPTIVE MEMORY PRESSURE SIGNALING (AMPRESS)
> +M: Darabat <playbadly1@gmail.com>
> +L: linux-mm@kvack.org
> +S: Maintained
> +F: include/linux/ampress.h
> +F: include/trace/events/ampress.h
> +F: include/uapi/linux/ampress.h
> +F: mm/ampress.c
> +F: mm/ampress_test.c
> +F: tools/testing/ampress/
As David said, it's really not proper to add yourself as a maintainer without a
track record in the kernel and community trust.
Maintainership is a serious responsibility and really requires that you have
both demonstrated consistent technical understanding and an ability to work with
the community.
Obviously, as a new contributor, neither has been demonstrated.
Also there's an existing convention of 'MEMORY MANAGEMENT - xxx' for mm entries
in MAINTAINERS.
Thanks, Lorenzo
^ permalink raw reply
* [BUG] RCU stall / hung rcu_gp: process_srcu blocked in synchronize_rcu_normal triggered by perf trace teardown on 7.0.0-rc1
From: Zw Tang @ 2026-03-02 11:26 UTC (permalink / raw)
To: paulmck, peterz, mhiramat
Cc: jiangshanlai, mingo, acme, namhyung, rcu, linux-perf-users,
linux-trace-kernel, linux-kernel
Hi,
I am reporting an RCU stall / hung task issue triggered by a syzkaller
reproducer on Linux 7.0.0-rc1.
The system gets stuck with RCU stalls and multiple hung tasks. The
rcu_gp workqueue shows in-flight “process_srcu”, and kworkers running
“process_srcu” are blocked for >143s. At the same time, a repro task
blocks during perf event teardown (close()), waiting on SRCU.
Key log excerpts:
“Showing busy workqueues and worker pools: workqueue rcu_gp ...
in-flight: process_srcu”
“rcu: INFO: rcu_preempt detected stalls on CPUs/tasks ... Tasks
blocked on level-0 rcu_node ...”
hung tasks:
kworker/... Workqueue: rcu_gp process_srcu
process_srcu -> try_check_zero -> synchronize_rcu_normal -> wait_for_completion
repro2 task blocked in close():
__x64_sys_close -> __fput -> perf_release -> perf_event_release_kernel
-> __free_event
-> perf_trace_destroy -> perf_trace_event_unreg -> __synchronize_srcu
-> wait_for_completion
My understanding / suspected wait chain:
1. The repro task closes a perf fd and enters perf trace teardown:
   perf_trace_event_unreg() waits for SRCU via __synchronize_srcu().
2. rcu_gp workers running process_srcu attempt to advance SRCU and end up
   calling synchronize_rcu_normal(), waiting for a normal RCU grace period
   to complete.
3. The RCU grace period does not complete (RCU stall reported on CPUs/tasks),
   causing __synchronize_srcu() to wait indefinitely, process_srcu to remain
   blocked, and the system to stall.
This looks like a circular wait involving perf trace teardown + SRCU +
RCU GP progress (possibly aggravated by timing/RT scheduling).
Reproducer:
C reproducer: https://pastebin.com/raw/L9hDPCrP
kernel config: https://pastebin.com/raw/jCq8qdq7
console output : https://pastebin.com/raw/D8XETkeH
Kernel:
git tree: torvalds/linux
commit: 4d349ee5c7782f8b27f6cb550f112c5e26fff38d
kernel version: 7.0.0-rc1
hardware: QEMU Ubuntu 24.10
[ 170.750583] Showing busy workqueues and worker pools:
[ 170.750595] workqueue rcu_gp: flags=0x108
[ 170.750609] pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 170.750635] in-flight: 9:process_srcu ,11:process_srcu
[ 170.750723] pool 2: cpus=0 node=0 flags=0x0 nice=0 hung=0s
workers=4 idle: 220
[ 173.609623] perf: interrupt took too long (69138 > 69021), lowering
kernel.pe0
[ 173.918434] perf: interrupt took too long (86584 > 86422), lowering
kernel.pe0
[ 174.394436] perf: interrupt took too long (109613 > 108230),
lowering kernel.0
[ 174.633565] perf: interrupt took too long (137226 > 137016),
lowering kernel.0
[ 174.652442] perf: interrupt took too long (171805 > 171532),
lowering kernel.0
[ 174.679472] perf: interrupt took too long (214868 > 214756),
lowering kernel.0
[ 174.728383] perf: interrupt took too long (268722 > 268585),
lowering kernel.0
[ 174.882802] perf: interrupt took too long (336002 > 335902),
lowering kernel.0
[ 271.443907] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 271.443951] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P4093
[ 271.443972] rcu: (detected by 1, t=105002 jiffies, g=33573,
q=4342 ncpus=2)
[ 271.443986] task:repro2 state:R running task
stack:28792 pid:4093 tgid:4092 ppid:300 task_flags:0x400140
flags:0x00080012
[ 271.444042] Call Trace:
[ 271.444051] <IRQ>
[ 271.444058] sched_show_task+0x3b5/0x650
[ 271.444088] ? trace_event_raw_event_sched_pi_setprio+0x440/0x440
[ 271.444121] ? rcu_dump_cpu_stacks+0x337/0x4b0
[ 271.444146] ? write_comp_data+0x1f/0x70
[ 271.444164] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.444182] ? wq_watchdog_touch+0xec/0x170
[ 271.444198] rcu_sched_clock_irq+0x297b/0x31c0
[ 271.444221] ? tmigr_requires_handle_remote_up+0x143/0x1c0
[ 271.444245] ? rcu_momentary_eqs+0x40/0x40
[ 271.444261] ? __tmigr_cpu_deactivate+0x1a0/0x1a0
[ 271.444283] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.444301] ? tmigr_requires_handle_remote+0x1cd/0x2a0
[ 271.444324] ? tmigr_handle_remote+0x320/0x320
[ 271.444348] ? write_comp_data+0x1f/0x70
[ 271.444366] ? write_comp_data+0x1f/0x70
[ 271.444383] ? __cgroup_account_cputime_field+0xb9/0x160
[ 271.444406] ? write_comp_data+0x1f/0x70
[ 271.444424] ? hrtimer_run_queues+0x64/0x450
[ 271.444447] update_process_times+0xfa/0x200
[ 271.444469] tick_nohz_handler+0x504/0x720
[ 271.444487] ? find_held_lock+0x2b/0x80
[ 271.444505] ? tick_do_update_jiffies64+0x380/0x380
[ 271.444523] ? lock_release+0xc9/0x2a0
[ 271.444548] __hrtimer_run_queues+0x771/0xb20
[ 271.444568] ? tick_do_update_jiffies64+0x380/0x380
[ 271.444589] ? write_comp_data+0x1f/0x70
[ 271.444607] ? enqueue_hrtimer+0x360/0x360
[ 271.444626] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.444649] hrtimer_interrupt+0x36e/0x820
[ 271.444676] __sysvec_apic_timer_interrupt+0xb5/0x3b0
[ 271.444695] sysvec_apic_timer_interrupt+0x6b/0x80
[ 271.444720] </IRQ>
[ 271.444724] <TASK>
[ 271.444730] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 271.444748] RIP: 0010:finish_task_switch+0x126/0x5c0
[ 271.444771] Code: f6 4d 8d 7d 48 e8 8a 45 0a 00 31 f6 4c 89 ef e8
b0 fb ff ff 4c 89 ff e8 88 db b9 02 e8 93 c6 34 00 fb 65 4c 8b 3d 5a
34 cb 04 <49> 8d bf c8 14 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa
48 c1
[ 271.444786] RSP: 0018:ffff888009db79b0 EFLAGS: 00000202
[ 271.444798] RAX: 00000000001a071f RBX: ffff88800d6b4c80 RCX: 0000000000000006
[ 271.444808] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81467bbd
[ 271.444817] RBP: ffff888009db79f8 R08: 0000000000000001 R09: 0000000000000001
[ 271.444826] R10: fffffbfff0ae32ea R11: ffffffff85719757 R12: ffff8880073f9a80
[ 271.444836] R13: ffff88806d338fc0 R14: 0000000000000001 R15: ffff888009d19a80
[ 271.444852] ? finish_task_switch+0x11d/0x5c0
[ 271.444876] ? finish_task_switch+0xe0/0x5c0
[ 271.444897] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.444914] ? __switch_to+0x854/0x1270
[ 271.444936] __schedule+0x1198/0x4190
[ 271.444957] ? io_schedule_timeout+0x80/0x80
[ 271.444974] ? mark_held_locks+0x49/0x70
[ 271.444998] preempt_schedule_irq+0x4e/0x90
[ 271.445015] irqentry_exit+0x17b/0x6c0
[ 271.445037] ? irqentry_enter+0x2a/0xd0
[ 271.445058] ? trace_hardirqs_off_finish+0x12f/0x160
[ 271.445080] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 271.445096] RIP: 0010:__sanitizer_cov_trace_pc+0xd/0x40
[ 271.445127] Code: 00 00 ff 00 09 c2 75 07 4c 8b 81 e0 14 00 00 4c
89 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa bf 02 00 00 00 4c
8b 0c 24 <65> 48 8b 35 9b a7 9e 04 e8 46 ff ff ff 84 c0 74 20 48 8b 96
d0 14
[ 271.445141] RSP: 0018:ffff888009db7c28 EFLAGS: 00000246
[ 271.445152] RAX: dffffc0000000000 RBX: ffff88800df47b58 RCX: 0000000000000000
[ 271.445161] RDX: 1ffff11001be8f71 RSI: ffff888009d19a80 RDI: 0000000000000002
[ 271.445171] RBP: ffff888009d18000 R08: 0000000000000000 R09: ffffffff8167adf7
[ 271.445180] R10: ffffed1001d87110 R11: ffff88800ec38883 R12: 0000000000000001
[ 271.445190] R13: ffff888009db7cd0 R14: ffff888009d18028 R15: 00000000000000ec
[ 271.445203] ? __futex_wake_mark+0xb7/0xe0
[ 271.445222] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.445240] __futex_wake_mark+0xb7/0xe0
[ 271.445255] futex_wake_mark+0xa4/0x190
[ 271.445272] futex_wake+0x441/0x540
[ 271.445289] ? futex_wake_mark+0x190/0x190
[ 271.445306] ? percpu_counter_add_batch+0x11b/0x260
[ 271.445329] ? write_comp_data+0x1f/0x70
[ 271.445348] do_futex+0x26b/0x360
[ 271.445363] ? __ia32_sys_get_robust_list+0x140/0x140
[ 271.445380] ? lock_is_held_type+0x9b/0x110
[ 271.445404] __x64_sys_futex+0x1c9/0x480
[ 271.445420] ? do_futex+0x360/0x360
[ 271.445433] ? write_comp_data+0x1f/0x70
[ 271.445450] ? fd_install+0x1ec/0x4e0
[ 271.445471] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 271.445488] ? __sys_socket+0x4b/0x130
[ 271.445508] do_syscall_64+0x115/0x690
[ 271.445529] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 271.445544] RIP: 0033:0x7fdfffbe4fc9
[ 271.445568] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 8e 0d 00 f7 d8 64 89
01 48
[ 271.445582] RSP: 002b:00007fdfffaebde8 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[ 271.445595] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fdfffbe4fc9
[ 271.445605] RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 000055b24d1850ec
[ 271.445614] RBP: 00007fdfffaebe00 R08: 0000000000000000 R09: 0000000000000000
[ 271.445623] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffb7d2901e
[ 271.445632] R13: 00007fffb7d2901f R14: 00007fdfffaebf00 R15: 0000000000022000
[ 271.445650] </TASK>
[ 318.431885] INFO: task kworker/0:0:9 blocked for more than 143 seconds.
[ 318.431913] Not tainted 7.0.0-rc1-00301-g4d349ee5c778 #1
[ 318.431921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 318.431928] task:kworker/0:0 state:D stack:27912 pid:9
tgid:9 ppid:2 task_flags:0x4208060 flags:0x00080000
[ 318.431983] Workqueue: rcu_gp process_srcu
[ 318.432032] Call Trace:
[ 318.432038] <TASK>
[ 318.432049] __schedule+0x1190/0x4190
[ 318.432078] ? io_schedule_timeout+0x80/0x80
[ 318.432095] ? lock_release+0xc9/0x2a0
[ 318.432119] schedule+0xd1/0x260
[ 318.432134] schedule_timeout+0x240/0x280
[ 318.432157] ? hrtimer_nanosleep_restart+0x340/0x340
[ 318.432182] ? mark_held_locks+0x49/0x70
[ 318.432203] ? lockdep_hardirqs_on_prepare+0xd7/0x170
[ 318.432224] ? _raw_spin_unlock_irq+0x23/0x40
[ 318.432245] ? trace_hardirqs_on+0x18/0x170
[ 318.432267] wait_for_completion+0x169/0x320
[ 318.432286] ? wait_for_completion_killable+0x410/0x410
[ 318.432304] ? mark_held_locks+0x49/0x70
[ 318.432324] ? lockdep_hardirqs_on_prepare+0xd7/0x170
[ 318.432348] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.432372] ? trace_hardirqs_on+0x18/0x170
[ 318.432390] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.432414] synchronize_rcu_normal+0x208/0x5f0
[ 318.432433] ? start_poll_synchronize_rcu_full+0x90/0x90
[ 318.432453] ? do_raw_spin_lock+0x123/0x290
[ 318.432481] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.432504] ? rt_mutex_slowunlock+0x824/0xaf0
[ 318.432526] ? lock_is_held_type+0x9b/0x110
[ 318.432555] try_check_zero+0x429/0x630
[ 318.432583] process_srcu+0x4c1/0x16c0
[ 318.432607] ? lock_acquire+0x187/0x2e0
[ 318.432626] ? process_scheduled_works+0x4ca/0x1ac0
[ 318.432650] ? lock_release+0xc9/0x2a0
[ 318.432674] process_scheduled_works+0x553/0x1ac0
[ 318.432704] ? insert_work+0x190/0x190
[ 318.432723] ? write_comp_data+0x1f/0x70
[ 318.432741] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.432761] ? lock_is_held_type+0x9b/0x110
[ 318.432784] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.432805] worker_thread+0x5a9/0xd10
[ 318.432830] ? bh_worker+0x740/0x740
[ 318.435937] kthread+0x3f9/0x530
[ 318.435991] ? kthread_affine_node+0x2a0/0x2a0
[ 318.436060] ret_from_fork+0x666/0xab0
[ 318.436098] ? native_tss_update_io_bitmap+0x6c0/0x6c0
[ 318.436139] ? write_comp_data+0x1f/0x70
[ 318.436175] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.436213] ? __switch_to+0x854/0x1270
[ 318.436255] ? kthread_affine_node+0x2a0/0x2a0
[ 318.436301] ret_from_fork_asm+0x11/0x20
[ 318.436360] </TASK>
[ 318.436373] INFO: task kworker/0:1:11 blocked for more than 143 seconds.
[ 318.436393] Not tainted 7.0.0-rc1-00301-g4d349ee5c778 #1
[ 318.436411] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 318.436422] task:kworker/0:1 state:D stack:26592 pid:11
tgid:11 ppid:2 task_flags:0x4208060 flags:0x00080000
[ 318.436528] Workqueue: rcu_gp process_srcu
[ 318.436571] Call Trace:
[ 318.436580] <TASK>
[ 318.436595] __schedule+0x1190/0x4190
[ 318.436639] ? io_schedule_timeout+0x80/0x80
[ 318.436676] ? lock_release+0xc9/0x2a0
[ 318.436726] schedule+0xd1/0x260
[ 318.436759] schedule_timeout+0x240/0x280
[ 318.436804] ? hrtimer_nanosleep_restart+0x340/0x340
[ 318.522895] ? mark_held_locks+0x49/0x70
[ 318.522930] ? lockdep_hardirqs_on_prepare+0xd7/0x170
[ 318.522952] ? _raw_spin_unlock_irq+0x23/0x40
[ 318.522977] ? trace_hardirqs_on+0x18/0x170
[ 318.523008] wait_for_completion+0x169/0x320
[ 318.523029] ? wait_for_completion_killable+0x410/0x410
[ 318.523047] ? mark_held_locks+0x49/0x70
[ 318.523068] ? lockdep_hardirqs_on_prepare+0xd7/0x170
[ 318.523088] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.523110] ? trace_hardirqs_on+0x18/0x170
[ 318.523128] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.523152] synchronize_rcu_normal+0x208/0x5f0
[ 318.523171] ? start_poll_synchronize_rcu_full+0x90/0x90
[ 318.523190] ? do_raw_spin_lock+0x123/0x290
[ 318.523219] ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 318.523242] ? rt_mutex_slowunlock+0x824/0xaf0
[ 318.523264] ? lock_is_held_type+0x9b/0x110
[ 318.523294] try_check_zero+0x429/0x630
[ 318.523321] process_srcu+0x4c1/0x16c0
[ 318.523346] ? lock_acquire+0x187/0x2e0
[ 318.523365] ? process_scheduled_works+0x4ca/0x1ac0
[ 318.523389] ? lock_release+0xc9/0x2a0
[ 318.523412] process_scheduled_works+0x553/0x1ac0
[ 318.523442] ? insert_work+0x190/0x190
[ 318.523461] ? write_comp_data+0x1f/0x70
[ 318.523478] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.523498] ? lock_is_held_type+0x9b/0x110
[ 318.523521] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.523541] worker_thread+0x5a9/0xd10
[ 318.523566] ? bh_worker+0x740/0x740
[ 318.523587] kthread+0x3f9/0x530
[ 318.523605] ? kthread_affine_node+0x2a0/0x2a0
[ 318.523627] ret_from_fork+0x666/0xab0
[ 318.523645] ? native_tss_update_io_bitmap+0x6c0/0x6c0
[ 318.523664] ? write_comp_data+0x1f/0x70
[ 318.523680] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.523698] ? __switch_to+0x854/0x1270
[ 318.523717] ? kthread_affine_node+0x2a0/0x2a0
[ 318.523738] ret_from_fork_asm+0x11/0x20
[ 318.523764] </TASK>
[ 318.526244] INFO: task repro2:4086 blocked for more than 143 seconds.
[ 318.526258] Not tainted 7.0.0-rc1-00301-g4d349ee5c778 #1
[ 318.526267] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 318.526273] task:repro2 state:D stack:25512 pid:4086
tgid:4086 ppid:298 task_flags:0x400040 flags:0x00080002
executing program[ 318.526322] Call Trace:
[ 318.526327] <TASK>
[ 318.526334] __schedule+0x1190/0x4190
[ 318.526356] ? io_schedule_timeout+0x80/0x80
[ 318.526373] ? lock_release+0xc9/0x2a0
[ 318.526396] schedule+0xd1/0x260
[ 318.526411] schedule_timeout+0x240/0x280
[ 318.526432] ? hrtimer_nanosleep_restart+0x340/0x340
[ 318.526458] ? mark_held_locks+0x49/0x70
[ 318.526478] ? lockdep_hardirqs_on_prepare+0xd7/0x170
[ 318.526499] ? _raw_spin_unlock_irq+0x23/0x40
[ 318.526520] ? trace_hardirqs_on+0x18/0x170
[ 318.526539] wait_for_completion+0x169/0x320
[ 318.526556] ? __lock_acquire+0x55a/0x1ef0
[ 318.526577] ? wait_for_completion_killable+0x410/0x410
[ 318.526597] ? lockdep_init_map_type+0x4b/0x210
[ 318.526621] __synchronize_srcu+0x143/0x230
[ 318.526643] ? start_poll_synchronize_srcu+0x10/0x10
[ 318.526666] ? rcu_tasks_pregp_step+0x10/0x10
[ 318.526687] ? kvm_clock_read+0x3b/0x60
[ 318.526707] ? write_comp_data+0x1f/0x70
[ 318.526723] ? __sanitizer_cov_trace_pc+0x1a/0x40
[ 318.526741] ? ktime_get_mono_fast_ns+0x19d/0x2b0
[ 318.526766] ? synchronize_srcu+0x53/0x260
[ 318.526789] perf_trace_event_unreg.isra.0+0xb8/0x1e0
[ 318.526807] perf_trace_destroy+0xc3/0x1c0
[ 318.526822] ? perf_tp_event_init+0x120/0x120
[ 318.529171] __free_event+0x257/0xc10
[ 318.529200] ? perf_event_release_kernel+0x460/0x460
[ 318.529215] put_event+0x3c/0x90
[ 318.529230] perf_event_release_kernel+0x357/0x460
[ 318.529248] ? perf_event_release_kernel+0x460/0x460
[ 318.529263] perf_release+0x37/0x50
[ 318.529277] __fput+0x420/0xb80
[ 318.529303] fput_close_sync+0x10f/0x230
[ 318.529325] ? alloc_file_clone+0x110/0x110
[ 318.529345] ? dnotify_flush+0x7f/0x4c0
[ 318.529372] __x64_sys_close+0x8f/0x120
[ 318.529390] do_syscall_64+0x115/0x690
[ 318.529410] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 318.529425] RIP: 0033:0x7fdfffcd611b
[ 318.529436] RSP: 002b:00007fffb7d290d0 EFLAGS: 00000293 ORIG_RAX:
0000000000000003
[ 318.529450] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fdfffcd611b
[ 318.529460] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
[ 318.529469] RBP: 00007fffb7d29100 R08: 0000000000000000 R09: 0000000000000032
[ 318.529478] R10: 0000000000000000 R11: 0000000000000293 R12: 000055b24d16e3f0
[ 318.529486] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 318.529503] </TASK>
Could you please advise whether this is a known issue/regression, and
what additional debug data would be most helpful? I can rerun with
extra debugging options if needed.
Thanks,
Zw Tang
^ permalink raw reply
* Re: [External] Re: [RFC PATCH bpf-next v3 1/3] libbpf: Optimize kprobe.session attachment for exact function names
From: Jiri Olsa @ 2026-03-02 10:58 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: bot+bpf-ci, yonghong.song, bpf, linux-open-source, ast, daniel,
andrii, rostedt, linux-trace-kernel, martin.lau, eddyz87, clm,
ihor.solodrai
In-Reply-To: <CAOu3gNgRpMd9poZYibhGg+UfQXiHpB3ucnhvghoT0TF1rFfmPQ@mail.gmail.com>
On Fri, Feb 27, 2026 at 05:28:05PM -0500, Andrey Grodzovsky wrote:
> On Fri, Feb 27, 2026 at 4:18 PM <bot+bpf-ci@kernel.org> wrote:
> >
> > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > --- a/tools/lib/bpf/libbpf.c
> > > +++ b/tools/lib/bpf/libbpf.c
> > > @@ -12041,7 +12041,15 @@ struct kprobe_multi_resolve res = {
> > > if (addrs && syms)
> > > return libbpf_err_ptr(-EINVAL);
> > >
> > > - if (pattern) {
> > > + /*
> > > + * Exact function name (no wildcards): bypass kallsyms parsing
> > > + * and pass the symbol directly to the kernel via syms[] array.
> > > + * The kernel's ftrace_lookup_symbols() resolves it efficiently.
> > > + */
> > > + if (pattern && !strpbrk(pattern, "*?")) {
> > > + syms = &pattern;
> > > + cnt = 1;
> > > + } else if (pattern) {
> >
> > When the fast path is taken here, unique_match is never checked.
> > The unique_match validation is inside the "else if (pattern)"
> > branch:
> >
> > if (unique_match && res.cnt != 1) {
> > pr_warn("prog '%s': failed to find a unique match ...");
> > err = -EINVAL;
> > goto error;
> > }
> >
> > Before this change, a caller passing an exact function name with
> > unique_match = true would go through kallsyms parsing, which could
> > find multiple addresses for same-named static functions across
> > different translation units, and the unique_match check would
> > correctly return an error.
> >
> > With the fast path, the name goes directly to the kernel's
> > ftrace_lookup_symbols(), which resolves only the first match via
> > kallsyms_callback() skipping duplicates:
> >
> > if (args->addrs[idx])
> > return 0;
> >
> > Should the fast path condition also check unique_match, for
> > example:
> >
> > if (pattern && !strpbrk(pattern, "*?") && !unique_match) {
> >
> > This would preserve the optimization for the common case while
> > falling back to the slow path when unique_match is explicitly
> > requested.
> >
>
> I am not sure this makes sense. According to the original patchset [1],
> this flag was specifically tailored for patterns with wildcards, where
> this indeed makes sense. In our case, cnt == 1 from the get-go since there
> are no wildcards, so this check can't ever fail.
I think the problem is that it could now pass for cases where the
function name has multiple instances in kallsyms, while before
it would be caught in libbpf_available_kallsyms_parse and fail
jirka
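
To make the concern concrete, here is a small sketch of the fast-path
decision with the unique_match guard the CI bot suggested. The helper names
is_exact_name() and can_skip_kallsyms() are hypothetical and not part of
libbpf; only strpbrk() and the "*?" wildcard set come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a pattern with no '*' or '?' names one function. */
static bool is_exact_name(const char *pattern)
{
	return pattern && !strpbrk(pattern, "*?");
}

/*
 * Hypothetical helper: take the fast path only when the caller did not
 * ask for uniqueness, so that same-named static functions are still
 * caught by the slow kallsyms-parsing path.
 */
static bool can_skip_kallsyms(const char *pattern, bool unique_match)
{
	return is_exact_name(pattern) && !unique_match;
}
```

With this guard, an exact name plus unique_match = true falls back to the
slow path, which is the behaviour before the patch.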
>
> Andrii, Yonghong - any suggestions ?
>
> [1] - https://lore.kernel.org/bpf/20241218225246.3170300-1-yonghong.song@linux.dev/
>
> Andrey
>
>
> > > [ ... ]
> >
> >
> > ---
> > AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> > See: https://urldefense.com/v3/__https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md__;!!BmdzS3_lV9HdKG8!z-aIXCz8YRZcraMmGI2bmb4YrDgW0brRTcX_BaJCWYwj7xfmkZL6qka6aqqIwzDPUjR1TxUU-Mc50s9AAYQf-vQMuPuGlVKW$
> >
> > CI run summary: https://urldefense.com/v3/__https://github.com/kernel-patches/bpf/actions/runs/22503275616__;!!BmdzS3_lV9HdKG8!z-aIXCz8YRZcraMmGI2bmb4YrDgW0brRTcX_BaJCWYwj7xfmkZL6qka6aqqIwzDPUjR1TxUU-Mc50s9AAYQf-vQMuGeekJPd$
> >
> > AI-authorship-score: medium
> > AI-authorship-explanation: Comments are unusually verbose for a simple optimization, and the commit message is well-structured with distinct sections, but the iterative v1-v3 refinement following reviewer feedback is typical of human development.
> > issues-found: 1
> > issue-severity-score: low
> > issue-severity-explanation: The fast path bypasses the unique_match check, which could silently attach to the wrong function among same-named statics, but requires the uncommon combination of unique_match=true with an exact name matching multiple kernel functions.
^ permalink raw reply
* [PATCH] trace: trace_events: allow multiple modules
From: Andrei-Alexandru Tachici @ 2026-03-02 10:27 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, kernel,
Andrei-Alexandru Tachici
Currently, when multiple modules are specified at boot
time in "kernel.trace_event=", only the last entry has
trace events enabled.
Reconstruct the whole array in bootup_event_buf across
multiple setup calls so that it is parsed correctly by
early_enable_events().
Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
---
Example below of a bootconfig:
kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss"
Without the patch, only qcom_aoss in the above example would have
events enabled, and debugging multiple modules that are
inserted at boot time would not be possible.
---
kernel/trace/trace_events.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 9928da636c9d..b07325e8b19a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -4491,7 +4491,11 @@ static char bootup_event_buf[COMMAND_LINE_SIZE] __initdata;
static __init int setup_trace_event(char *str)
{
- strscpy(bootup_event_buf, str, COMMAND_LINE_SIZE);
+ if (bootup_event_buf[0] != '\0')
+ strlcat(bootup_event_buf, ",", COMMAND_LINE_SIZE);
+
+ strlcat(bootup_event_buf, str, COMMAND_LINE_SIZE);
+
trace_set_ring_buffer_expanded(NULL);
disable_tracing_selftest("running event tracing");
---
base-commit: a75cb869a8ccc88b0bc7a44e1597d9c7995c56e5
change-id: 20260227-trace-events-allow-multiple-modules-2253fb5531c6
Best regards,
--
Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
^ permalink raw reply related
* Re: [PATCH net-next v2 04/10] devlink: allow to use devlink index as a command handle
From: Jiri Pirko @ 2026-03-02 10:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, edumazet, pabeni, horms, donald.hunter, corbet,
skhan, saeedm, leon, tariqt, mbloch, przemyslaw.kitszel, mschmidt,
andrew+netdev, rostedt, mhiramat, mathieu.desnoyers, chuck.lever,
matttbe, cjubran, daniel.zahka, linux-doc, linux-rdma,
linux-trace-kernel
In-Reply-To: <20260228144846.40f5dfeb@kernel.org>
Sat, Feb 28, 2026 at 11:48:46PM +0100, kuba@kernel.org wrote:
>On Wed, 25 Feb 2026 14:34:16 +0100 Jiri Pirko wrote:
>> + if (attrs[DEVLINK_ATTR_INDEX]) {
>> + index = nla_get_uint(attrs[DEVLINK_ATTR_INDEX]);
>> + devlink = devlinks_xa_lookup_get(net, index);
>> + if (!devlink)
>> + return ERR_PTR(-ENODEV);
>> + goto found;
>> + }
>> +
>> if (!attrs[DEVLINK_ATTR_BUS_NAME] || !attrs[DEVLINK_ATTR_DEV_NAME])
>> return ERR_PTR(-EINVAL);
>
>If both INDEX and BUS_NAME + DEV_NAME are provided we should check
>that they point to the same device? Or reject user space passing both?
I implemented the rejection. I don't see much value in allowing both. The
code that would do the checking is too much for this hypothetical case.
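A minimal standalone sketch of what rejecting the combination could look like.
The struct and function names are hypothetical and the actual patch may well
differ, in particular in how it treats an index combined with only one of the
two names (rejected here as an assumption).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of which handle attributes a request carries. */
struct handle_attrs {
	bool has_index;
	bool has_bus_name;
	bool has_dev_name;
};

/*
 * Returns 0 if the request names the device by exactly one handle form,
 * -EINVAL otherwise.
 */
static int check_handle_attrs(const struct handle_attrs *a)
{
	if (a->has_index) {
		/* The index is a complete handle; mixing forms is rejected. */
		if (a->has_bus_name || a->has_dev_name)
			return -EINVAL;
		return 0;
	}

	/* Without an index, both names are required. */
	if (!a->has_bus_name || !a->has_dev_name)
		return -EINVAL;

	return 0;
}
```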
^ permalink raw reply
* [PATCH 6.18.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 9:45 UTC (permalink / raw)
To: Sasha Levin
Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
linux-perf-users
In-Reply-To: <20260301011537.1669125-1-sashal@kernel.org>
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]
This script

	#!/usr/bin/bash
	echo 0 > /proc/sys/kernel/randomize_va_space
	echo 'void main(void) {}' > TEST.c
	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test
	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.
The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.
arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.
handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.
I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.
But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.
Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 10 +++++++---
3 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 845aeaf36b8d..73be14736062 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1819,3 +1819,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
else
return regs->sp <= ret->stack;
}
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+ struct thread_info *ti = current_thread_info();
+ unsigned long vaddr;
+
+ /*
+ * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+ * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+ * vm_unmapped_area_info.high_limit.
+ *
+ * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+ * but in this case in_32bit_syscall() -> in_x32_syscall() always
+ * (falsely) returns true because ->orig_ax == -1.
+ */
+ if (test_thread_flag(TIF_ADDR32))
+ ti->status |= TS_COMPAT;
+ vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+ ti->status &= ~TS_COMPAT;
+
+ return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ee3d36eda45d..f548fea2adec 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
extern void arch_uprobe_init_state(struct mm_struct *mm);
extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
+extern unsigned long arch_uprobe_get_xol_area(void);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f11ceb8be8c4..4e45236064dc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
.mremap = xol_mremap,
};
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+ /* Try to map as high as possible, this is only a hint. */
+ return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
/* Slot allocation for XOL */
static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
{
@@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
}
if (!area->vaddr) {
- /* Try to map as high as possible, this is only a hint. */
- area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
- PAGE_SIZE, 0, 0);
+ area->vaddr = arch_uprobe_get_xol_area();
if (IS_ERR_VALUE(area->vaddr)) {
ret = area->vaddr;
goto fail;
--
2.52.0
^ permalink raw reply related
* Re: FAILED: Patch "x86/uprobes: Fix XOL allocation failure for 32-bit tasks" failed to apply to 6.18-stable tree
From: Oleg Nesterov @ 2026-03-02 9:45 UTC (permalink / raw)
To: Sasha Levin
Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
linux-perf-users
In-Reply-To: <20260301011537.1669125-1-sashal@kernel.org>
On 02/28, Sasha Levin wrote:
>
> The patch below does not apply to the 6.18-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git commit
> id to <stable@vger.kernel.org>.
I never know how to react to "failed to apply to stable" emails. I am going
to send [PATCH 6.18.y] in reply to this email.
Is it OK?
Oleg.
> Thanks,
> Sasha
>
> ------------------ original commit in Linus's tree ------------------
>
> From d55c571e4333fac71826e8db3b9753fadfbead6a Mon Sep 17 00:00:00 2001
> From: Oleg Nesterov <oleg@redhat.com>
> Date: Sun, 11 Jan 2026 16:00:37 +0100
> Subject: [PATCH] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
>
> This script
>
> #!/usr/bin/bash
>
> echo 0 > /proc/sys/kernel/randomize_va_space
>
> echo 'void main(void) {}' > TEST.c
>
> # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
> gcc -m32 -fcf-protection=branch TEST.c -o test
>
> bpftrace -e 'uprobe:./test:main {}' -c ./test
>
> "hangs", the probed ./test task enters an endless loop.
>
> The problem is that with randomize_va_space == 0
> get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
> just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
> by the stack vma.
>
> arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
> in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
> vm_unmapped_area() happily returns the high address > TASK_SIZE and then
> get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
> check.
>
> handle_swbp() doesn't report this failure (probably it should) and silently
> restarts the probed insn. Endless loop.
>
> I think that the right fix should change the x86 get_unmapped_area() paths
> to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
> CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
> because ->orig_ax = -1.
>
> But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
> the probed task is 32-bit to make in_ia32_syscall() true.
>
> Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
> Reported-by: Paulo Andrade <pandrade@redhat.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
> Cc: stable@vger.kernel.org
> Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
> ---
> arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
> include/linux/uprobes.h | 1 +
> kernel/events/uprobes.c | 10 +++++++---
> 3 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 7be8e361ca55b..619dddf54424e 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1823,3 +1823,27 @@ bool is_uprobe_at_func_entry(struct pt_regs *regs)
>
> return false;
> }
> +
> +#ifdef CONFIG_IA32_EMULATION
> +unsigned long arch_uprobe_get_xol_area(void)
> +{
> + struct thread_info *ti = current_thread_info();
> + unsigned long vaddr;
> +
> + /*
> + * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
> + * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
> + * vm_unmapped_area_info.high_limit.
> + *
> + * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
> + * but in this case in_32bit_syscall() -> in_x32_syscall() always
> + * (falsely) returns true because ->orig_ax == -1.
> + */
> + if (test_thread_flag(TIF_ADDR32))
> + ti->status |= TS_COMPAT;
> + vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
> + ti->status &= ~TS_COMPAT;
> +
> + return vaddr;
> +}
> +#endif
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index ee3d36eda45dd..f548fea2adec8 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
> extern void arch_uprobe_init_state(struct mm_struct *mm);
> extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
> extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
> +extern unsigned long arch_uprobe_get_xol_area(void);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index a7d7d83ca1d78..dfbce021fb027 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
> .mremap = xol_mremap,
> };
>
> +unsigned long __weak arch_uprobe_get_xol_area(void)
> +{
> + /* Try to map as high as possible, this is only a hint. */
> + return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
> +}
> +
> /* Slot allocation for XOL */
> static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> {
> @@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
> }
>
> if (!area->vaddr) {
> - /* Try to map as high as possible, this is only a hint. */
> - area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
> - PAGE_SIZE, 0, 0);
> + area->vaddr = arch_uprobe_get_xol_area();
> if (IS_ERR_VALUE(area->vaddr)) {
> ret = area->vaddr;
> goto fail;
> --
> 2.51.0
>
>
>
>
^ permalink raw reply
* Re: [PATCH net-next v2 06/10] devlink: add devlink_dev_driver_name() helper and use it in trace events
From: Jiri Pirko @ 2026-03-02 9:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, edumazet, pabeni, horms, donald.hunter, corbet,
skhan, saeedm, leon, tariqt, mbloch, przemyslaw.kitszel, mschmidt,
andrew+netdev, rostedt, mhiramat, mathieu.desnoyers, chuck.lever,
matttbe, cjubran, daniel.zahka, linux-doc, linux-rdma,
linux-trace-kernel
In-Reply-To: <20260228145805.758ff8c0@kernel.org>
Sat, Feb 28, 2026 at 11:58:05PM +0100, kuba@kernel.org wrote:
>On Wed, 25 Feb 2026 14:34:18 +0100 Jiri Pirko wrote:
>> +const char *devlink_dev_driver_name(const struct devlink *devlink)
>> +{
>> + struct device *dev = devlink->dev;
>> +
>> + return dev ? dev->driver->name : NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(devlink_dev_driver_name);
>
>You say we need this in prep for shared instances, which is fair, but
>shared instances should presumably share across the same driver, most
>of the time? So perhaps we should do a similar thing here as you did to
>the bus/dev name? Maybe when shared instance is allocated:
>
> devlink->driver_name = kasprintf("%s+", dev->driver);
>
>And then:
>
>+const char *devlink_dev_driver_name(const struct devlink *devlink)
>+{
>+ struct device *dev = devlink->dev;
>+
>+ return dev ? dev->driver->name : devlink->driver_name;
>+}
>+EXPORT_SYMBOL_GPL(devlink_dev_driver_name);
>
>?
>
Good idea. Will add it in some form.
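For illustration only, the fallback suggested above can be exercised in
userspace with stub structures. The struct definitions below are minimal
stand-ins that carry just the fields the helper touches, and the driver_name
field is the hypothetical addition from the suggestion, not an existing member.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel structures; only the fields used here. */
struct device_driver {
	const char *name;
};

struct device {
	struct device_driver *driver;
};

struct devlink {
	struct device *dev;		/* NULL for a shared instance */
	const char *driver_name;	/* hypothetical: set at shared-instance allocation */
};

/* Prefer the bound device's driver name, else the stored fallback. */
static const char *devlink_dev_driver_name(const struct devlink *devlink)
{
	struct device *dev = devlink->dev;

	return dev ? dev->driver->name : devlink->driver_name;
}
```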
^ permalink raw reply
* Re: [PATCH net-next v2 08/10] devlink: introduce shared devlink instance for PFs on same chip
From: Jiri Pirko @ 2026-03-02 9:30 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, edumazet, pabeni, horms, donald.hunter, corbet,
skhan, saeedm, leon, tariqt, mbloch, przemyslaw.kitszel, mschmidt,
andrew+netdev, rostedt, mhiramat, mathieu.desnoyers, chuck.lever,
matttbe, cjubran, daniel.zahka, linux-doc, linux-rdma,
linux-trace-kernel
In-Reply-To: <20260228150311.1a1ded74@kernel.org>
Sun, Mar 01, 2026 at 12:03:11AM +0100, kuba@kernel.org wrote:
>On Wed, 25 Feb 2026 14:34:20 +0100 Jiri Pirko wrote:
>> +struct devlink_shd {
>> + struct list_head list; /* Node in shd list */
>> + const char *id; /* Identifier string (e.g., serial number) */
>> + refcount_t refcount; /* Reference count */
>> + char priv[] __aligned(NETDEV_ALIGN); /* Driver private data */
>> +};
>
>As pointed out by AI you promised a size member and a __counted_by()
>annotation :)
Yeah, somehow I got the false impression that this is not needed for priv.
My bad, sorry, adding it.
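A userspace sketch of the general pattern, assuming a size member named
priv_size (the real field name is not settled in this thread). COUNTED_BY()
is a local macro standing in for the kernel's __counted_by(), and it expands
to nothing on compilers that lack the attribute.

```c
#include <stdlib.h>

/* Use the counted_by attribute where the compiler offers it. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct shd_sketch {
	const char *id;				/* identifier string */
	size_t priv_size;			/* number of bytes in priv[] */
	char priv[] COUNTED_BY(priv_size);	/* driver private data */
};

/* Allocate a zeroed instance with priv_size bytes of private data. */
static struct shd_sketch *shd_alloc(const char *id, size_t priv_size)
{
	struct shd_sketch *shd;

	shd = calloc(1, sizeof(*shd) + priv_size);
	if (!shd)
		return NULL;

	/* Set the counter before priv[] is ever accessed. */
	shd->priv_size = priv_size;
	shd->id = id;
	return shd;
}
```

With the annotation in place, bounds-checking instrumentation can compare
accesses to priv[] against priv_size, which is why the counter has to be
assigned before the first access.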
^ permalink raw reply
* Re: [PATCH net-next v2 09/10] documentation: networking: add shared devlink documentation
From: Jiri Pirko @ 2026-03-02 9:09 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, davem, edumazet, pabeni, horms, donald.hunter, corbet,
skhan, saeedm, leon, tariqt, mbloch, przemyslaw.kitszel, mschmidt,
andrew+netdev, rostedt, mhiramat, mathieu.desnoyers, chuck.lever,
matttbe, cjubran, daniel.zahka, linux-doc, linux-rdma,
linux-trace-kernel
In-Reply-To: <20260228150558.46f3be36@kernel.org>
Sun, Mar 01, 2026 at 12:05:58AM +0100, kuba@kernel.org wrote:
>On Wed, 25 Feb 2026 14:34:21 +0100 Jiri Pirko wrote:
>> +Shared devlink instances allow multiple physical functions (PFs) on the same
>> +chip to share an additional devlink instance for chip-wide operations. This
>> +is implemented within individual drivers alongside the individual PF devlink
>> +instances, not replacing them.
>
>Sounds like you want to preclude what was the goal in the discussion
>with Przemek you quoted - a shared instance _only_ case. We don't have
>to implement it today, but I think it's an entirely sane direction.
>So the docs should not state otherwise.
Fair enough, I can do it.
^ permalink raw reply
* [BUG] kprobes: WARNING in __arm_kprobe_ftrace when kprobe-ftrace arming fails with -ENOMEM under fault injection
From: Zw Tang @ 2026-03-02 9:00 UTC (permalink / raw)
To: Naveen N Rao, Masami Hiramatsu, Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, linux-perf-users,
Arnaldo Carvalho de Melo
Hi,
I am reporting a WARNING triggered by a syzkaller reproducer on Linux 7.0.0-rc1.
The kernel hits a WARN in kprobes while trying to arm a kprobe via ftrace:
Failed to arm kprobe-ftrace at __split_text_end+0x4/0x11 (error -12)
WARNING: kernel/kprobes.c:1147 at __arm_kprobe_ftrace()
This seems to be triggered through perf_event_open() -> trace_kprobe
-> kprobes. The reproducer enables systematic fault injection and
injects a failure (nth=7), and the arming path returns -ENOMEM (-12).
Instead of cleanly failing, kprobes emits a WARNING.
This is reproducible only with fault injection enabled.
Reproducer:
C reproducer: https://pastebin.com/raw/casZvuLe
console output: https://pastebin.com/raw/1xkwRUmc
kernel config: https://pastebin.com/raw/8Er8SZz0
Kernel:
git tree: torvalds/linux
commit: 4d349ee5c7782f8b27f6cb550f112c5e26fff38d
kernel version: 7.0.0-rc1-00301-g4d349ee5c778 #5 PREEMPT_RT (lazy)
hardware: QEMU Ubuntu 24.10
[ 92.516728] WARNING: kernel/kprobes.c:1147 at
arm_kprobe+0x563/0x620, CPU#0: syz.1.94/783
[ 92.516766] Modules linked in:
[ 92.516809] CPU: 0 UID: 0 PID: 783 Comm: syz.1.94 Not tainted
7.0.0-rc1-00301-g4d349ee5c778 #5 PREEMPT_{RT,(lazy)}
0b4dbcd6f14740930e77a74387d10aec6dbca841
[ 92.516842] Hardware name: QEMU Ubuntu 24.10 PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 92.516855] RIP: 0010:arm_kprobe+0x56a/0x620
[ 92.516885] Code: ff 4c 89 fa 48 b8 00 00 00 00 00 fc ff df 48 c1
ea 03 80 3c 02 00 0f 85 a8 00 00 00 48 8d 3d cd d3 8d 06 48 8b 75 28
44 89 e2 <67> 48 0f b9 3a e9 81 fc ff ff e8 87 8c ff ff 48 8d 3d c0 d3
8d 06
[ 92.516905] RSP: 0018:ffff88800faf7a48 EFLAGS: 00010246
[ 92.516924] RAX: dffffc0000000000 RBX: ffffffff89f46b40 RCX: 0000000000000000
[ 92.516939] RDX: 00000000fffffff4 RSI: ffffffff81200004 RDI: ffffffff88481300
[ 92.516955] RBP: ffff88800c566a18 R08: 0000000000000000 R09: fffffbfff108bacb
[ 92.516969] R10: fffffbfff108baca R11: ffffffff8845d657 R12: 00000000fffffff4
[ 92.516984] R13: ffffffff8845dc20 R14: ffff88800c566a90 R15: ffff88800c566a40
[ 92.517002] FS: 00007f42ed38f6c0(0000) GS:ffff8880e224e000(0000)
knlGS:0000000000000000
[ 92.517023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 92.517041] CR2: 00007fe44aa68710 CR3: 000000000e340000 CR4: 0000000000350ef0
[ 92.517057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 92.517073] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 92.517090] Call Trace:
[ 92.517099] <TASK>
[ 92.517126] enable_kprobe+0x1fc/0x2c0
[ 92.517173] enable_trace_kprobe+0x227/0x4b0
[ 92.517240] kprobe_register+0x84/0xc0
[ 92.517279] perf_trace_event_init+0x527/0xa20
[ 92.517329] perf_kprobe_init+0x156/0x200
[ 92.517367] perf_kprobe_event_init+0x101/0x1c0
[ 92.517406] perf_try_init_event+0x145/0xa10
[ 92.517458] perf_event_alloc+0x1f91/0x5390
[ 92.517509] ? perf_event_alloc+0x1e4d/0x5390
[ 92.517586] ? perf_event_mmap_output+0xf00/0xf00
[ 92.517709] __do_sys_perf_event_open+0x557/0x2d50
[ 92.517762] ? write_comp_data+0x29/0x80
[ 92.517788] ? irqentry_exit+0x157/0xb20
[ 92.517822] ? perf_release+0x50/0x50
[ 92.517848] ? irqentry_exit+0x157/0xb20
[ 92.517897] ? __split_text_end+0x4/0x11
[ 92.517956] ? tracer_hardirqs_on+0x80/0x3b0
[ 92.517986] ? do_syscall_64+0x94/0x1160
[ 92.518022] ? __sanitizer_cov_trace_pc+0x20/0x50
[ 92.518072] do_syscall_64+0x129/0x1160
[ 92.518118] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 92.518142] RIP: 0033:0x7f42ee92ebe9
[ 92.518164] Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89
01 48
[ 92.518184] RSP: 002b:00007f42ed38f038 EFLAGS: 00000246 ORIG_RAX:
000000000000012a
[ 92.518207] RAX: ffffffffffffffda RBX: 00007f42eeb65fa0 RCX: 00007f42ee92ebe9
[ 92.518222] RDX: 0000000000000000 RSI: ffffffffffffffff RDI: 0000200000000140
[ 92.518248] RBP: 00007f42ed38f090 R08: 0000000000000008 R09: 0000000000000000
[ 92.518262] R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000001
[ 92.518277] R13: 00007f42eeb66038 R14: 00007f42eeb65fa0 R15: 00007fff172e7218
[ 92.518358] </TASK>
Notes:
The reproducer sets up fault injection (/proc/thread-self/fail-nth,
failslab/fail_page_alloc knobs) and injects nth=7 before calling
perf_event_open().
The failure is reported as -ENOMEM when arming kprobe-ftrace, and the
WARN is triggered in __arm_kprobe_ftrace().
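For readers unfamiliar with the knob, the per-thread injection step looks
roughly like the sketch below. It is not taken from the linked reproducer,
set_fail_nth() is a made-up helper name, and it assumes a kernel built with
fault injection support so that /proc/thread-self/fail-nth exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to fail the nth fault-injectable call made by this
 * thread; writing 0 disables injection again.
 * Returns 0 on success or a negative errno value.
 */
static int set_fail_nth(int nth)
{
	char buf[16];
	ssize_t written;
	int fd, len, ret = 0;

	fd = open("/proc/thread-self/fail-nth", O_WRONLY);
	if (fd < 0)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", nth);
	written = write(fd, buf, len);
	if (written < 0)
		ret = -errno;
	else if (written != len)
		ret = -EIO;

	close(fd);
	return ret;
}
```

A caller would invoke set_fail_nth(7) immediately before the system call
under test, here perf_event_open(), so that the counter applies to the
allocations made inside that call.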
Thanks,
Zw Tang
^ permalink raw reply
* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: David Hildenbrand (Arm) @ 2026-03-02 8:52 UTC (permalink / raw)
To: Andre Ramos, akpm, hannes
Cc: linux-mm, linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <CALXtAv3u1hgLkBEbEgR3=r_iz3=KrnHB8B-=tg8Q3CEOWAPFiA@mail.gmail.com>
On 3/2/26 04:45, Andre Ramos wrote:
> Introduce /dev/ampress, a bidirectional fd-based interface for
> cooperative memory reclaim between the kernel and userspace.
I'm very sure this should be tagged as RFC.
>
> Userspace processes open /dev/ampress and block on read() to receive
> struct ampress_event notifications carrying a graduated urgency level
> (LOW/MEDIUM/HIGH/FATAL), the NUMA node of the pressure source, and a
> suggested reclaim target in KiB. After freeing memory the process
> issues AMPRESS_IOC_ACK to close the feedback loop.
>
> The feature hooks into balance_pgdat() in mm/vmscan.c, mapping the
> kswapd scan priority to urgency bands:
> priority 10-12 -> LOW
> priority 7-9 -> MEDIUM
> priority 4-6 -> HIGH
> priority 1-3 -> FATAL
>
> ampress_notify() is IRQ-safe (read_lock_irqsave + spin_lock_irqsave,
> no allocations) so it can be called from any reclaim context.
> Per-subscriber events overwrite without queuing to prevent unbounded
> backlog. A debugfs trigger at /sys/kernel/debug/ampress/inject allows
> testing without real memory pressure.
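The quoted priority bands can be written out as a small lookup. This is a
sketch of the mapping as described in the quoted text, not code from the
patch: the enum and function names are hypothetical, and returning -1 for
priorities outside 1-12 (including 0) is an assumption, since the description
does not say what happens there.

```c
/* Hypothetical urgency levels, named after the quoted description. */
enum urgency_sketch {
	URGENCY_LOW,
	URGENCY_MEDIUM,
	URGENCY_HIGH,
	URGENCY_FATAL,
};

/*
 * Map a kswapd scan priority to an urgency band.
 * Returns -1 for priorities the description does not cover.
 */
static int priority_to_urgency(int priority)
{
	if (priority >= 10 && priority <= 12)
		return URGENCY_LOW;
	if (priority >= 7 && priority <= 9)
		return URGENCY_MEDIUM;
	if (priority >= 4 && priority <= 6)
		return URGENCY_HIGH;
	if (priority >= 1 && priority <= 3)
		return URGENCY_FATAL;
	return -1;
}
```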
[...]
>
> +ADAPTIVE MEMORY PRESSURE SIGNALING (AMPRESS)
> +M: Darabat <playbadly1@gmail.com>
> +L: linux-mm@kvack.org
> +S: Maintained
> +F: include/linux/ampress.h
> +F: include/trace/events/ampress.h
> +F: include/uapi/linux/ampress.h
> +F: mm/ampress.c
> +F: mm/ampress_test.c
> +F: tools/testing/ampress/
We generally don't make new kernel contributors MM maintainers.
But what sticks out more is the inconsistency between your name+mail and
"Darabat <playbadly1@gmail.com>".
--
Cheers,
David
^ permalink raw reply