Linux Trace Kernel


* Re: [PATCH v6 05/16] Documentation/rv: Add documentation about hybrid automata
From: Gabriele Monaco @ 2026-03-02 14:23 UTC
  To: Juri Lelli
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, linux-trace-kernel, linux-doc, Tomas Glozar,
	Clark Williams, John Kacur
In-Reply-To: <aaWXmBVIvTlVtiRp@jlelli-thinkpadt14gen4.remote.csb>

Hello,

On Mon, 2026-03-02 at 14:58 +0100, Juri Lelli wrote:
> Considering the spec above, does the 'event' need to be 'enqueue'
> instead of 'sched_wakeup' (or the other way around)? Or maybe it's
> equivalent?

Good catch. In fact, enqueue/dequeue don't work well for this model (the actual
stall monitor uses wakeup), but I should keep the already simplified example
consistent.

Thanks,
Gabriele



* Re: [PATCH v6 12/16] sched: Add deadline tracepoints
From: Gabriele Monaco @ 2026-03-02 14:24 UTC
  To: Juri Lelli
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Masami Hiramatsu, Ingo Molnar, Peter Zijlstra, linux-trace-kernel,
	Phil Auld, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <aaWblsXCWP-L0WbB@jlelli-thinkpadt14gen4.remote.csb>

Hello,

On Mon, 2026-03-02 at 15:15 +0100, Juri Lelli wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ca79ff58fca..b5bb2eb112bf 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -124,6 +124,10 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_enqueue_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dequeue_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_throttle_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_replenish_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_start_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dl_server_stop_tp);
> 
> Don't we need to export sched_dl_update_tp as well?

Right, we do.

> 
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >  DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
> 
> ...
> 
> > @@ -1532,7 +1551,8 @@ static void update_curr_dl_se(struct rq *rq, struct
> > sched_dl_entity *dl_se, s64
> >  
> >  		if (!is_leftmost(dl_se, &rq->dl))
> >  			resched_curr(rq);
> > -	}
> > +	} else
> > +		trace_sched_dl_update_tp(dl_se, cpu_of(rq),
> > dl_get_type(dl_se, rq));
> 
> This wants braces even if it's a single statement.

Alright, will fix.

Thanks,
Gabriele



* Re: [PATCH v3 07/18] rtla: Add strscpy() and replace strncpy() calls
From: Tomas Glozar @ 2026-03-02 14:33 UTC
  To: Wander Lairson Costa
  Cc: Steven Rostedt, Crystal Wood, Ivan Pravdin, Costa Shulyupin,
	John Kacur, Tiezhu Yang, Haiyong Sun, Daniel Wagner,
	Daniel Bristot de Oliveira,
	open list:Real-time Linux Analysis (RTLA) tools,
	open list:Real-time Linux Analysis (RTLA) tools,
	open list:BPF [MISC]:Keyword:(?:\b|_)bpf(?:\b|_)
In-Reply-To: <20260115163650.118910-8-wander@redhat.com>

On Thu, Jan 15, 2026 at 18:26, Wander Lairson Costa
<wander@redhat.com> wrote:
>
> Introduce a userspace strscpy() implementation that matches the Linux
> kernel's strscpy() semantics. The function is built on top of glibc's
> strlcpy() and provides guaranteed NUL-termination along with proper
> truncation detection through its return value.
>
> The previous strncpy() calls had potential issues: strncpy() does not
> guarantee NUL-termination when the source string length equals or
> exceeds the destination buffer size. This required defensive patterns
> like pre-zeroing buffers or manually setting the last byte to NUL.
> The new strscpy() function always NUL-terminates the destination buffer
> unless the size is zero, and returns -E2BIG on truncation, making error
> handling cleaner and more consistent with kernel code.
>
> Note that unlike the kernel's strscpy(), this implementation uses
> strlcpy() internally, which reads the entire source string to determine
> its length. The kernel avoids this to prevent potential DoS attacks from
> extremely long untrusted strings. This is harmless for a userspace CLI
> tool like rtla where input sources are bounded and trusted.
>

strlcpy() was only added in glibc 2.38 [1], so it is not available on
systems with an older glibc, such as RHEL 9. Using it to implement
strscpy() causes RTLA to fail to build on those systems.

[1] https://www.gnu.org/software/gnulib/manual/html_node/strlcpy.html
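
One possible way around the glibc dependency is to build the helper on
strnlen() and memcpy(), which older glibc versions already provide. The
following is only a sketch under that assumption, not the implementation
from the patch; the name strscpy_fallback() is hypothetical. It keeps the
semantics described in the commit message: the destination is always
NUL-terminated unless the size is zero, and -E2BIG signals truncation.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical fallback: copy at most size - 1 bytes and always
 * NUL-terminate. Returns the number of bytes copied (excluding the NUL),
 * or -E2BIG if src was truncated or size is 0.
 */
static ssize_t strscpy_fallback(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -E2BIG;

	/* strnlen() inspects at most size bytes of src */
	len = strnlen(src, size);
	if (len == size) {
		/* src does not fit: keep the first size - 1 bytes */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len + 1);	/* includes the terminating NUL */
	return (ssize_t)len;
}
```

As a side effect, this variant does not read the whole source string when
it is longer than the destination, which is closer to the kernel behaviour
mentioned in the commit message.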

> Replace all strncpy() calls in rtla with strscpy(), using sizeof() for
> buffer sizes instead of magic constants to ensure the sizes stay in
> sync with the actual buffer declarations. Also remove a now-redundant
> memset() call that was previously needed to work around strncpy()
> behavior.
>
> Signed-off-by: Wander Lairson Costa <wander@redhat.com>
> ---
>  tools/tracing/rtla/src/timerlat_aa.c |  6 ++---
>  tools/tracing/rtla/src/utils.c       | 34 ++++++++++++++++++++++++++--
>  tools/tracing/rtla/src/utils.h       |  1 +
>  3 files changed, 36 insertions(+), 5 deletions(-)
>

Tomas



* Re: [PATCH v6 15/16] rv: Add deadline monitors
From: Juri Lelli @ 2026-03-02 14:37 UTC
  To: Gabriele Monaco
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, Masami Hiramatsu, linux-trace-kernel, linux-doc,
	Peter Zijlstra, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <20260225095122.80683-16-gmonaco@redhat.com>

Hello,

On 25/02/26 10:51, Gabriele Monaco wrote:
> Add the deadline monitors collection to validate the deadline scheduler,
> both for deadline tasks and servers.
> 
> The currently implemented monitors are:
> * throttle:
>     validate dl entities are throttled when they use up their runtime
> * nomiss:
>     validate dl entities run to completion before their deadline
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Nam Cao <namcao@linutronix.de>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---

...

> +static inline int extract_params(struct pt_regs *regs, long id, struct task_struct **p)
> +{
> +	size_t size = offsetof(struct sched_attr, sched_nice);
> +	struct sched_attr __user *uattr, attr;
> +	int new_policy = -1, ret;
> +	unsigned long args[6];
> +	pid_t pid;
> +
> +	switch (id) {
> +	case __NR_sched_setscheduler:
> +		syscall_get_arguments(current, regs, args);
> +		pid = args[0];
> +		new_policy = args[1];
> +		break;
> +	case __NR_sched_setattr:
> +		syscall_get_arguments(current, regs, args);
> +		pid = args[0];
> +		uattr = (void *)args[1];
> +		/*
> +		 * Just copy up to sched_flags, we are not interested after that
> +		 */
> +		ret = copy_struct_from_user(&attr, size, uattr, size);
> +		if (ret)
> +			return ret;
> +		if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> +			return -EINVAL;
> +		new_policy = attr.sched_policy;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	if (!pid)
> +		*p = current;
> +	else {
> +		/*
> +		 * Required for find_task_by_vpid, make sure the caller doesn't
> +		 * need to get_task_struct().
> +		 */
> +		guard(rcu)();
> +		*p = find_task_by_vpid(pid);
> +		if (unlikely(!*p))
> +			return -EINVAL;
> +	}

Not sure I get this comment. RCU is released when the function returns,
but then the task pointer is dereferenced by callers?

Thanks,
Juri



* Re: [PATCHv6 bpf-next 9/9] bpf,x86: Use single ftrace_ops for direct calls
From: Steven Rostedt @ 2026-03-02 15:10 UTC
  To: Jiri Olsa
  Cc: Ihor Solodrai, Florent Revest, Mark Rutland, bpf, linux-kernel,
	linux-trace-kernel, linux-arm-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Menglong Dong, Song Liu,
	Kumar Kartikeya Dwivedi
In-Reply-To: <aaVFeafZb76l9L0m@krava>

On Mon, 2 Mar 2026 09:08:25 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> > As there's nothing after the comment and before the end of the block.  
> 
> ok, will do.. the original changes:
> 
>   05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
>   8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")
> 
> went through bpf tree, so I'll send the fix the same way,
> please let me know otherwise

That's fine, as long as I give a Reviewed-by tag.

Thanks,

-- Steve


* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: Johannes Weiner @ 2026-03-02 15:11 UTC
  To: Andre Ramos
  Cc: akpm, linux-mm, linux-kernel, linux-trace-kernel, david, rostedt
In-Reply-To: <CALXtAv3u1hgLkBEbEgR3=r_iz3=KrnHB8B-=tg8Q3CEOWAPFiA@mail.gmail.com>

On Mon, Mar 02, 2026 at 12:45:33AM -0300, Andre Ramos wrote:
> Introduce /dev/ampress, a bidirectional fd-based interface for
> cooperative memory reclaim between the kernel and userspace.
> 
> Userspace processes open /dev/ampress and block on read() to receive
> struct ampress_event notifications carrying a graduated urgency level
> (LOW/MEDIUM/HIGH/FATAL), the NUMA node of the pressure source, and a
> suggested reclaim target in KiB. After freeing memory the process
> issues AMPRESS_IOC_ACK to close the feedback loop.
> 
> The feature hooks into balance_pgdat() in mm/vmscan.c, mapping the
> kswapd scan priority to urgency bands:
>   priority 10-12 -> LOW
>   priority  7-9  -> MEDIUM
>   priority  4-6  -> HIGH
>   priority  1-3  -> FATAL

The scan priority is not a good proxy for pressure. We actually export
reclaim efficiency-based pressure levels like this in memory cgroups
v1, but they're being deprecated[1] in favor of PSI [2].

What are you trying to accomplish?

[1] 340afb8027fa ("memcg: initiate deprecation of pressure_level")
[2] Documentation/accounting/psi.rst
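
For readers unfamiliar with PSI, a minimal userspace sketch of a PSI
trigger follows, based on the monitor interface described in [2]: a
threshold string is written to a pressure file and the descriptor is then
polled for POLLPRI. The helper names psi_format_trigger() and
psi_wait_once() are hypothetical, and the thresholds in the usage note are
example values, not recommendations.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build a trigger string such as "some 150000 1000000" (both in microseconds). */
static int psi_format_trigger(char *buf, size_t size, const char *kind,
			      unsigned long stall_us, unsigned long window_us)
{
	int n = snprintf(buf, size, "%s %lu %lu", kind, stall_us, window_us);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Register the trigger and block until it fires once; returns 0 on an event. */
static int psi_wait_once(const char *path, const char *trigger)
{
	struct pollfd pfd;
	int ret = -1;

	pfd.fd = open(path, O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLPRI;

	/* The example in psi.rst writes strlen() + 1 bytes; do the same here */
	if (write(pfd.fd, trigger, strlen(trigger) + 1) < 0)
		goto out;
	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI))
		ret = 0;	/* stall threshold exceeded within the window */
out:
	close(pfd.fd);
	return ret;
}
```

With a kernel built with PSI support, calling
psi_wait_once("/proc/pressure/memory", "some 150000 1000000") would block
until some tasks have been stalled on memory for 150ms within a 1s window.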


* Re: [PATCH v6 15/16] rv: Add deadline monitors
From: Gabriele Monaco @ 2026-03-02 15:11 UTC
  To: Juri Lelli
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, Masami Hiramatsu, linux-trace-kernel, linux-doc,
	Peter Zijlstra, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <aaWgvBXaNg48qYRl@jlelli-thinkpadt14gen4.remote.csb>

On Mon, 2026-03-02 at 15:37 +0100, Juri Lelli wrote:
> > +	if (!pid)
> > +		*p = current;
> > +	else {
> > +		/*
> > +		 * Required for find_task_by_vpid, make sure the caller
> > doesn't
> > +		 * need to get_task_struct().
> > +		 */
> > +		guard(rcu)();
> > +		*p = find_task_by_vpid(pid);
> > +		if (unlikely(!*p))
> > +			return -EINVAL;
> > +	}
> 
> Not sure I get this comment. RCU is released when the function returns,
> but then the task pointer is dereferenced by callers?

The idea was that the caller should ensure there's no need to do
get_task_struct() (which is fine within the syscall, I'm assuming).

But looking at it again, that's not even necessary as long as the caller locked
RCU, which they should do instead of guarding here.

So yeah, the comment is misleading and I should just do:

  guard(rcu)();
  extract_params(...);
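
The lifetime issue can be illustrated with a small userspace analogy. This
is only a sketch: it uses the compiler's cleanup attribute, which is what
scope-based guards are built on, and a plain counter standing in for the
RCU read-side section. All names (fake_guard(), read_depth, lookup() and
so on) are hypothetical and are not kernel APIs.

```c
#include <stddef.h>

/* Stand-in for the RCU read-side nesting depth; not a real RCU API. */
static int read_depth;

static void fake_read_lock(void) { read_depth++; }
static void fake_read_unlock(int *unused) { (void)unused; read_depth--; }

/* Scope guard: unlocks automatically when the enclosing scope ends. */
#define fake_guard() \
	int guard_token __attribute__((cleanup(fake_read_unlock), unused)) = \
		(fake_read_lock(), 0)

struct task { int pid; };
static struct task tasks[] = { { 1 }, { 42 } };

static struct task *lookup(int pid)
{
	for (size_t i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
		if (tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/* Guard in the callee: the lock is already dropped when the caller gets p. */
static struct task *lookup_guard_inside(int pid)
{
	fake_guard();
	return lookup(pid);
}

/* Guard in the caller: p stays protected for as long as it is used. */
static int use_with_caller_guard(int pid)
{
	fake_guard();
	struct task *p = lookup(pid);

	return p ? read_depth : -1;	/* depth observed while p is in use */
}
```

After lookup_guard_inside() returns, read_depth is back to 0 although the
caller still holds the pointer, which is the situation Juri pointed out.
In use_with_caller_guard() the depth is 1 for the whole time the pointer
is in use.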

Thanks for the observation,
Gabriele



* [PATCH 6.12.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 15:14 UTC
  To: Sasha Levin
  Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260301012146.1677811-1-sashal@kernel.org>

[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 9194695662b2..6abb25ca9cd1 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1223,3 +1223,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d0cb0e02cd6a..34be2d579045 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -146,6 +146,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e30c4dd345f4..80cd12eb5854 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1496,6 +1496,12 @@ static const struct vm_special_mapping xol_mapping = {
 	.fault = xol_fault,
 };
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1511,9 +1517,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.52.0
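
The failure described in the commit message above can be reproduced as
plain arithmetic. This is only a sketch: check_area() is a hypothetical
stand-in with the same shape as the "if (addr > TASK_SIZE - len)" check,
and the constants are illustrative assumptions (4 KiB pages, a 32-bit task
limit just below 4 GiB), not values taken from the patch.

```c
#include <errno.h>

/* Same shape as the check quoted in the commit message; name is hypothetical. */
static long long check_area(unsigned long long addr,
			    unsigned long long task_size,
			    unsigned long long len)
{
	if (addr > task_size - len)
		return -ENOMEM;
	return (long long)addr;
}

/* Illustrative values (assumptions): 4 KiB pages, 32-bit task limit. */
#define EX_PAGE_SIZE	0x1000ULL
#define EX_TASK_SIZE_32	0xffffe000ULL
```

An address picked against a 64-bit high limit, for example
0x7ffff7ff0000, fails this check for a 32-bit task and yields -ENOMEM,
while an address below the 32-bit limit, for example 0xf7ff0000, passes.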




* Re: [PATCH v4 5/5] mm: add tracepoints for zone lock
From: Dmitry Ilvokhin @ 2026-03-02 15:18 UTC
  To: Steven Rostedt
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Masami Hiramatsu, Mathieu Desnoyers, Rafael J. Wysocki,
	Pavel Machek, Len Brown, Brendan Jackman, Johannes Weiner, Zi Yan,
	Oscar Salvador, Qi Zheng, Shakeel Butt, linux-kernel, linux-mm,
	linux-trace-kernel, linux-pm
In-Reply-To: <20260227144649.3dbff742@gandalf.local.home>

On Fri, Feb 27, 2026 at 02:46:49PM -0500, Steven Rostedt wrote:
> On Fri, 27 Feb 2026 16:00:27 +0000
> Dmitry Ilvokhin <d@ilvokhin.com> wrote:
> 
> >  static inline void zone_lock_init(struct zone *zone)
> >  {
> > @@ -12,26 +59,41 @@ static inline void zone_lock_init(struct zone *zone)
> >  
> >  #define zone_lock_irqsave(zone, flags)				\
> >  do {								\
> > +	bool success = true;					\
> > +								\
> > +	__zone_lock_trace_start_locking(zone);			\
> >  	spin_lock_irqsave(&(zone)->_lock, flags);		\
> > +	__zone_lock_trace_acquire_returned(zone, success);	\
> 
> Why the "success" variable and not just:
> 
> 	__zone_lock_trace_acquire_returned(zone, true);
> 
>  ?

Good point, passing true directly is cleaner. Happy to respin if needed.


* Re: [PATCH v4 2/5] mm: convert zone lock users to wrappers
From: Dmitry Ilvokhin @ 2026-03-02 15:22 UTC
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Rafael J. Wysocki, Pavel Machek, Len Brown,
	Brendan Jackman, Johannes Weiner, Zi Yan, Oscar Salvador,
	Qi Zheng, Shakeel Butt, linux-kernel, linux-mm,
	linux-trace-kernel, linux-pm, SeongJae Park
In-Reply-To: <7e93021d-53dd-4162-97e6-3bca1f46a0c6@kernel.org>

On Fri, Feb 27, 2026 at 09:39:11PM +0100, David Hildenbrand (Arm) wrote:
> On 2/27/26 17:00, Dmitry Ilvokhin wrote:
> > Replace direct zone lock acquire/release operations with the
> > newly introduced wrappers.
> > 
> > The changes are purely mechanical substitutions. No functional change
> > intended. Locking semantics and ordering remain unchanged.
> > 
> > The compaction path is left unchanged for now and will be
> > handled separately in the following patch due to additional
> > non-trivial modifications.
> > 
> > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Reviewed-by: SeongJae Park <sj@kernel.org>
> > ---
> 
> [...]
> 
> >  #ifdef CONFIG_COMPACTION
> > @@ -530,11 +531,14 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
> >   * Returns true if compaction should abort due to fatal signal pending.
> >   * Returns false when compaction can continue.
> >   */
> > -static bool compact_unlock_should_abort(spinlock_t *lock,
> > -		unsigned long flags, bool *locked, struct compact_control *cc)
> > +
> > +static bool compact_unlock_should_abort(struct zone *zone,
> > +					unsigned long flags,
> > +					bool *locked,
> > +					struct compact_control *cc)
> 
> We tend to use two-tabs on second parameter line; like the existing code
> did.
> 
> 
> Besides that
> 
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> 

Thanks, David. Noted. Appreciate the review and ack.

> -- 
> Cheers,
> 
> David


* [PATCH 6.6.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 15:29 UTC
  To: Sasha Levin
  Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260301013253.1692011-1-sashal@kernel.org>

[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6402fb3089d2..aac2a2c5c6c5 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1102,3 +1102,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d91e32aff5a1..a5ec2b024a22 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -140,6 +140,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6304238293ae..3b96952bd6ec 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1449,6 +1449,12 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
 		set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags);
 }
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1464,9 +1470,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.52.0




* [PATCH 6.1.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 15:36 UTC
  To: Sasha Levin
  Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260301014209.1703943-1-sashal@kernel.org>

[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6b431589305b 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1097,3 +1097,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d91e32aff5a1..a5ec2b024a22 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -140,6 +140,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 29c0e7c6a6d2..692c0fae8ce1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1441,6 +1441,12 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
 		set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags);
 }
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1456,9 +1462,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.52.0




* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: Andre Ramos @ 2026-03-02 15:38 UTC
  To: Johannes Weiner
  Cc: akpm, linux-mm, linux-kernel, linux-trace-kernel, david, rostedt
In-Reply-To: <aaWonCzCLayQDXOT@cmpxchg.org>

Thank you all for the review.

  David, Lorenzo — understood on all counts: RFC tag, patch series,
  and the maintainer entry. Noted for future submissions.

  Johannes — your pointer to PSI is appreciated. Given that it already
  covers this use case more correctly and without the reclaim path
  overhead, I'll look into that direction instead.

  André


* [PATCH 5.15.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 15:42 UTC
  To: Sasha Levin
  Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260301015033.1716584-1-sashal@kernel.org>

[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6b431589305b 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1097,3 +1097,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..3461199c4ec0 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4e6ada6a11c7..3bd85f043881 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1437,6 +1437,12 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
 		set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags);
 }
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1452,9 +1458,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.52.0



^ permalink raw reply related

* [PATCH 5.10.y] x86/uprobes: Fix XOL allocation failure for 32-bit tasks
From: Oleg Nesterov @ 2026-03-02 15:51 UTC (permalink / raw)
  To: Sasha Levin
  Cc: stable, Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users
In-Reply-To: <20260301020027.1726538-1-sashal@kernel.org>

[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ]

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 9f948b2d26f6..099ca674e3de 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1095,3 +1095,27 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..3461199c4ec0 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,7 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4f2a9fab8ae8..f3bc64c4fa78 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1438,6 +1438,12 @@ void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned lon
 		set_bit(MMF_RECALC_UPROBES, &vma->vm_mm->flags);
 }
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1453,9 +1459,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.52.0



^ permalink raw reply related

* Re: [PATCH bpf] ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
From: Alexei Starovoitov @ 2026-03-02 15:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Steven Rostedt, Alexei Starovoitov, Ihor Solodrai,
	Kumar Kartikeya Dwivedi, bpf, LKML, linux-trace-kernel,
	Daniel Borkmann, Andrii Nakryiko, Menglong Dong, Song Liu
In-Reply-To: <20260302081622.165713-1-jolsa@kernel.org>

On Mon, Mar 2, 2026 at 12:16 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Ihor and Kumar reported splat from ftrace_get_addr_curr [1], which happened
> because of the missing ftrace_lock in update_ftrace_direct_add/del functions
> allowing concurrent access to ftrace internals.
>
> The ftrace_update_ops function must be guarded by ftrace_lock, adding that.
>
> Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
> Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")
> Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/
> Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>

lgtm.

Steven,
should it land through ftrace tree?

^ permalink raw reply

* Re: [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Steven Rostedt @ 2026-03-02 15:59 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: axboe, mhiramat, mathieu.desnoyers, shinichiro.kawasaki,
	linux-block, linux-trace-kernel
In-Reply-To: <20260302002207.12165-1-kch@nvidia.com>

On Sun, 1 Mar 2026 16:22:07 -0800
Chaitanya Kulkarni <kch@nvidia.com> wrote:

> With this fix blktests for blktrace pass:
> 
>   blktests (master) # ./check blktrace
>   blktrace/001 (blktrace zone management command tracing)      [passed]
>       runtime  3.650s  ...  3.647s
>   blktrace/002 (blktrace ftrace corruption with sysfs trace)   [passed]
>       runtime  0.411s  ...  0.384s
> 
> Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply

* Re: [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Steven Rostedt @ 2026-03-02 16:08 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Jens Axboe, shinichiro.kawasaki@wdc.com,
	linux-block@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, mhiramat@kernel.org
In-Reply-To: <84f1e52c-7e70-4b98-8302-ca1f1e3db9fc@nvidia.com>

On Mon, 2 Mar 2026 08:41:58 +0000
Chaitanya Kulkarni <chaitanyak@nvidia.com> wrote:

> I totally failed to understand why this bug is appearing now
> rather than before.

I wonder if it is because it was never tested under PREEMPT_FULL, which
would be the only config that would trigger the warning, as
PREEMPT_VOLUNTARY does not allow preemption within rcu_read_lock().

Today, the default is the new PREEMPT_LAZY rather than PREEMPT_VOLUNTARY,
and PREEMPT_LAZY can preempt within rcu_read_lock(). It could be that this bug
existed since 2012 but was never triggered as most users use
PREEMPT_VOLUNTARY where it would not trigger. But if they used PREEMPT_FULL
back then, it may have.

Today we now use PREEMPT_LAZY where this does trigger.

-- Steve

^ permalink raw reply

* Re: [PATCH bpf] ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
From: Steven Rostedt @ 2026-03-02 16:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Ihor Solodrai, Kumar Kartikeya Dwivedi, bpf,
	linux-kernel, linux-trace-kernel, Daniel Borkmann,
	Andrii Nakryiko, Menglong Dong, Song Liu
In-Reply-To: <20260302081622.165713-1-jolsa@kernel.org>

On Mon,  2 Mar 2026 09:16:22 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> Ihor and Kumar reported splat from ftrace_get_addr_curr [1], which happened
> because of the missing ftrace_lock in update_ftrace_direct_add/del functions
> allowing concurrent access to ftrace internals.
> 
> The ftrace_update_ops function must be guarded by ftrace_lock, adding that.
> 
> Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
> Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")
> Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/
> Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

> ---
>  kernel/trace/ftrace.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 827fb9a0bf0d..8baf61c9be6d 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -6404,6 +6404,7 @@ int update_ftrace_direct_add(struct ftrace_ops *ops, struct ftrace_hash *hash)
>  			new_filter_hash = old_filter_hash;
>  		}
>  	} else {
> +		guard(mutex)(&ftrace_lock);
>  		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
>  		/*
>  		 * new_filter_hash is dup-ed, so we need to release it anyway,
> @@ -6530,6 +6531,7 @@ int update_ftrace_direct_del(struct ftrace_ops *ops, struct ftrace_hash *hash)
>  			ops->func_hash->filter_hash = NULL;
>  		}
>  	} else {
> +		guard(mutex)(&ftrace_lock);
>  		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
>  		/*
>  		 * new_filter_hash is dup-ed, so we need to release it anyway,


^ permalink raw reply

* Re: [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Jens Axboe @ 2026-03-02 16:16 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, Chaitanya Kulkarni
  Cc: shinichiro.kawasaki, linux-block, linux-trace-kernel
In-Reply-To: <20260302002207.12165-1-kch@nvidia.com>


On Sun, 01 Mar 2026 16:22:07 -0800, Chaitanya Kulkarni wrote:
> tracing_record_cmdline() internally uses __this_cpu_read() and
> __this_cpu_write() on the per-CPU variable trace_cmdline_save, and
> trace_save_cmdline() explicitly asserts preemption is disabled via
> lockdep_assert_preemption_disabled(). These operations are only safe
> when preemption is off, as they were designed to be called from the
> scheduler context (probe_wakeup_sched_switch() / probe_wakeup()).
> 
> [...]

Applied, thanks!

[1/1] blktrace: fix __this_cpu_read/write in preemptible context
      commit: da46b5dfef48658d03347cda21532bcdbb521e67

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [BUG] RCU stall / hung rcu_gp: process_srcu blocked in synchronize_rcu_normal triggered by perf trace teardown on 7.0.0-rc1
From: Steven Rostedt @ 2026-03-02 16:25 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Zw Tang, paulmck, peterz, mhiramat, jiangshanlai, mingo, acme,
	namhyung, rcu, linux-perf-users, linux-trace-kernel, linux-kernel,
	mathieu.desnoyers, josh, bigeasy, ast, boqun.feng, mark.rutland
In-Reply-To: <20260302133615.2304836-1-sashal@kernel.org>

On Mon,  2 Mar 2026 08:36:13 -0500
Sasha Levin <sashal@kernel.org> wrote:

> This response was AI-generated by bug-bot. The analysis may contain errors — please verify independently.
> 
> ## Bug Summary
> 
> This is an RCU stall and hung task deadlock on 7.0.0-rc1, triggered by perf trace teardown under perf interrupt storm conditions. The perf subsystem's tracepoint unregistration path now blocks on SRCU (tracepoint_srcu), which in turn blocks on RCU grace period completion, creating a cascading stall when RCU progress is delayed by perf NMI interrupt storms. Severity: system hang (multiple tasks blocked >143s, eventual complete stall).

Hmm, this analysis corresponds nicely to what I was thinking when looking
at the stack dumps, but it gives a bit more details than I would have.

> ## Root Cause Analysis
> 
> This is a regression introduced by commit a46023d5616ed ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"), which switched tracepoint read-side protection from preempt_disable()+RCU to SRCU-fast via DEFINE_SRCU_FAST(tracepoint_srcu).

Yes, as soon as I saw the report, I knew it had to do with this commit.

> 
> The root cause is a new coupling between SRCU grace period processing and RCU grace period completion that did not exist before. The deadlock chain is:
> 
> 1. The reproducer creates perf events using tracepoints, then closes them while generating heavy perf interrupt load. The perf NMI interrupt storms ("perf: interrupt took too long" messages escalating from 69ms to 336ms) consume most CPU time, starving RCU quiescent state detection.

The real bug is the NMI interrupt storms. The commit above only makes it
more of an issue, but that commit itself is not the bug.


> ## Suggested Actions
> 
> 1. Confirm the regression by testing with the parent commit a77cb6a867667 (immediately before a46023d5616ed). If the issue disappears, this confirms the SRCU-fast tracepoint switch as the cause.
> 
> 2. As a quick workaround, reverting a46023d5616ed (and its preparatory commits a77cb6a867667, f7d327654b886, 16718274ee75d if needed) should eliminate the deadlock, at the cost of losing preemptible BPF tracepoint support.

That is not the answer, as the above only made the current bug (interrupt
storms) visible.

> 
> 3. The fundamental issue is that process_srcu() for SRCU-fast structures calls synchronize_rcu() synchronously from workqueue context. Possible fixes include:
>    - Using an asynchronous mechanism (e.g., call_rcu() with a callback to resume SRCU GP processing) instead of blocking synchronize_rcu() within the SRCU state machine.
>    - Having srcu_readers_active_idx_check() use poll_state_synchronize_rcu() and defer retrying instead of blocking.
>    - Bounding the perf interrupt rate escalation to prevent the RCU stall in the first place (though this would only mask the underlying SRCU↔RCU coupling issue).
> 
> 4. If you can reproduce reliably, adding the following debug options would provide more information: CONFIG_RCU_TRACE=y, CONFIG_PROVE_RCU=y, and booting with rcutree.rcu_kick_kthreads=1 to see if kicking the RCU threads helps break the stall.


The real fix is to find a way to disable the perf interrupt storms *before*
unregistering the tracepoint.

-- Steve

^ permalink raw reply

* Re: [PATCH] tracing: Fix WARN_ON in tracing_buffers_mmap_close
From: Steven Rostedt @ 2026-03-02 16:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vincent Donnefort, Qing Wang, Masami Hiramatsu, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel, syzbot+3b5dd2030fe08afdf65d,
	linux-mm, Andrew Morton, Vlastimil Babka, David Hildenbrand
In-Reply-To: <e4deff21-2fb5-4f37-a7d3-ede5f69a4489@lucifer.local>

On Mon, 2 Mar 2026 12:13:24 +0000
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

Hi Lorenzo,

Thanks for looking into this.

> > But looking at the various flags, I see there's a VM_SPECIAL. I'm wondering
> > if that is what we should use?  
> 
> VM_SPECIAL is not a VMA flag, it's a bitmask of all the flags which cause us not
> to permit things like splitting/merging of VMAs (because we can't safely do
> them), i.e. that are one or more of:

Yep, I knew it wasn't a flag, and actually picked it because it looked to
have the flags we may have wanted.

> 
>         VM_IO - Memory-mapped I/O range.
> 
>     VM_PFNMAP - A mapping without struct folio's/page's backing them, e.g. perhaps a
>                 raw kernel mapping.
> 
>   VM_MIXEDMAP - A combination of page/folio-backed memory and/or PFN-backed memory.
> 
> VM_DONTEXPAND - Disallow expansion of memory in mremap().
> 
> You already set VM_DONTEXPAND so you get these semantics already.
> 
> Setting VM_IO just to trigger a failure case in madvise() feels like a hack? I
> guess it'd do the trick though, but you're not going to be able to reclaim that
> memory, and you might get some unexpected behaviour in code paths that assume
> VM_IO means it's memory-mapped I/O... (for instance GUP will stop working, if
> you need that).

Well, we don't reclaim that memory anyway.
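
To make the point about VM_SPECIAL concrete, here is a minimal user-space
sketch of the mask. The numeric flag values are written from memory of
include/linux/mm.h and should be treated as illustrative assumptions;
vma_is_special() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as recalled from include/linux/mm.h; illustrative only. */
#define VM_PFNMAP     0x00000400UL
#define VM_IO         0x00004000UL
#define VM_DONTCOPY   0x00020000UL
#define VM_DONTEXPAND 0x00040000UL
#define VM_MIXEDMAP   0x10000000UL

/* VM_SPECIAL is a mask of several flags, not a flag of its own. */
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

/* The "don't copy on fork" bit is separate from the VM_SPECIAL mask. */
static_assert((VM_SPECIAL & VM_DONTCOPY) == 0,
	      "VM_DONTCOPY is not part of VM_SPECIAL");

/* Hypothetical helper: true if any bit of the VM_SPECIAL mask is set. */
static bool vma_is_special(unsigned long vm_flags)
{
	return (vm_flags & VM_SPECIAL) != 0;
}
```

A mapping that only has VM_DONTEXPAND set already tests as "special" here,
which matches the remark that setting VM_DONTEXPAND gives these semantics.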

> 
> I'd take a step back and wonder why you are wanting to not allow copying on
> fork? Is this kernel-allocated memory? In which case you should set VM_MIXEDMAP
> or VM_PFNMAP as appropriate... If not and it has a folio etc. then it seems like
> strange semantics.
> 
> Are you really bothered also by users doing strange things? Maybe the solution
> is to tolerate a fork-copy even if it's broken? I presume something straight up
> breaks right now?

Yeah, right now the accounting gets screwed up as the mappings get out of
sync when it is forked.

> 
> Without more context, which I don't really have much time to acquire, it's
> hard to know what to advise.

Fair enough, let me explain everything then ;-)

This is a mapping of the ftrace ring buffer to user space. Until recently,
the only way user space could get access to the ftrace ring buffer was to
either read it, or splice() it to a file/pipe/whatever.

The way the ftrace ring buffer works is that it is made up of a bunch of
sub-buffers (must be multiple of PAGE_SIZE and usually is a single page).
There is one sub-buffer called the "reader-page" which writers never write
to (with an exception out of scope for understanding the mappings).

The "reader-page" is a sub-buffer that belongs to the reader. When the
reader is finished with it and wants to read more of the buffer, an operation
is performed to swap the current reader-page with one of the pages that the
writers have. The new page is now owned by the reader and writers will
leave it alone. This allows readers to do a zero copy splice of the data in
the ring buffer.
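
The swap described above can be sketched as a small user-space model. This
is only an illustration under simplifying assumptions: sub-buffers are
plain integer ids, there is no concurrency, and every name (ring_model,
ring_model_swap_reader, and so on) is hypothetical rather than taken from
the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define NR_WRITER_SUBBUFS 4

/* Hypothetical model: each sub-buffer is identified by a small integer id. */
struct ring_model {
	int reader_page;                /* sub-buffer owned by the reader */
	int writer[NR_WRITER_SUBBUFS];  /* sub-buffers the writers cycle through */
	size_t head;                    /* index of the oldest writer sub-buffer */
};

static void ring_model_init(struct ring_model *rb)
{
	rb->reader_page = 0;
	for (size_t i = 0; i < NR_WRITER_SUBBUFS; i++)
		rb->writer[i] = (int)i + 1;
	rb->head = 0;
}

/*
 * Swap the reader page with the oldest writer sub-buffer and return the id
 * of the sub-buffer the reader now owns. No data is copied; only ownership
 * changes hands.
 */
static int ring_model_swap_reader(struct ring_model *rb)
{
	int old_reader = rb->reader_page;

	assert(rb->head < NR_WRITER_SUBBUFS);
	rb->reader_page = rb->writer[rb->head];
	rb->writer[rb->head] = old_reader;  /* old reader page rejoins the ring */
	rb->head = (rb->head + 1) % NR_WRITER_SUBBUFS;
	return rb->reader_page;
}
```

After ring_model_init(), the first call to ring_model_swap_reader() hands
sub-buffer 1 to the reader and puts the old reader page (id 0) into the
slot it vacated, so the writers will reuse it later.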

Now we added a feature to memory map this buffer to user space. The
reader-page and writer sub-buffers are all mapped read-only into the user's
memory address.

Another page is mapped called the "meta page" which tells user space how to
read the buffer (which sub-buffer is the current reader-page and the order
of the write sub-buffers).

The reader-page is what user space will read directly, and when it is done,
it does an ioctl() on the file descriptor for the buffer:

  /sys/kernel/tracing/per_cpu/cpu*/trace_pipe_raw

One command of the ioctl() will tell the kernel to swap the reader-page with
a writer sub-buffer. The meta page is updated and the user space
application can read that.

Now the meta page is unique per ring buffer and not per process. If there's
a fork, any change to the meta page will affect all processes that have
this mapped.

If two processes map the same buffer, one process will see any updates in
its meta page that another process does to it.

Now there is nothing wrong with doing that, except the user space processes
will likely get confused. And we currently allow two separate tasks to mmap
it at the same time (maybe we shouldn't have!).

What we didn't allow was forking, as the code didn't update the proper
accounting. It needs to know that the buffer is mapped because it handles
splice differently. As the pages are mapped to user space, the kernel can't
just allow splice to steal a page and send it off to whatever pipe. Instead
it makes a copy of the page (basically killing the performance splice()
gives it in the first place).
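
A rough model of how that accounting can go out of sync on fork, assuming
a single per-buffer counter of user mappings. All names here (buf_model,
model_fork, and so on) are hypothetical; this is not the kernel's code,
only a sketch of the failure mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-buffer bookkeeping; not the kernel's actual structure. */
struct buf_model {
	int user_mapped;  /* live user-space mappings the buffer knows about */
};

/* A mapping is created through mmap(). */
static void model_mmap(struct buf_model *b)
{
	b->user_mapped++;
}

/*
 * A mapping is duplicated on fork. With notify == false the buffer is
 * never told about the copy, which is the broken case.
 */
static void model_fork(struct buf_model *b, bool notify)
{
	if (notify)
		b->user_mapped++;
}

/* A mapping goes away; returns false if the count was already zero. */
static bool model_close(struct buf_model *b)
{
	assert(b->user_mapped >= 0);
	if (b->user_mapped == 0)
		return false;  /* accounting is out of sync */
	b->user_mapped--;
	return true;
}

/* In this model, splice() may give pages away only when nothing is mapped. */
static bool model_splice_is_zero_copy(const struct buf_model *b)
{
	return b->user_mapped == 0;
}
```

In the model, mmap followed by an unnotified fork leaves the counter at 1
for two mappings: the first close drops it to zero while a mapping still
exists, and the second close finds nothing left to release.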

We originally added the DONTFORK so that we didn't need to handle the fork
case, but I'm guessing that you are suggesting that we should do that
instead of preventing it from being duplicated on fork. Am I correct?

-- Steve

^ permalink raw reply

* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: David Hildenbrand (Arm) @ 2026-03-02 17:00 UTC (permalink / raw)
  To: Andre Ramos, akpm, hannes
  Cc: linux-mm, linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <CALXtAv3u1hgLkBEbEgR3=r_iz3=KrnHB8B-=tg8Q3CEOWAPFiA@mail.gmail.com>

> +/**
> + * ampress_notify - dispatch a memory pressure event to all subscribers
> + * @urgency:      AMPRESS_URGENCY_* level
> + * @numa_node:    NUMA node of the pressure source (0xFF = system-wide)
> + * @requested_kb: Suggested reclaim target in KiB (0 = unspecified)
> + *
> + * Must be safe to call from any context including IRQ / reclaim paths:
> + *   - no sleeping allocations
> + *   - only spin_lock_irqsave and wake_up_interruptible
> + */
> +void ampress_notify(int urgency, int numa_node, unsigned long requested_kb)
> +{
> +    struct ampress_subscriber *sub;
> +    unsigned long rflags, flags;
> +    int notified = 0;
> +
> +    /*
> +     * Use irqsave variants: ampress_notify() may be called from a context
> +     * where interrupts are disabled (e.g. a future direct-reclaim hook).
> +     */
> +    read_lock_irqsave(&ampress_subscribers_lock, rflags);
> +    list_for_each_entry(sub, &ampress_subscribers, list) {
> +        if (!sub->subscribed)
> +            continue;
> +
> +        /*
> +         * Check if the urgency meets or exceeds the subscriber's
> +         * configured threshold for this urgency level.
> +         *
> +         * Default config has all thresholds at 0, meaning any
> +         * urgency >= 0 passes — i.e. everything is delivered.
> +         */
> +        spin_lock_irqsave(&sub->lock, flags);
> +        sub->pending_event.urgency     = (__u8)urgency;
> +        sub->pending_event.numa_node   = (__u8)(numa_node & 0xFF);
> +        sub->pending_event.reserved    = 0;
> +        sub->pending_event.requested_kb =
> +            (__u32)min_t(unsigned long, requested_kb, U32_MAX);

I didn't take a detailed look, but this one confused me while skimming
over some bits: Which value does exposing requested_kb really have if
there are multiple subscribers to notify? Doesn't quite make sense to
even expose that.

Assume it's 2 MiB and you notify 1024 processes. Are you suddenly
getting 2 GiB back?
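
The arithmetic behind that question, as a small sketch. It assumes the
worst case where every subscriber frees the full requested amount;
worst_case_reclaim_kb() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case total, in KiB, if every subscriber frees the full request. */
static uint64_t worst_case_reclaim_kb(uint32_t requested_kb,
				      uint32_t nr_subscribers)
{
	uint64_t total = (uint64_t)requested_kb * nr_subscribers;

	/* The 64-bit product of two 32-bit values cannot overflow. */
	assert(nr_subscribers == 0 || total / nr_subscribers == requested_kb);
	return total;
}
```

With requested_kb = 2048 (2 MiB) and 1024 subscribers, this yields
2097152 KiB, i.e. 2 GiB.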

On a bigger picture, I don't think the whole idea of having multiple
subscribers is really thought through.

What are some arbitrary processes supposed to do if they receive one of
the notifications? How much memory are they supposed to reclaim (given
that requested_kb is questionable)? How are they supposed to reclaim the
memory (MADV_DONTNEED? MADV_PAGEOUT? or what else?). There is a lot of
information missing from the patch description to understand the bigger
picture.

If apps have droppable caches, a better solution might be to use
something like MAP_DROPPABLE where possible, still leaving the kernel in
charge of when and how much memory to reclaim.

-- 
Cheers,

David

^ permalink raw reply

* Re: [BUG] kprobes: WARNING in __arm_kprobe_ftrace when kprobe-ftrace arming fails with -ENOMEM under fault injection
From: Sasha Levin @ 2026-03-02 17:35 UTC (permalink / raw)
  To: Zw Tang, Naveen N Rao, Masami Hiramatsu, Steven Rostedt
  Cc: Sasha Levin, linux-kernel, linux-trace-kernel, linux-perf-users,
	Arnaldo Carvalho de Melo, David S . Miller
In-Reply-To: <CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com>

https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/

## 1. Bug Summary

A WARNING is triggered in `__arm_kprobe_ftrace()` at `kernel/kprobes.c:1147` when fault injection causes `ftrace_set_filter_ip()` to return `-ENOMEM` during kprobe arming via `perf_event_open()`. This is a false-positive warning — the error path itself is correct and the error propagates cleanly to userspace, but the `WARN_ONCE()` macro fires a kernel warning splat that is inappropriate for a recoverable allocation failure. The affected subsystem is kprobes/ftrace. Severity: warning only (no crash, hang, or data corruption).

## 2. Stack Trace Analysis

```
WARNING: kernel/kprobes.c:1147 at arm_kprobe+0x563/0x620, CPU#0
Call Trace:
 <TASK>
 enable_kprobe+0x1fc/0x2c0
 enable_trace_kprobe+0x227/0x4b0
 kprobe_register+0x84/0xc0
 perf_trace_event_init+0x527/0xa20
 perf_kprobe_init+0x156/0x200
 perf_kprobe_event_init+0x101/0x1c0
 perf_try_init_event+0x145/0xa10
 perf_event_alloc+0x1f91/0x5390
 __do_sys_perf_event_open+0x557/0x2d50
 do_syscall_64+0x129/0x1160
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
```

The crash point is `__arm_kprobe_ftrace()` at `kernel/kprobes.c:1147`, inlined into `arm_kprobe()`. The calling chain is process context: `perf_event_open()` syscall -> `perf_kprobe_event_init()` -> `enable_trace_kprobe()` -> `enable_kprobe()` -> `arm_kprobe()` -> `__arm_kprobe_ftrace()`. R12 holds `0xfffffff4` which is `-12` (`-ENOMEM`), confirming the allocation failure injected by fault injection.
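
The register decoding above can be checked with a short user-space sketch.
It assumes the return value sits in the low 32 bits of the register;
reg_to_errno() is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Interpret the low 32 bits of a register as a signed kernel return value
 * and return the positive errno it encodes.
 */
static int reg_to_errno(uint64_t reg)
{
	uint64_t low = reg & 0xffffffffULL;
	int64_t ret = (int64_t)low;

	if (low & 0x80000000ULL)
		ret -= 0x100000000LL;  /* sign-extend from 32 bits */
	assert(ret < 0);               /* only meaningful for error returns */
	return (int)-ret;
}
```

Passing 0xfffffff4 gives 12, which is ENOMEM in the Linux errno numbering.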

## 3. Root Cause Analysis

The root cause is an overly aggressive `WARN_ONCE()` in `__arm_kprobe_ftrace()` at `kernel/kprobes.c:1147`:

```c
ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
if (WARN_ONCE(ret < 0, "Failed to arm kprobe-ftrace at %pS (error %d)\n", p->addr, ret))
    return ret;
```

Prior to commit 9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes"), this was a simple `pr_debug()`. That commit promoted it to `WARN_ONCE()` as part of a treewide message cleanup, under the rationale that failures here indicate unexpected conditions. However, `ftrace_set_filter_ip()` calls into memory allocation paths (e.g., `ftrace_hash_move_and_update_ops()` -> `__ftrace_hash_update_ipmodify()` or `ftrace_hash` allocation), and those allocations can legitimately fail under memory pressure or fault injection.

The error handling is actually correct — the `-ENOMEM` propagates back through `arm_kprobe()` -> `enable_kprobe()` and ultimately causes the `perf_event_open()` syscall to return an error to userspace. The only problem is the spurious `WARN_ONCE()` which triggers a kernel warning splat and stack trace for what is a recoverable, non-buggy situation.

The same issue also applies to the `WARN()` on line 1152 for `register_ftrace_function()`, which can also fail with `-ENOMEM`.

## 4. Affected Versions

This issue was introduced by commit 9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes"), which first appeared in v5.16-rc1. All kernel versions from v5.16 onward are affected, including the reporter's v7.0.0-rc1.

This is a regression from v5.15, where the same failure path used `pr_debug()` and did not emit any warning.

## 5. Relevant Commits and Fixes

Introducing commit:
  9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes")
  Author: Masami Hiramatsu <mhiramat@kernel.org>
  Merged in v5.16-rc1

This commit changed `pr_debug()` to `WARN_ONCE()` in `__arm_kprobe_ftrace()` for the `ftrace_set_filter_ip()` failure path, and changed a `pr_debug()` to `WARN()` for the `register_ftrace_function()` failure path.

No fix for this issue exists in mainline or stable as of the reporter's kernel version.

The suggested fix is to downgrade both `WARN_ONCE()` (line 1147) and `WARN()` (line 1152) in `__arm_kprobe_ftrace()` back to `pr_warn_once()` / `pr_warn()` respectively. This preserves the improved error messages from 9c89bb8e3272 while avoiding spurious warning splats on recoverable failures. The error code is already propagated correctly to the caller.

## 6. Prior Discussions

No prior reports of this specific issue were found on lore.kernel.org. No related mailing list discussions or proposed patches addressing the WARN severity in `__arm_kprobe_ftrace()` were found.

## 7. Suggested Actions

1. The fix is to downgrade the WARN_ONCE/WARN in __arm_kprobe_ftrace()
   (kernel/kprobes.c lines 1147 and 1152) to pr_warn_once/pr_warn.
   Specifically:

   - Line 1147: Change WARN_ONCE(ret < 0, ...) to a simple
     if (ret < 0) { pr_warn_once(...); return ret; }

   - Line 1152: Change WARN(ret < 0, ...) to a simple
     if (ret < 0) { pr_warn(...); goto err_ftrace; }

2. This is a low-severity issue that only manifests with fault injection
   enabled. No data corruption or crash occurs — the error is correctly
   propagated to userspace. The warning is cosmetic but noisy and can
   cause false-positive syzbot/syzkaller reports.

^ permalink raw reply

* Re: [BUG] RCU stall / hung rcu_gp: process_srcu blocked in synchronize_rcu_normal triggered by perf trace teardown on 7.0.0-rc1
From: Sasha Levin @ 2026-03-02 17:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Zw Tang, paulmck, peterz, mhiramat, jiangshanlai, mingo, acme,
	namhyung, rcu, linux-perf-users, linux-trace-kernel, linux-kernel,
	mathieu.desnoyers, josh, bigeasy, ast, boqun.feng, mark.rutland
In-Reply-To: <20260302112545.51f7e100@gandalf.local.home>

On Mon, Mar 02, 2026 at 11:25:45AM -0500, Steven Rostedt wrote:
>On Mon,  2 Mar 2026 08:36:13 -0500
>Sasha Levin <sashal@kernel.org> wrote:
>
>> This response was AI-generated by bug-bot. The analysis may contain errors — please verify independently.
>>
>> ## Bug Summary
>>
>> This is an RCU stall and hung task deadlock on 7.0.0-rc1, triggered by perf trace teardown under perf interrupt storm conditions. The perf subsystem's tracepoint unregistration path now blocks on SRCU (tracepoint_srcu), which in turn blocks on RCU grace period completion, creating a cascading stall when RCU progress is delayed by perf NMI interrupt storms. Severity: system hang (multiple tasks blocked >143s, eventual complete stall).
>
>Hmm, this analysis corresponds nicely to what I was thinking when looking
>at the stack dumps, but it gives a bit more details than I would have.

Thanks for the feedback!

I'll finish fine-tuning this script, and publish it in a day or two :)

-- 
Thanks,
Sasha

^ permalink raw reply
