Linux Trace Kernel
* Re: [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: David Hildenbrand (Arm) @ 2026-03-02  8:52 UTC (permalink / raw)
  To: Andre Ramos, akpm, hannes
  Cc: linux-mm, linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <CALXtAv3u1hgLkBEbEgR3=r_iz3=KrnHB8B-=tg8Q3CEOWAPFiA@mail.gmail.com>

On 3/2/26 04:45, Andre Ramos wrote:
> Introduce /dev/ampress, a bidirectional fd-based interface for
> cooperative memory reclaim between the kernel and userspace.

I'm very sure this should be tagged as RFC.

> 
> Userspace processes open /dev/ampress and block on read() to receive
> struct ampress_event notifications carrying a graduated urgency level
> (LOW/MEDIUM/HIGH/FATAL), the NUMA node of the pressure source, and a
> suggested reclaim target in KiB. After freeing memory the process
> issues AMPRESS_IOC_ACK to close the feedback loop.
> 
> The feature hooks into balance_pgdat() in mm/vmscan.c, mapping the
> kswapd scan priority to urgency bands:
>   priority 10-12 -> LOW
>   priority  7-9  -> MEDIUM
>   priority  4-6  -> HIGH
>   priority  1-3  -> FATAL
> 
> ampress_notify() is IRQ-safe (read_lock_irqsave + spin_lock_irqsave,
> no allocations) so it can be called from any reclaim context.
> Per-subscriber events overwrite without queuing to prevent unbounded
> backlog. A debugfs trigger at /sys/kernel/debug/ampress/inject allows
> testing without real memory pressure.


[...]

> 
> +ADAPTIVE MEMORY PRESSURE SIGNALING (AMPRESS)
> +M:    Darabat <playbadly1@gmail.com>
> +L:    linux-mm@kvack.org
> +S:    Maintained
> +F:    include/linux/ampress.h
> +F:    include/trace/events/ampress.h
> +F:    include/uapi/linux/ampress.h
> +F:    mm/ampress.c
> +F:    mm/ampress_test.c
> +F:    tools/testing/ampress/

We generally don't make new kernel contributors MM maintainers.

But what sticks out more is the inconsistency between your name+mail and
"Darabat <playbadly1@gmail.com>".

-- 
Cheers,

David


* Re: [PATCH] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar @ 2026-03-02  8:48 UTC (permalink / raw)
  To: Crystal Wood
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, John Kacur,
	Luis Goncalves, Costa Shulyupin, Wander Lairson Costa, LKML,
	linux-trace-kernel
In-Reply-To: <d7be6fcb6540b3734019fe82ff8e7f4ff49220c2.camel@redhat.com>

On Sat, Feb 28, 2026 at 0:50, Crystal Wood <crwood@redhat.com> wrote:
> > Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> > corresponding setting osnoise/timerlat_align_us.
> >
> > This option sets the alignment of wakeup times between different
> > timerlat threads, similarly to cyclictest's -A/--aligned option. If
> > TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> > records its first wake-up time. Each following thread sets its first
> > wake-up time to a fixed offset from the recorded time, and increments
> > it by the same offset.
>
> Why not just set the initial timer expiration to be
> "period + cpu * align_us"?  Then you wouldn't need any interaction
> between CPUs.

"period + cpu * align_us" wouldn't quite do it, for two reasons:

1. The wake-up timers are set to absolute time, and are incremented by
"period" (once or multiple times, if the timer is significantly
delayed) each cycle. What can be done as an alternative to what v1
does is this: record the current time when starting the timerlat
tracer (I need to reset align_next to zero anyway even with the v1
design, that is a bug in the patch), and increment from that.

2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
run on CPUs 0 and 100 with alignment 10, suddenly, the space between
the threads becomes 1000us, which is equivalent to 0us. I would need
to go through the cpuset and assign numbers from 0 to n to each CPU.
That would guarantee a fixed spacing of the threads independent of
when the threads wake up in the first cycle (unlike the v1 design),
but it would make the implementation more complex, since I would have
to store the numbers.

If I implemented both of those ideas, the interaction between the CPUs
can indeed be gotten rid of. I'm not sure if it is a better solution,
though. Another motivation of recording the first thread wake-up was
that when using user threads, the first thread might be created some
time after the tracer is enabled, and I did not want to have a large
gap that would have to be corrected by the while loop at the end of
wait_next_period().
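
To make point 2 above concrete, here is a small userspace sketch (not
kernel code; phase_us() and phase_gap_us() are hypothetical helper names)
that compares the two numbering schemes. Since wake-ups repeat every
period, only the offset modulo the period matters.

```c
#include <stdint.h>

/* Phase of a thread's wake-up within one period, in microseconds. */
static uint64_t phase_us(uint64_t id, uint64_t align_us, uint64_t period_us)
{
    return (id * align_us) % period_us;
}

/*
 * Gap between the wake-up phases of two threads. The IDs can be raw CPU
 * numbers or positions in the cpuset (0..n-1), depending on the scheme.
 */
static uint64_t phase_gap_us(uint64_t id_a, uint64_t id_b,
                             uint64_t align_us, uint64_t period_us)
{
    uint64_t a = phase_us(id_a, align_us, period_us);
    uint64_t b = phase_us(id_b, align_us, period_us);

    return a > b ? a - b : b - a;
}
```

With a 1000us period and 10us alignment, phase_gap_us(0, 100, 10, 1000)
is 0 when CPUs 0 and 100 are used as IDs directly, while
phase_gap_us(0, 1, 10, 1000) is 10 when the same two CPUs are numbered
0 and 1.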

>
> >  kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
>
> Documentation needs to be updated as well.
>
> Should mention that updating align_us while the timer is running won't
> take effect immediately (unlike period, which does).
>

Good idea, thanks! In general, I'm not expecting the user to change
timerlat parameters during a measurement - but it is supported, and
should be documented.

> > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > index dee610e465b9..df1d4529d226 100644
> > --- a/kernel/trace/trace_osnoise.c
> > +++ b/kernel/trace/trace_osnoise.c
> > @@ -58,6 +58,7 @@ enum osnoise_options_index {
> >       OSN_PANIC_ON_STOP,
> >       OSN_PREEMPT_DISABLE,
> >       OSN_IRQ_DISABLE,
> > +     OSN_TIMERLAT_ALIGN,
> >       OSN_MAX
> >  };
> >
> > @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
> >                                                       "OSNOISE_WORKLOAD",
> >                                                       "PANIC_ON_STOP",
> >                                                       "OSNOISE_PREEMPT_DISABLE",
> > -                                                     "OSNOISE_IRQ_DISABLE" };
> > +                                                     "OSNOISE_IRQ_DISABLE",
> > +                                                     "TIMERLAT_ALIGN" };
>
> Do we really need a flag for this, or can we just interpret a non-zero
> align_us value as enabling the feature?
>

Yes, we need a flag for this, because a zero alignment is a common use case.

I used it in cyclictest to measure the overhead of a large number of
threads waking up at the same time. Similarly, a non-zero alignment
will get rid of most of that overhead. Without alignment set, the
thread wake-ups offsets are semi-random, depending on how the threads
wake up, which might lead to inconsistent results where one run has
good numbers and another run bad numbers, since the alignment is
determined in the first cycle.

For example, here are some bpftrace numbers (from the same command as
in the original message) with wake-ups with timerlat_align_us = 0:

@time[12]: 1
@time[11]: 1
@time[13]: 1
@time[10]: 3
@time[16]: 6
@time[19]: 6
@time[15]: 6
@time[18]: 7
@time[14]: 7
@time[17]: 71
@time[20]: 997

> > @@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
> >  {
> >       ktime_t next_abs_period, now;
> >       u64 rel_period = osnoise_data.timerlat_period * 1000;
> > +     static atomic64_t align_next;
>
> How will this get reset if the tracer is stopped and restarted?
>

It won't; I forgot to reset it. See my comment in my earlier reply to Steve.

> >       now = hrtimer_cb_get_time(&tlat->timer);
> >       next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> > @@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
> >        */
> >       tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> >
> > +     if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
> > +         && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
> > +             /*
> > +              * Align thread in first cycle on each CPU to the set alignment.
> > +              */
> > +             tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
> > +                     &align_next);
> > +             tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
> > +             next_abs_period = ns_to_ktime(tlat->abs_period);
> > +     }
>
> I'm already unclear about the existing purpose of next_abs_period, but
> if it has any use at all shouldn't it be to avoid writing intermediate
> values like this back to tlat?
>

next_abs_period is basically just the ktime_t variant of
tlat->abs_period for local calculations of the next period inside
wait_next_period(). Its only purpose is the ktime_compare() call that
increments tlat->abs_period by the period until it lands into the
future, if it happens to be in the past. This is necessary to do for
both a regular cycle (which might take long due to noise) and the
first cycle with alignment (because the other thread's first wake up
might be late), so it has to be set in the new code as well,
otherwise, the while loop won't see the time is in the past.
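
As a rough userspace illustration of that catch-up loop (plain integers
in nanoseconds instead of ktime_t; next_wakeup_ns() is a hypothetical
name, not a kernel function, and rel_period is assumed to be non-zero):

```c
#include <stdint.h>

/*
 * Advance abs_period by one period, then keep advancing until it is no
 * longer in the past relative to now. All values are in nanoseconds.
 */
static uint64_t next_wakeup_ns(uint64_t abs_period, uint64_t rel_period,
                               uint64_t now)
{
    abs_period += rel_period;

    /* Same condition as ktime_compare(now, next) > 0. */
    while (now > abs_period)
        abs_period += rel_period;

    return abs_period;
}
```

A thread that last woke at 1000 with a period of 1000 and is only now at
5500 would get 6000 as its next wake-up, skipping the activations it
already missed.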

I agree that this part of the code is confusing. There is also a
field, timerlat_variables.rel_period (tlat->rel_period), that is not
used anywhere, since the relative period is taken from
osnoise_data.timerlat_period. Something like this would be easier to
read and comprehend, IMHO:

/*
 * wait_next_period - Wait for the next period for timerlat
 */
static int wait_next_period(struct timerlat_variables *tlat)
{
    ktime_t now;
    u64 rel_period = osnoise_data.timerlat_period * 1000;

    now = hrtimer_cb_get_time(&tlat->timer);

    /*
     * Set the next abs_period.
     */
    tlat->abs_period += rel_period;

    /*
     * If the new abs_period is in the past, skip the activation.
     */
    while (ktime_compare(now, ns_to_ktime(tlat->abs_period)) > 0)
        tlat->abs_period += rel_period;

    set_current_state(TASK_INTERRUPTIBLE);

    hrtimer_start(&tlat->timer, ns_to_ktime(tlat->abs_period),
                  HRTIMER_MODE_ABS_PINNED_HARD);
    schedule();
    return 1;
}

(Excluding the changes from this patch.) What do you think?

Tomas



* Re: [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Chaitanya Kulkarni @ 2026-03-02  8:41 UTC (permalink / raw)
  To: Jens Axboe, rostedt@goodmis.org
  Cc: shinichiro.kawasaki@wdc.com, linux-block@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	mathieu.desnoyers@efficios.com, mhiramat@kernel.org,
	Chaitanya Kulkarni
In-Reply-To: <CAKb3OG_1yUUhLK8uHUxrYtnpph8jNr_j1kjivEWQ9VtkP-fRpQ@mail.gmail.com>

On 3/1/26 22:02, Jens Axboe wrote:
> On 3/1/26 5:22 PM, Chaitanya Kulkarni wrote:
>> Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")
> I don't understand, this dates back to 2012?
>
> --
> Jens Axboe

Commit c71a896154119 ("blktrace: add ftrace plugin") first added the
tracing_record_cmdline() definition.

Then commit 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")
updated tracing_record_cmdline() to use __this_cpu_read() and
__this_cpu_write().

When that __this_cpu_read() is reached in process context through the call
chain starting at blk_add_trace(), it results in this splat:

run blktests blktrace/002 at 2026-02-25 22:24:33
null_blk: disk nullb1 created

BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516

caller is tracing_record_cmdline+0x10/0x40
CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
tracing_record_cmdline+0x10/0x40
  __blk_add_trace+0x307/0x5d0
? lock_acquire+0xe0/0x300
? iov_iter_extract_pages+0x101/0xa30
blk_add_trace_bio+0x106/0x

[...]

Hence, wasn't this bug introduced by 7ffbd48d5cab, when __this_cpu_read()
was added to the path from blk_add_trace()?

I don't understand why this bug shows up only now and not earlier.

-ck

Other reference commits:

Commit 2cc621fd2e9b8
("tracing: Move saved_cmdline code into trace_sched_switch.c")
moved the saved_cmdline code to a new file.

Commit c0a581d7126c0
("tracing: Disable interrupt or preemption before acquiring arch_spinlock_t")
added lockdep_assert_preemption_disabled() in trace_save_cmdline().




* [PATCH bpf] ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
From: Jiri Olsa @ 2026-03-02  8:16 UTC (permalink / raw)
  To: Steven Rostedt, Alexei Starovoitov
  Cc: Ihor Solodrai, Kumar Kartikeya Dwivedi, bpf, linux-kernel,
	linux-trace-kernel, Daniel Borkmann, Andrii Nakryiko,
	Menglong Dong, Song Liu

Ihor and Kumar reported a splat from ftrace_get_addr_curr [1]. It happened
because the update_ftrace_direct_add/del functions do not take ftrace_lock,
which allows concurrent access to ftrace internals.

The ftrace_update_ops function must be called with ftrace_lock held, so
take the lock around it.

Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/
Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/ftrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 827fb9a0bf0d..8baf61c9be6d 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6404,6 +6404,7 @@ int update_ftrace_direct_add(struct ftrace_ops *ops, struct ftrace_hash *hash)
 			new_filter_hash = old_filter_hash;
 		}
 	} else {
+		guard(mutex)(&ftrace_lock);
 		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
 		/*
 		 * new_filter_hash is dup-ed, so we need to release it anyway,
@@ -6530,6 +6531,7 @@ int update_ftrace_direct_del(struct ftrace_ops *ops, struct ftrace_hash *hash)
 			ops->func_hash->filter_hash = NULL;
 		}
 	} else {
+		guard(mutex)(&ftrace_lock);
 		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
 		/*
 		 * new_filter_hash is dup-ed, so we need to release it anyway,
-- 
2.53.0
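
For readers unfamiliar with guard(mutex)(): it takes the lock and drops
it automatically when the enclosing scope ends, which the kernel builds
on the compiler's cleanup attribute. The following is only a userspace
sketch of that idea, using a toy flag instead of a real mutex. All
demo_* names are hypothetical, and the attribute is a GCC/Clang
extension.

```c
/* Toy lock: a flag is enough to show when the release happens. */
struct demo_lock {
    int held;
};

static struct demo_lock demo_lock;

static struct demo_lock *demo_acquire(struct demo_lock *lock)
{
    lock->held = 1;
    return lock;
}

/* Cleanup handler: receives a pointer to the guard variable. */
static void demo_release(struct demo_lock **guard)
{
    (*guard)->held = 0;
}

/* Hypothetical stand-in for guard(mutex)(&lock). */
#define DEMO_GUARD(lock) \
    struct demo_lock *demo_guard __attribute__((cleanup(demo_release))) = \
        demo_acquire(lock)

/* Returns whether the lock was held while the "update" ran. */
static int demo_update(void)
{
    int held_during_update;

    {
        DEMO_GUARD(&demo_lock);
        held_during_update = demo_lock.held;   /* the guarded call */
    }   /* demo_release() runs here, when the block ends */

    return held_during_update;
}
```

Because the release is tied to the end of the block, the guard form
covers everything from the declaration to the closing brace, with no
explicit unlock call to forget on an early exit.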



* Re: [PATCHv6 bpf-next 9/9] bpf,x86: Use single ftrace_ops for direct calls
From: Jiri Olsa @ 2026-03-02  8:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Ihor Solodrai, Florent Revest, Mark Rutland, bpf,
	linux-kernel, linux-trace-kernel, linux-arm-kernel,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Menglong Dong, Song Liu, Kumar Kartikeya Dwivedi
In-Reply-To: <20260228153921.19cd42a6@fedora>

On Sat, Feb 28, 2026 at 03:39:21PM -0500, Steven Rostedt wrote:
> On Fri, 27 Feb 2026 22:24:37 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index 827fb9a0bf0d..e333749a5896 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -6404,7 +6404,9 @@ int update_ftrace_direct_add(struct ftrace_ops *ops, struct ftrace_hash *hash)
> >  			new_filter_hash = old_filter_hash;
> >  		}
> >  	} else {
> 
> As this looks to fix the issue, just add:
> 
> 		guard(mutex)(&ftrace_lock);
> 
> > +		mutex_lock(&ftrace_lock);
> >  		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
> > +		mutex_unlock(&ftrace_lock);
> >  		/*
> >  		 * new_filter_hash is dup-ed, so we need to release it anyway,
> >  		 * old_filter_hash either stays on error or is already released
> > @@ -6530,7 +6532,9 @@ int update_ftrace_direct_del(struct ftrace_ops *ops, struct ftrace_hash *hash)
> >  			ops->func_hash->filter_hash = NULL;
> >  		}
> >  	} else {
> 
> And here too.
> 
> As there's nothing after the comment and before the end of the block.

ok, will do.. the original changes:

  05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function")
  8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function")

went through bpf tree, so I'll send the fix the same way,
please let me know otherwise

thanks,
jirka


> 
> -- Steve
> 
> > +		mutex_lock(&ftrace_lock);
> >  		err = ftrace_update_ops(ops, new_filter_hash, EMPTY_HASH);
> > +		mutex_unlock(&ftrace_lock);
> >  		/*
> >  		 * new_filter_hash is dup-ed, so we need to release it anyway,
> >  		 * old_filter_hash either stays on error or is already released
> 
> 
> 
> -- Steve


* Re: [PATCH] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar @ 2026-03-02  7:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, John Kacur, Luis Goncalves,
	Crystal Wood, Costa Shulyupin, Wander Lairson Costa, LKML,
	linux-trace-kernel
In-Reply-To: <20260227105207.01473471@gandalf.local.home>

On Fri, Feb 27, 2026 at 16:51, Steven Rostedt <rostedt@goodmis.org> wrote:
> > Example:
> >
> > osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
> > set to 50. There are four threads, on CPUs 1 to 4.
>
> Is it set to 50 or 20?
>

That is a typo, good catch.

>
> So the first one here sets 'align_next' and all others fall into this path.
>
> As 'align_next' is a static variable for this function, what happens if you
> run timerlat a second time with different values?
>

That is an oversight. It worked during my testing, since timerlat took
the align_next of the previous run, and since it was far in the past,
it incremented it to be in the future, and everything worked. But that
is not the intended behavior (and will be slow if a lot of time passes
between the timerlat runs, since we increment it multiple times by the
period).

It should be a global variable reset with each run. I got too excited
about the patch and forgot this is not RTLA where static variables get
reset with each run (unless I want the user to reboot after each run).
I will fix it.
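
A minimal userspace sketch of the difference, assuming a start hook that
runs once per tracer run (demo_tracer_start() and the other names here
are hypothetical, not the actual osnoise code):

```c
#include <stdint.h>

/* Function-local static: keeps its value across "runs" (the v1 bug). */
static uint64_t first_wakeup_static(uint64_t now)
{
    static uint64_t align_next;

    if (!align_next)
        align_next = now;
    return align_next;
}

/* File-scope variable that the start path resets explicitly. */
static uint64_t align_next_global;

static void demo_tracer_start(void)
{
    align_next_global = 0;
}

static uint64_t first_wakeup_global(uint64_t now)
{
    if (!align_next_global)
        align_next_global = now;
    return align_next_global;
}
```

Calling first_wakeup_static() at time 100 and again, in a later "run",
at time 900 returns 100 both times. With demo_tracer_start() called
before each run, first_wakeup_global() returns 100 and then 900.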

Tomas



* Re: [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Jens Axboe @ 2026-03-02  6:02 UTC (permalink / raw)
  To: Chaitanya Kulkarni, rostedt, mhiramat, mathieu.desnoyers
  Cc: shinichiro.kawasaki, linux-block, linux-trace-kernel
In-Reply-To: <20260302002207.12165-1-kch@nvidia.com>

On 3/1/26 5:22 PM, Chaitanya Kulkarni wrote:
> Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")

I don't understand, this dates back to 2012?

--
Jens Axboe


* [PATCH] mm: add Adaptive Memory Pressure Signaling (AMPRESS)
From: Andre Ramos @ 2026-03-02  3:45 UTC (permalink / raw)
  To: akpm, hannes; +Cc: linux-mm, linux-kernel, linux-trace-kernel, david, rostedt

Introduce /dev/ampress, a bidirectional fd-based interface for
cooperative memory reclaim between the kernel and userspace.

Userspace processes open /dev/ampress and block on read() to receive
struct ampress_event notifications carrying a graduated urgency level
(LOW/MEDIUM/HIGH/FATAL), the NUMA node of the pressure source, and a
suggested reclaim target in KiB. After freeing memory the process
issues AMPRESS_IOC_ACK to close the feedback loop.
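
As a rough illustration of that flow from the userspace side, a consumer
might look like the sketch below. The struct layouts and the ioctl
number are copied from the UAPI header in this patch; the helper
ampress_handle_one() and its callback are hypothetical, and the sketch
assumes that a successful read() returns exactly one struct
ampress_event.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copies of the UAPI definitions proposed in this patch. */
struct ampress_event {
    uint8_t  urgency;
    uint8_t  numa_node;
    uint16_t reserved;
    uint32_t requested_kb;
    uint64_t timestamp_ns;
};

struct ampress_ack {
    uint32_t freed_kb;
    uint32_t reserved;
};

#define AMPRESS_IOC_MAGIC 'P'
#define AMPRESS_IOC_ACK   _IOW(AMPRESS_IOC_MAGIC, 2, struct ampress_ack)

/*
 * Block for one event on fd, let the caller-supplied callback free
 * memory and report how many KiB it released, then acknowledge.
 * Returns 0 on success, -1 on any error.
 */
static int ampress_handle_one(int fd,
        uint32_t (*free_memory)(const struct ampress_event *ev))
{
    struct ampress_event ev;
    struct ampress_ack ack = { 0 };

    if (read(fd, &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
        return -1;

    ack.freed_kb = free_memory(&ev);
    return ioctl(fd, AMPRESS_IOC_ACK, &ack) < 0 ? -1 : 0;
}
```

A caller would open /dev/ampress once and invoke ampress_handle_one()
in a loop, passing a callback that drops caches or pooled memory
according to ev->urgency.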

The feature hooks into balance_pgdat() in mm/vmscan.c, mapping the
kswapd scan priority to urgency bands:
  priority 10-12 -> LOW
  priority  7-9  -> MEDIUM
  priority  4-6  -> HIGH
  priority  1-3  -> FATAL
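
The band table above can be sketched as a plain function. This is an
illustration only: demo_priority_to_urgency() is a hypothetical name,
the urgency values are the ones from the UAPI header in this patch, and
the -1 return for priorities outside 1-12 is an assumption, since the
table does not say how those are handled.

```c
#define AMPRESS_URGENCY_LOW    0
#define AMPRESS_URGENCY_MEDIUM 1
#define AMPRESS_URGENCY_HIGH   2
#define AMPRESS_URGENCY_FATAL  3

/*
 * Map a kswapd scan priority to an urgency band. Lower priority values
 * mean reclaim is struggling more, so they map to higher urgency.
 * Returns -1 for priorities the table does not cover.
 */
static int demo_priority_to_urgency(int priority)
{
    if (priority >= 10 && priority <= 12)
        return AMPRESS_URGENCY_LOW;
    if (priority >= 7 && priority <= 9)
        return AMPRESS_URGENCY_MEDIUM;
    if (priority >= 4 && priority <= 6)
        return AMPRESS_URGENCY_HIGH;
    if (priority >= 1 && priority <= 3)
        return AMPRESS_URGENCY_FATAL;
    return -1;
}
```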

ampress_notify() is IRQ-safe (read_lock_irqsave + spin_lock_irqsave,
no allocations) so it can be called from any reclaim context.
Per-subscriber events overwrite without queuing to prevent unbounded
backlog. A debugfs trigger at /sys/kernel/debug/ampress/inject allows
testing without real memory pressure.
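
The overwrite-without-queuing behaviour amounts to a single-slot mailbox
per subscriber. A simplified, lock-free userspace sketch follows; the
demo_* names are hypothetical, locking and wake-ups are left out, and it
assumes that a read consumes the pending event.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_event {
    uint8_t  urgency;
    uint32_t requested_kb;
};

/* One slot per subscriber: the newest event replaces any unread one. */
struct demo_subscriber {
    struct demo_event pending_event;
    bool event_pending;
};

static void demo_notify(struct demo_subscriber *sub, uint8_t urgency,
                        uint32_t requested_kb)
{
    sub->pending_event.urgency = urgency;
    sub->pending_event.requested_kb = requested_kb;
    sub->event_pending = true;
}

/* Returns true and fills *out if an event was pending. */
static bool demo_read(struct demo_subscriber *sub, struct demo_event *out)
{
    if (!sub->event_pending)
        return false;

    *out = sub->pending_event;
    sub->event_pending = false;
    return true;
}
```

Two notifications in a row followed by one read yield only the second
event, so memory use per subscriber stays constant no matter how slowly
it reads. The cost is that intermediate events are lost.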

New files:
  include/uapi/linux/ampress.h   - UAPI structs and ioctl definitions
  include/linux/ampress.h        - internal header and ampress_notify()
  include/trace/events/ampress.h - tracepoints for notify and ack
  mm/ampress.c                   - miscdevice driver and core logic
  mm/ampress_test.c              - KUnit tests (3/3 passing)
  tools/testing/ampress/         - userspace integration and stress tests

Signed-off-by: André Castro Ramos <acastroramos1987@gmail.com>
---
 MAINTAINERS                            |  11 +
 include/linux/ampress.h                |  34 +++
 include/trace/events/ampress.h         |  70 ++++++
 include/uapi/linux/ampress.h           |  40 ++++
 mm/Kconfig                             |  26 ++
 mm/Makefile                            |   2 +
 mm/ampress.c                           | 320 +++++++++++++++++++++++++
 mm/ampress_test.c                      | 124 ++++++++++
 mm/vmscan.c                            |  27 +++
 tools/testing/ampress/.gitignore       |   2 +
 tools/testing/ampress/Makefile         |  21 ++
 tools/testing/ampress/ampress_stress.c | 199 +++++++++++++++
 tools/testing/ampress/ampress_test.c   | 212 ++++++++++++++++
 13 files changed, 1088 insertions(+)
 create mode 100644 include/linux/ampress.h
 create mode 100644 include/trace/events/ampress.h
 create mode 100644 include/uapi/linux/ampress.h
 create mode 100644 mm/ampress.c
 create mode 100644 mm/ampress_test.c
 create mode 100644 tools/testing/ampress/.gitignore
 create mode 100644 tools/testing/ampress/Makefile
 create mode 100644 tools/testing/ampress/ampress_stress.c
 create mode 100644 tools/testing/ampress/ampress_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 61bf550fd37..ea4d7861ff9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16629,6 +16629,17 @@ F:    mm/memremap.c
 F:    mm/memory_hotplug.c
 F:    tools/testing/selftests/memory-hotplug/

+ADAPTIVE MEMORY PRESSURE SIGNALING (AMPRESS)
+M:    Darabat <playbadly1@gmail.com>
+L:    linux-mm@kvack.org
+S:    Maintained
+F:    include/linux/ampress.h
+F:    include/trace/events/ampress.h
+F:    include/uapi/linux/ampress.h
+F:    mm/ampress.c
+F:    mm/ampress_test.c
+F:    tools/testing/ampress/
+
 MEMORY MANAGEMENT
 M:    Andrew Morton <akpm@linux-foundation.org>
 L:    linux-mm@kvack.org
diff --git a/include/linux/ampress.h b/include/linux/ampress.h
new file mode 100644
index 00000000000..a0f54a65f94
--- /dev/null
+++ b/include/linux/ampress.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_AMPRESS_H
+#define _LINUX_AMPRESS_H
+
+#include <uapi/linux/ampress.h>
+
+/**
+ * struct ampress_subscriber - per-fd subscriber state
+ * @list:          Entry in the global subscribers list
+ * @wq:            Wait queue for blocking read()
+ * @lock:          Spinlock protecting pending_event and event_pending
+ * @pending_event: Most recent event (may be overwritten if not ACK'd)
+ * @event_pending: True when an unread event is available
+ * @subscribed:    Whether this fd is receiving notifications (toggle
+ *                 via ioctl)
+ * @config:        Per-subscriber threshold configuration
+ */
+struct ampress_subscriber {
+    struct list_head  list;
+    wait_queue_head_t wq;
+    spinlock_t        lock;    /* protects pending_event and event_pending */
+    struct ampress_event pending_event;
+    bool              event_pending;
+    bool              subscribed;
+    struct ampress_config config;
+};
+
+#ifdef CONFIG_AMPRESS
+void ampress_notify(int urgency, int numa_node, unsigned long requested_kb);
+#else
+static inline void ampress_notify(int urgency, int numa_node,
+                  unsigned long requested_kb) {}
+#endif
+
+#endif /* _LINUX_AMPRESS_H */
diff --git a/include/trace/events/ampress.h b/include/trace/events/ampress.h
new file mode 100644
index 00000000000..37ae9d3acd4
--- /dev/null
+++ b/include/trace/events/ampress.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ampress
+
+#if !defined(_TRACE_AMPRESS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AMPRESS_H
+
+#include <linux/tracepoint.h>
+
+/**
+ * ampress_notify_sent - fired each time ampress_notify() delivers an event
+ * @urgency:          AMPRESS_URGENCY_* level
+ * @numa_node:        NUMA node (0xFF = system-wide)
+ * @requested_kb:     Requested reclaim in KiB
+ * @subscriber_count: Number of subscribers that received the event
+ */
+TRACE_EVENT(ampress_notify_sent,
+
+    TP_PROTO(int urgency, int numa_node, unsigned long requested_kb,
+         int subscriber_count),
+
+    TP_ARGS(urgency, numa_node, requested_kb, subscriber_count),
+
+    TP_STRUCT__entry(
+        __field(int,           urgency)
+        __field(int,           numa_node)
+        __field(unsigned long, requested_kb)
+        __field(int,           subscriber_count)
+    ),
+
+    TP_fast_assign(
+        __entry->urgency          = urgency;
+        __entry->numa_node        = numa_node;
+        __entry->requested_kb     = requested_kb;
+        __entry->subscriber_count = subscriber_count;
+    ),
+
+    TP_printk("urgency=%d numa_node=%d requested_kb=%lu subscribers=%d",
+          __entry->urgency, __entry->numa_node,
+          __entry->requested_kb, __entry->subscriber_count)
+);
+
+/**
+ * ampress_ack_received - fired when a userspace process acknowledges an event
+ * @pid:      PID of the acknowledging process
+ * @freed_kb: Amount of memory freed in KiB as reported by userspace
+ */
+TRACE_EVENT(ampress_ack_received,
+
+    TP_PROTO(pid_t pid, unsigned long freed_kb),
+
+    TP_ARGS(pid, freed_kb),
+
+    TP_STRUCT__entry(
+        __field(pid_t,         pid)
+        __field(unsigned long, freed_kb)
+    ),
+
+    TP_fast_assign(
+        __entry->pid      = pid;
+        __entry->freed_kb = freed_kb;
+    ),
+
+    TP_printk("pid=%d freed_kb=%lu", __entry->pid, __entry->freed_kb)
+);
+
+#endif /* _TRACE_AMPRESS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/ampress.h b/include/uapi/linux/ampress.h
new file mode 100644
index 00000000000..da3e0ba38fc
--- /dev/null
+++ b/include/uapi/linux/ampress.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_AMPRESS_H
+#define _UAPI_LINUX_AMPRESS_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Urgency levels */
+#define AMPRESS_URGENCY_LOW    0   /* Soft hint — shed non-critical caches */
+#define AMPRESS_URGENCY_MEDIUM 1   /* Moderate — release pooled memory */
+#define AMPRESS_URGENCY_HIGH   2   /* Severe — checkpoint / compact aggressively */
+#define AMPRESS_URGENCY_FATAL  3   /* Last resort before OOM kill */
+
+struct ampress_event {
+    __u8  urgency;           /* AMPRESS_URGENCY_* */
+    __u8  numa_node;         /* 0xFF = system-wide */
+    __u16 reserved;
+    __u32 requested_kb;      /* How much the kernel wants back (0 = unspecified) */
+    __u64 timestamp_ns;      /* ktime_get_ns() at event generation */
+};
+
+struct ampress_ack {
+    __u32 freed_kb;          /* How much the process actually freed */
+    __u32 reserved;
+};
+
+struct ampress_config {
+    __u32 low_threshold_pct;    /* % of zone watermark to trigger LOW */
+    __u32 medium_threshold_pct;
+    __u32 high_threshold_pct;
+    __u32 fatal_threshold_pct;
+};
+
+#define AMPRESS_IOC_MAGIC       'P'
+#define AMPRESS_IOC_CONFIGURE   _IOW(AMPRESS_IOC_MAGIC, 1, struct ampress_config)
+#define AMPRESS_IOC_ACK         _IOW(AMPRESS_IOC_MAGIC, 2, struct ampress_ack)
+#define AMPRESS_IOC_SUBSCRIBE   _IO(AMPRESS_IOC_MAGIC,  3)
+#define AMPRESS_IOC_UNSUBSCRIBE _IO(AMPRESS_IOC_MAGIC,  4)
+
+#endif /* _UAPI_LINUX_AMPRESS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea35368..be1eddd1231 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1473,4 +1473,30 @@ config LAZY_MMU_MODE_KUNIT_TEST

 source "mm/damon/Kconfig"

+config AMPRESS
+    bool "Adaptive Memory Pressure Signaling"
+    default n
+    help
+      Provides a character device (/dev/ampress) that allows userspace
+      processes to subscribe to graduated memory pressure notifications
+      and cooperatively release memory before OOM conditions occur.
+
+      Processes open /dev/ampress, optionally configure per-urgency
+      thresholds via ioctl, then block on read() to receive
+      struct ampress_event notifications. After freeing memory the
+      process issues AMPRESS_IOC_ACK to close the feedback loop.
+
+      If unsure, say N.
+
+config AMPRESS_TEST
+    tristate "KUnit tests for AMPRESS" if !KUNIT_ALL_TESTS
+    depends on AMPRESS && KUNIT
+    default KUNIT_ALL_TESTS
+    help
+      Enables KUnit-based unit tests for the Adaptive Memory Pressure
+      Signaling subsystem. Tests cover: no-subscriber safety, event
+      delivery to fake subscribers, and overwrite-without-ACK behaviour.
+
+      If unsure, say N.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244..9b72712db1c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,5 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_AMPRESS) += ampress.o
+obj-$(CONFIG_AMPRESS_TEST) += ampress_test.o
diff --git a/mm/ampress.c b/mm/ampress.c
new file mode 100644
index 00000000000..74bfa76aa21
--- /dev/null
+++ b/mm/ampress.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Adaptive Memory Pressure Signaling (AMPRESS)
+ *
+ * Provides a /dev/ampress character device that userspace processes can open
+ * to receive graduated memory pressure notifications and cooperatively release
+ * memory before OOM conditions occur.
+ */
+
+#define pr_fmt(fmt) "ampress: " fmt
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/uaccess.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/ktime.h>
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+#include <linux/ampress.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/ampress.h>
+
+/*
+ * Global subscriber list, protected by ampress_subscribers_lock.
+ * Non-static so KUnit tests can inject fake subscribers directly.
+ */
+LIST_HEAD(ampress_subscribers);
+DEFINE_RWLOCK(ampress_subscribers_lock);
+
+/* Debugfs root directory */
+static struct dentry *ampress_debugfs_dir;
+
+/* ------------------------------------------------------------------ */
+/*  ampress_notify() — called from memory reclaim paths               */
+/* ------------------------------------------------------------------ */
+
+/**
+ * ampress_notify - dispatch a memory pressure event to all subscribers
+ * @urgency:      AMPRESS_URGENCY_* level
+ * @numa_node:    NUMA node of the pressure source (0xFF = system-wide)
+ * @requested_kb: Suggested reclaim target in KiB (0 = unspecified)
+ *
+ * Must be safe to call from any context including IRQ / reclaim paths:
+ *   - no sleeping allocations
+ *   - only spin_lock_irqsave and wake_up_interruptible
+ */
+void ampress_notify(int urgency, int numa_node, unsigned long requested_kb)
+{
+    struct ampress_subscriber *sub;
+    unsigned long rflags, flags;
+    int notified = 0;
+
+    /*
+     * Use irqsave variants: ampress_notify() may be called from a context
+     * where interrupts are disabled (e.g. a future direct-reclaim hook).
+     */
+    read_lock_irqsave(&ampress_subscribers_lock, rflags);
+    list_for_each_entry(sub, &ampress_subscribers, list) {
+        if (!sub->subscribed)
+            continue;
+
+        /*
+         * Check if the urgency meets or exceeds the subscriber's
+         * configured threshold for this urgency level.
+         *
+         * Default config has all thresholds at 0, meaning any
+         * urgency >= 0 passes — i.e. everything is delivered.
+         */
+        spin_lock_irqsave(&sub->lock, flags);
+        sub->pending_event.urgency     = (__u8)urgency;
+        sub->pending_event.numa_node   = (__u8)(numa_node & 0xFF);
+        sub->pending_event.reserved    = 0;
+        sub->pending_event.requested_kb =
+            (__u32)min_t(unsigned long, requested_kb, U32_MAX);
+        sub->pending_event.timestamp_ns = ktime_get_ns();
+        sub->event_pending = true;
+        spin_unlock_irqrestore(&sub->lock, flags);
+
+        wake_up_interruptible(&sub->wq);
+        notified++;
+    }
+    read_unlock_irqrestore(&ampress_subscribers_lock, rflags);
+
+    trace_ampress_notify_sent(urgency, numa_node, requested_kb, notified);
+}
+EXPORT_SYMBOL_GPL(ampress_notify);
+
+/* ------------------------------------------------------------------ */
+/*  File operations                                                    */
+/* ------------------------------------------------------------------ */
+
+static int ampress_open(struct inode *inode, struct file *filp)
+{
+    struct ampress_subscriber *sub;
+
+    sub = kzalloc_obj(*sub, GFP_KERNEL);
+    if (!sub)
+        return -ENOMEM;
+
+    INIT_LIST_HEAD(&sub->list);
+    init_waitqueue_head(&sub->wq);
+    spin_lock_init(&sub->lock);
+    sub->subscribed = true;
+
+    /* Default thresholds: deliver any urgency >= LOW */
+    sub->config.low_threshold_pct    = 0;
+    sub->config.medium_threshold_pct = 0;
+    sub->config.high_threshold_pct   = 0;
+    sub->config.fatal_threshold_pct  = 0;
+
+    write_lock_irq(&ampress_subscribers_lock);
+    list_add_tail(&sub->list, &ampress_subscribers);
+    write_unlock_irq(&ampress_subscribers_lock);
+
+    filp->private_data = sub;
+    return 0;
+}
+
+static int ampress_release(struct inode *inode, struct file *filp)
+{
+    struct ampress_subscriber *sub = filp->private_data;
+
+    write_lock_irq(&ampress_subscribers_lock);
+    list_del(&sub->list);
+    write_unlock_irq(&ampress_subscribers_lock);
+
+    kfree(sub);
+    return 0;
+}
+
+static ssize_t ampress_read(struct file *filp, char __user *buf,
+                size_t count, loff_t *ppos)
+{
+    struct ampress_subscriber *sub = filp->private_data;
+    struct ampress_event event;
+    unsigned long flags;
+    int ret;
+
+    if (count < sizeof(event))
+        return -EINVAL;
+
+    if (filp->f_flags & O_NONBLOCK) {
+        spin_lock_irqsave(&sub->lock, flags);
+        if (!sub->event_pending) {
+            spin_unlock_irqrestore(&sub->lock, flags);
+            return -EAGAIN;
+        }
+        spin_unlock_irqrestore(&sub->lock, flags);
+    } else {
+        ret = wait_event_interruptible(sub->wq, sub->event_pending);
+        if (ret)
+            return ret;
+    }
+
+    spin_lock_irqsave(&sub->lock, flags);
+    event = sub->pending_event;
+    sub->event_pending = false;
+    spin_unlock_irqrestore(&sub->lock, flags);
+
+    if (copy_to_user(buf, &event, sizeof(event)))
+        return -EFAULT;
+
+    return sizeof(event);
+}
+
+static __poll_t ampress_poll(struct file *filp, poll_table *wait)
+{
+    struct ampress_subscriber *sub = filp->private_data;
+
+    poll_wait(filp, &sub->wq, wait);
+
+    if (sub->event_pending)
+        return EPOLLIN | EPOLLRDNORM;
+
+    return 0;
+}
+
+static long ampress_ioctl(struct file *filp, unsigned int cmd,
+              unsigned long arg)
+{
+    struct ampress_subscriber *sub = filp->private_data;
+
+    switch (cmd) {
+    case AMPRESS_IOC_CONFIGURE: {
+        struct ampress_config cfg;
+
+        if (copy_from_user(&cfg, (void __user *)arg, sizeof(cfg)))
+            return -EFAULT;
+
+        /* Thresholds must be non-decreasing and <= 100 */
+        if (cfg.low_threshold_pct > 100 ||
+            cfg.medium_threshold_pct > 100 ||
+            cfg.high_threshold_pct > 100 ||
+            cfg.fatal_threshold_pct > 100)
+            return -EINVAL;
+        if (cfg.low_threshold_pct > cfg.medium_threshold_pct ||
+            cfg.medium_threshold_pct > cfg.high_threshold_pct ||
+            cfg.high_threshold_pct > cfg.fatal_threshold_pct)
+            return -EINVAL;
+
+        sub->config = cfg;
+        return 0;
+    }
+
+    case AMPRESS_IOC_ACK: {
+        struct ampress_ack ack;
+
+        if (copy_from_user(&ack, (void __user *)arg, sizeof(ack)))
+            return -EFAULT;
+
+        trace_ampress_ack_received(task_pid_nr(current),
+                       (unsigned long)ack.freed_kb);
+        return 0;
+    }
+
+    case AMPRESS_IOC_SUBSCRIBE:
+        sub->subscribed = true;
+        return 0;
+
+    case AMPRESS_IOC_UNSUBSCRIBE:
+        sub->subscribed = false;
+        return 0;
+
+    default:
+        return -ENOTTY;
+    }
+}
+
+static const struct file_operations ampress_fops = {
+    .owner          = THIS_MODULE,
+    .open           = ampress_open,
+    .release        = ampress_release,
+    .read           = ampress_read,
+    .poll           = ampress_poll,
+    .unlocked_ioctl = ampress_ioctl,
+    .llseek         = noop_llseek,
+};
+
+static struct miscdevice ampress_miscdev = {
+    .minor = MISC_DYNAMIC_MINOR,
+    .name  = "ampress",
+    .fops  = &ampress_fops,
+};
+
+/* ------------------------------------------------------------------ */
+/*  Debugfs inject trigger                                             */
+/* ------------------------------------------------------------------ */
+
+static ssize_t ampress_inject_write(struct file *filp,
+                    const char __user *buf,
+                    size_t count, loff_t *ppos)
+{
+    char tmp[4];
+    unsigned long urgency;
+    int ret;
+
+    if (count > sizeof(tmp) - 1)
+        return -EINVAL;
+    if (copy_from_user(tmp, buf, count))
+        return -EFAULT;
+    tmp[count] = '\0';
+
+    ret = kstrtoul(tmp, 10, &urgency);
+    if (ret)
+        return ret;
+    if (urgency > AMPRESS_URGENCY_FATAL)
+        return -ERANGE;
+
+    ampress_notify((int)urgency, 0, 0);
+    return count;
+}
+
+static const struct file_operations ampress_inject_fops = {
+    .owner = THIS_MODULE,
+    .write = ampress_inject_write,
+    .llseek = noop_llseek,
+};
+
+/* ------------------------------------------------------------------ */
+/*  Module init / exit                                                 */
+/* ------------------------------------------------------------------ */
+
+static int __init ampress_init(void)
+{
+    int ret;
+
+    ret = misc_register(&ampress_miscdev);
+    if (ret) {
+        pr_err("failed to register miscdevice: %d\n", ret);
+        return ret;
+    }
+
+    ampress_debugfs_dir = debugfs_create_dir("ampress", NULL);
+    if (!IS_ERR_OR_NULL(ampress_debugfs_dir))
+        debugfs_create_file("inject", 0200, ampress_debugfs_dir,
+                    NULL, &ampress_inject_fops);
+
+    pr_info("Adaptive Memory Pressure Signaling initialized\n");
+    return 0;
+}
+
+static void __exit ampress_exit(void)
+{
+    debugfs_remove_recursive(ampress_debugfs_dir);
+    misc_deregister(&ampress_miscdev);
+    pr_info("Adaptive Memory Pressure Signaling removed\n");
+}
+
+module_init(ampress_init);
+module_exit(ampress_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Linux Kernel");
+MODULE_DESCRIPTION("Adaptive Memory Pressure Signaling (/dev/ampress)");
diff --git a/mm/ampress_test.c b/mm/ampress_test.c
new file mode 100644
index 00000000000..ea2674c91b6
--- /dev/null
+++ b/mm/ampress_test.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for Adaptive Memory Pressure Signaling (AMPRESS)
+ */
+
+#include <kunit/test.h>
+#include <linux/ampress.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/wait.h>
+
+/*
+ * White-box access to AMPRESS internals for unit testing.
+ * These externs allow injecting fake subscribers directly into the global
+ * list without going through the character device file operations.
+ */
+extern struct list_head ampress_subscribers;
+extern rwlock_t ampress_subscribers_lock;
+
+/* ------------------------------------------------------------------ */
+/*  Test 1: notify with no subscribers — must not crash               */
+/* ------------------------------------------------------------------ */
+
+static void ampress_test_no_subscribers(struct kunit *test)
+{
+    /* Must complete without hang or crash */
+    ampress_notify(AMPRESS_URGENCY_LOW,    0, 0);
+    ampress_notify(AMPRESS_URGENCY_MEDIUM, 0, 1024);
+    ampress_notify(AMPRESS_URGENCY_HIGH,   0, 2048);
+    ampress_notify(AMPRESS_URGENCY_FATAL,  0, 0);
+
+    KUNIT_SUCCEED(test);
+}
+
+/* ------------------------------------------------------------------ */
+/*  Test 2: fake subscriber receives correct event                    */
+/* ------------------------------------------------------------------ */
+
+static void ampress_test_event_delivery(struct kunit *test)
+{
+    struct ampress_subscriber sub = {};
+
+    INIT_LIST_HEAD(&sub.list);
+    init_waitqueue_head(&sub.wq);
+    spin_lock_init(&sub.lock);
+    sub.subscribed    = true;
+    sub.event_pending = false;
+
+    write_lock(&ampress_subscribers_lock);
+    list_add_tail(&sub.list, &ampress_subscribers);
+    write_unlock(&ampress_subscribers_lock);
+
+    ampress_notify(AMPRESS_URGENCY_HIGH, 1, 4096);
+
+    write_lock(&ampress_subscribers_lock);
+    list_del(&sub.list);
+    write_unlock(&ampress_subscribers_lock);
+
+    KUNIT_EXPECT_TRUE(test, sub.event_pending);
+    KUNIT_EXPECT_EQ(test, (int)sub.pending_event.urgency,
+            AMPRESS_URGENCY_HIGH);
+    KUNIT_EXPECT_EQ(test, (int)sub.pending_event.numa_node, 1);
+    KUNIT_EXPECT_EQ(test, (u32)sub.pending_event.requested_kb, (u32)4096);
+}
+
+/* ------------------------------------------------------------------ */
+/*  Test 3: second notify without ACK overwrites first (no overflow)  */
+/* ------------------------------------------------------------------ */
+
+static void ampress_test_overwrite_without_ack(struct kunit *test)
+{
+    struct ampress_subscriber sub = {};
+
+    INIT_LIST_HEAD(&sub.list);
+    init_waitqueue_head(&sub.wq);
+    spin_lock_init(&sub.lock);
+    sub.subscribed    = true;
+    sub.event_pending = false;
+
+    write_lock(&ampress_subscribers_lock);
+    list_add_tail(&sub.list, &ampress_subscribers);
+    write_unlock(&ampress_subscribers_lock);
+
+    /* First event */
+    ampress_notify(AMPRESS_URGENCY_LOW, 0, 100);
+
+    KUNIT_EXPECT_TRUE(test, sub.event_pending);
+    KUNIT_EXPECT_EQ(test, (int)sub.pending_event.urgency,
+            AMPRESS_URGENCY_LOW);
+
+    /* Second event without reading (no ACK) */
+    ampress_notify(AMPRESS_URGENCY_FATAL, 0, 9999);
+
+    write_lock(&ampress_subscribers_lock);
+    list_del(&sub.list);
+    write_unlock(&ampress_subscribers_lock);
+
+    /* The second event must overwrite the first */
+    KUNIT_EXPECT_TRUE(test, sub.event_pending);
+    KUNIT_EXPECT_EQ(test, (int)sub.pending_event.urgency,
+            AMPRESS_URGENCY_FATAL);
+    KUNIT_EXPECT_EQ(test, (u32)sub.pending_event.requested_kb, (u32)9999);
+}
+
+/* ------------------------------------------------------------------ */
+/*  Test suite registration                                            */
+/* ------------------------------------------------------------------ */
+
+static struct kunit_case ampress_test_cases[] = {
+    KUNIT_CASE(ampress_test_no_subscribers),
+    KUNIT_CASE(ampress_test_event_delivery),
+    KUNIT_CASE(ampress_test_overwrite_without_ack),
+    {}
+};
+
+static struct kunit_suite ampress_test_suite = {
+    .name  = "ampress",
+    .test_cases = ampress_test_cases,
+};
+
+kunit_test_suite(ampress_test_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("KUnit tests for AMPRESS");
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e825..34da5104453 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -68,6 +68,8 @@
 #include "internal.h"
 #include "swap.h"

+#include <linux/ampress.h>
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>

@@ -7103,6 +7105,31 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)

         if (raise_priority || !nr_reclaimed)
             sc.priority--;
+
+#ifdef CONFIG_AMPRESS
+        /*
+         * Map the current scan priority to an AMPRESS urgency level
+         * and notify subscribers. Lower priority means the system is
+         * working harder to reclaim memory, indicating higher pressure.
+         * DEF_PRIORITY == 12; we divide the range into four bands.
+         */
+        if (!balanced) {
+            int amp_urgency;
+
+            if (sc.priority <= 3)
+                amp_urgency = AMPRESS_URGENCY_FATAL;
+            else if (sc.priority <= 6)
+                amp_urgency = AMPRESS_URGENCY_HIGH;
+            else if (sc.priority <= 9)
+                amp_urgency = AMPRESS_URGENCY_MEDIUM;
+            else
+                amp_urgency = AMPRESS_URGENCY_LOW;
+
+            ampress_notify(amp_urgency, pgdat->node_id,
+                       (unsigned long)sc.nr_to_reclaim <<
+                       (PAGE_SHIFT - 10));
+        }
+#endif
     } while (sc.priority >= 1);

     /*
diff --git a/tools/testing/ampress/.gitignore b/tools/testing/ampress/.gitignore
new file mode 100644
index 00000000000..c2ee439db7b
--- /dev/null
+++ b/tools/testing/ampress/.gitignore
@@ -0,0 +1,2 @@
+ampress_test
+ampress_stress
diff --git a/tools/testing/ampress/Makefile b/tools/testing/ampress/Makefile
new file mode 100644
index 00000000000..d175dee7c22
--- /dev/null
+++ b/tools/testing/ampress/Makefile
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for AMPRESS userspace tests
+
+CC      := gcc
+CFLAGS  := -Wall -Wextra -O2
+LDFLAGS := -static
+
+PROGS   := ampress_test ampress_stress
+
+.PHONY: all clean
+
+all: $(PROGS)
+
+ampress_test: ampress_test.c
+    $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $<
+
+ampress_stress: ampress_stress.c
+    $(CC) $(CFLAGS) $(LDFLAGS) -pthread -o $@ $<
+
+clean:
+    rm -f $(PROGS)
diff --git a/tools/testing/ampress/ampress_stress.c b/tools/testing/ampress/ampress_stress.c
new file mode 100644
index 00000000000..7894abd764b
--- /dev/null
+++ b/tools/testing/ampress/ampress_stress.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ampress_stress.c — Concurrency / stress test for /dev/ampress
+ *
+ * Launches 64 reader threads that each open /dev/ampress independently and
+ * read in a tight loop for 10 seconds. A 65th "driver" thread injects events
+ * via the debugfs trigger. Checks for UAF, corruption, and hangs.
+ *
+ * Build: gcc -Wall -Wextra -static -pthread -o ampress_stress ampress_stress.c
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <time.h>
+#include <sys/poll.h>
+#include <linux/types.h>
+
+#define AMPRESS_URGENCY_LOW    0
+#define AMPRESS_URGENCY_MEDIUM 1
+#define AMPRESS_URGENCY_HIGH   2
+#define AMPRESS_URGENCY_FATAL  3
+
+struct ampress_event {
+    __u8  urgency;
+    __u8  numa_node;
+    __u16 reserved;
+    __u32 requested_kb;
+    __u64 timestamp_ns;
+};
+
+#define DEVICE_PATH   "/dev/ampress"
+#define DEBUGFS_INJECT "/sys/kernel/debug/ampress/inject"
+#define NUM_READERS    64
+#define TEST_DURATION  10   /* seconds */
+
+static _Atomic int g_stop;
+static unsigned long g_events_read[NUM_READERS];
+
+struct reader_arg {
+    int idx;
+};
+
+static void *reader_thread(void *arg)
+{
+    struct reader_arg *a = arg;
+    int fd;
+    struct pollfd pfd;
+
+    fd = open(DEVICE_PATH, O_RDONLY | O_NONBLOCK);
+    if (fd < 0) {
+        fprintf(stderr, "reader[%d]: open failed: %s\n",
+            a->idx, strerror(errno));
+        return (void *)(intptr_t)-1;
+    }
+
+    pfd.fd     = fd;
+    pfd.events = POLLIN;
+
+    while (!g_stop) {
+        int ret = poll(&pfd, 1, 200);
+
+        if (ret < 0) {
+            if (errno == EINTR)
+                continue;
+            perror("poll");
+            break;
+        }
+        if (ret == 0)
+            continue;
+
+        if (pfd.revents & POLLIN) {
+            struct ampress_event ev;
+            ssize_t n = read(fd, &ev, sizeof(ev));
+
+            if (n < 0) {
+                if (errno == EAGAIN)
+                    continue;
+                perror("read");
+                break;
+            }
+            if ((size_t)n == sizeof(ev)) {
+                /* Basic sanity checks */
+                if (ev.urgency > AMPRESS_URGENCY_FATAL) {
+                    fprintf(stderr,
+                        "reader[%d]: BAD urgency %u\n",
+                        a->idx, ev.urgency);
+                    close(fd);
+                    return (void *)(intptr_t)-1;
+                }
+                g_events_read[a->idx]++;
+            }
+        }
+    }
+
+    close(fd);
+    return NULL;
+}
+
+static void *inject_thread(void *arg)
+{
+    int inject_fd;
+    int urgency = 0;
+    char buf[4];
+
+    (void)arg;
+
+    inject_fd = open(DEBUGFS_INJECT, O_WRONLY);
+    if (inject_fd < 0) {
+        fprintf(stderr, "inject: open %s failed: %s\n",
+            DEBUGFS_INJECT, strerror(errno));
+        return (void *)(intptr_t)-1;
+    }
+
+    while (!g_stop) {
+        buf[0] = '0' + (char)(urgency % 4);
+        buf[1] = '\n';
+        if (write(inject_fd, buf, 2) < 0) {
+            perror("inject write");
+            break;
+        }
+        urgency++;
+        usleep(5000); /* 5 ms between injections */
+    }
+
+    close(inject_fd);
+    return NULL;
+}
+
+int main(void)
+{
+    pthread_t readers[NUM_READERS];
+    pthread_t injector;
+    struct reader_arg args[NUM_READERS];
+    unsigned long total = 0;
+    int i, rc;
+    int failed = 0;
+
+    g_stop = 0;
+
+    /* Start reader threads */
+    for (i = 0; i < NUM_READERS; i++) {
+        args[i].idx = i;
+        rc = pthread_create(&readers[i], NULL, reader_thread, &args[i]);
+        if (rc) {
+            fprintf(stderr, "pthread_create reader[%d]: %s\n",
+                i, strerror(rc));
+            return 1;
+        }
+    }
+
+    /* Start inject thread */
+    rc = pthread_create(&injector, NULL, inject_thread, NULL);
+    if (rc) {
+        fprintf(stderr, "pthread_create injector: %s\n", strerror(rc));
+        /* Non-fatal: stress test can still run with real pressure */
+    }
+
+    printf("ampress_stress: %d readers running for %d seconds...\n",
+           NUM_READERS, TEST_DURATION);
+
+    sleep(TEST_DURATION);
+
+    g_stop = 1;
+
+    for (i = 0; i < NUM_READERS; i++) {
+        void *retval;
+
+        pthread_join(readers[i], &retval);
+        if ((intptr_t)retval != 0) {
+            fprintf(stderr, "reader[%d] failed\n", i);
+            failed++;
+        }
+        total += g_events_read[i];
+    }
+
+    if (rc == 0) {
+        void *retval;
+
+        pthread_join(injector, &retval);
+    }
+
+    printf("ampress_stress: total events read: %lu across %d threads\n",
+           total, NUM_READERS);
+
+    if (failed) {
+        fprintf(stderr, "ampress_stress: FAIL — %d threads reported errors\n",
+            failed);
+        return 1;
+    }
+
+    printf("ampress_stress: PASS\n");
+    return 0;
+}
diff --git a/tools/testing/ampress/ampress_test.c b/tools/testing/ampress/ampress_test.c
new file mode 100644
index 00000000000..372705aaa0a
--- /dev/null
+++ b/tools/testing/ampress/ampress_test.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ampress_test.c — Userspace integration test for /dev/ampress
+ *
+ * Usage: ./ampress_test
+ *
+ * Opens /dev/ampress, optionally configures thresholds, then forks a child
+ * that exhausts memory via mmap while the parent polls for pressure events.
+ * Expects to see at least one HIGH-urgency event within 30 seconds; exits 0
+ * on success, 1 on timeout or error.
+ *
+ * Build: gcc -Wall -Wextra -static -o ampress_test ampress_test.c
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <signal.h>
+#include <time.h>
+#include <sys/mman.h>
+#include <sys/poll.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+
+/* Pull in UAPI types without kernel headers */
+#include <linux/types.h>
+
+/*
+ * Duplicate the UAPI definitions here so the test can be built with
+ * just a libc (--sysroot or installed kernel headers are not required).
+ */
+#define AMPRESS_URGENCY_LOW    0
+#define AMPRESS_URGENCY_MEDIUM 1
+#define AMPRESS_URGENCY_HIGH   2
+#define AMPRESS_URGENCY_FATAL  3
+
+struct ampress_event {
+    __u8  urgency;
+    __u8  numa_node;
+    __u16 reserved;
+    __u32 requested_kb;
+    __u64 timestamp_ns;
+};
+
+struct ampress_ack {
+    __u32 freed_kb;
+    __u32 reserved;
+};
+
+struct ampress_config {
+    __u32 low_threshold_pct;
+    __u32 medium_threshold_pct;
+    __u32 high_threshold_pct;
+    __u32 fatal_threshold_pct;
+};
+
+#define AMPRESS_IOC_MAGIC       'P'
+#define AMPRESS_IOC_CONFIGURE   _IOW(AMPRESS_IOC_MAGIC, 1, struct ampress_config)
+#define AMPRESS_IOC_ACK         _IOW(AMPRESS_IOC_MAGIC, 2, struct ampress_ack)
+#define AMPRESS_IOC_SUBSCRIBE   _IO(AMPRESS_IOC_MAGIC,  3)
+#define AMPRESS_IOC_UNSUBSCRIBE _IO(AMPRESS_IOC_MAGIC,  4)
+
+#define DEVICE_PATH "/dev/ampress"
+#define TIMEOUT_SEC 30
+#define PAGE_SZ     4096
+
+static const char *urgency_str(int u)
+{
+    switch (u) {
+    case AMPRESS_URGENCY_LOW:    return "LOW";
+    case AMPRESS_URGENCY_MEDIUM: return "MEDIUM";
+    case AMPRESS_URGENCY_HIGH:   return "HIGH";
+    case AMPRESS_URGENCY_FATAL:  return "FATAL";
+    default:                     return "UNKNOWN";
+    }
+}
+
+/* Child: mmap in a tight loop to exhaust memory */
+static void child_exhaust(void)
+{
+    size_t chunk = 64 * 1024 * 1024; /* 64 MiB per iteration */
+    int iter = 0;
+
+    while (1) {
+        void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
+                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
+                   -1, 0);
+        if (p == MAP_FAILED) {
+            if (errno == ENOMEM) {
+                /* Slow down and keep retrying */
+                usleep(100000);
+                continue;
+            }
+            perror("mmap");
+            _exit(1);
+        }
+        /* Touch every page so they are actually allocated */
+        memset(p, (char)iter, chunk);
+        iter++;
+    }
+}
+
+int main(void)
+{
+    int fd;
+    pid_t child;
+    struct pollfd pfd;
+    time_t deadline;
+    int seen[4] = { 0, 0, 0, 0 };
+    int status;
+
+    fd = open(DEVICE_PATH, O_RDONLY);
+    if (fd < 0) {
+        perror("open " DEVICE_PATH);
+        return 1;
+    }
+
+    /* Configure thresholds (all 0 = default: deliver everything) */
+    struct ampress_config cfg = {
+        .low_threshold_pct    = 0,
+        .medium_threshold_pct = 0,
+        .high_threshold_pct   = 0,
+        .fatal_threshold_pct  = 0,
+    };
+    if (ioctl(fd, AMPRESS_IOC_CONFIGURE, &cfg) < 0) {
+        perror("AMPRESS_IOC_CONFIGURE");
+        close(fd);
+        return 1;
+    }
+
+    child = fork();
+    if (child < 0) {
+        perror("fork");
+        close(fd);
+        return 1;
+    }
+    if (child == 0)
+        child_exhaust(); /* Never returns */
+
+    printf("ampress_test: child PID %d exhausting memory...\n", child);
+
+    deadline = time(NULL) + TIMEOUT_SEC;
+    pfd.fd     = fd;
+    pfd.events = POLLIN;
+
+    while (time(NULL) < deadline) {
+        int remaining = (int)(deadline - time(NULL));
+        int ret = poll(&pfd, 1, remaining * 1000);
+
+        if (ret < 0) {
+            if (errno == EINTR)
+                continue;
+            perror("poll");
+            goto fail;
+        }
+        if (ret == 0) {
+            fprintf(stderr, "ampress_test: TIMEOUT - no HIGH event received\n");
+            goto fail;
+        }
+
+        if (pfd.revents & POLLIN) {
+            struct ampress_event ev;
+            ssize_t n = read(fd, &ev, sizeof(ev));
+
+            if (n < 0) {
+                perror("read");
+                goto fail;
+            }
+            if ((size_t)n < sizeof(ev)) {
+                fprintf(stderr, "short read: %zd\n", n);
+                goto fail;
+            }
+
+            printf("ampress_test: urgency=%-6s numa=%u kb=%u ts=%llu\n",
+                   urgency_str(ev.urgency), ev.numa_node,
+                   ev.requested_kb,
+                   (unsigned long long)ev.timestamp_ns);
+
+            if (ev.urgency <= AMPRESS_URGENCY_FATAL)
+                seen[ev.urgency] = 1;
+
+            /* ACK with a simulated freed amount */
+            struct ampress_ack ack = { .freed_kb = 16384 };
+
+            if (ioctl(fd, AMPRESS_IOC_ACK, &ack) < 0)
+                perror("AMPRESS_IOC_ACK (non-fatal)");
+
+            /* Success criterion: seen at least up to HIGH */
+            if (seen[AMPRESS_URGENCY_HIGH] ||
+                seen[AMPRESS_URGENCY_FATAL])
+                goto success;
+        }
+    }
+
+    fprintf(stderr, "ampress_test: TIMEOUT\n");
+fail:
+    kill(child, SIGKILL);
+    waitpid(child, &status, 0);
+    close(fd);
+    return 1;
+
+success:
+    printf("ampress_test: SUCCESS — received HIGH (or higher) event\n");
+    kill(child, SIGKILL);
+    waitpid(child, &status, 0);
+    close(fd);
+    return 0;
+}
-- 
2.51.0
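
As a reading aid for the balance_pgdat() hunk above, here is a minimal
userspace sketch of the priority-to-urgency banding and of the pages-to-KiB
conversion that feeds ampress_notify(). It is not part of the patch. The
sketch_* names and SKETCH_PAGE_SHIFT are hypothetical, and 4 KiB pages are
assumed. The enum values mirror the AMPRESS_URGENCY_* defines in the
userspace tests.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror the AMPRESS_URGENCY_* defines used by the patch's tests. */
enum {
	URGENCY_LOW    = 0,
	URGENCY_MEDIUM = 1,
	URGENCY_HIGH   = 2,
	URGENCY_FATAL  = 3,
};

/* Assumption for this sketch only: 4 KiB pages, i.e. PAGE_SHIFT == 12. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper: the if/else ladder from the balance_pgdat() hunk. */
int sketch_priority_to_urgency(int priority)
{
	if (priority <= 3)
		return URGENCY_FATAL;
	if (priority <= 6)
		return URGENCY_HIGH;
	if (priority <= 9)
		return URGENCY_MEDIUM;
	return URGENCY_LOW;
}

/*
 * Hypothetical helper: convert pages to KiB, then clamp to 32 bits the way
 * ampress_notify() clamps requested_kb. The sketch widens to 64 bits before
 * shifting; the patch shifts an unsigned long instead.
 */
uint32_t sketch_requested_kb(unsigned long nr_pages)
{
	unsigned long long kb =
		(unsigned long long)nr_pages << (SKETCH_PAGE_SHIFT - 10);

	return kb > UINT32_MAX ? UINT32_MAX : (uint32_t)kb;
}

/* Spot checks at the band edges listed in the commit message. */
void sketch_selfcheck(void)
{
	assert(sketch_priority_to_urgency(12) == URGENCY_LOW);
	assert(sketch_priority_to_urgency(10) == URGENCY_LOW);
	assert(sketch_priority_to_urgency(9) == URGENCY_MEDIUM);
	assert(sketch_priority_to_urgency(6) == URGENCY_HIGH);
	assert(sketch_priority_to_urgency(3) == URGENCY_FATAL);
	assert(sketch_priority_to_urgency(0) == URGENCY_FATAL);
	assert(sketch_requested_kb(32) == 128);
}
```

Two things stand out when the ladder is read this way. The hook sits after
sc.priority-- and the loop only exits once sc.priority drops below 1, so a
priority of 0 appears reachable and lands in the FATAL band, although the
commit message lists 1-3. Also, the notification fires on every unbalanced
loop iteration, so a subscriber may see several events per kswapd pass.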

^ permalink raw reply related

* [PATCH V2] blktrace: fix __this_cpu_read/write in preemptible context
From: Chaitanya Kulkarni @ 2026-03-02  0:22 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: shinichiro.kawasaki, linux-block, linux-trace-kernel,
	Chaitanya Kulkarni

tracing_record_cmdline() internally uses __this_cpu_read() and
__this_cpu_write() on the per-CPU variable trace_cmdline_save, and
trace_save_cmdline() explicitly asserts preemption is disabled via
lockdep_assert_preemption_disabled(). These operations are only safe
when preemption is off, as they were designed to be called from the
scheduler context (probe_wakeup_sched_switch() / probe_wakeup()).

__blk_add_trace() was calling tracing_record_cmdline(current) early in
the blk_tracer path, before ring buffer reservation, from process
context where preemption is fully enabled. This triggers the following
BUG when running blktests blktrace/002:

blktrace/002 (blktrace ftrace corruption with sysfs trace)   [failed]
    runtime  0.367s  ...  0.437s
    something found in dmesg:
    [   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
    [   81.239580] null_blk: disk nullb1 created
    [   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
    [   81.362842] caller is tracing_record_cmdline+0x10/0x40
    [   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
    [   81.362877] Tainted: [N]=TEST
    [   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    [   81.362881] Call Trace:
    [   81.362884]  <TASK>
    [   81.362886]  dump_stack_lvl+0x8d/0xb0
    ...
    (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message)

[   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
[   81.239580] null_blk: disk nullb1 created
[   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
[   81.362842] caller is tracing_record_cmdline+0x10/0x40
[   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
[   81.362877] Tainted: [N]=TEST
[   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[   81.362881] Call Trace:
[   81.362884]  <TASK>
[   81.362886]  dump_stack_lvl+0x8d/0xb0
[   81.362895]  check_preemption_disabled+0xce/0xe0
[   81.362902]  tracing_record_cmdline+0x10/0x40
[   81.362923]  __blk_add_trace+0x307/0x5d0
[   81.362934]  ? lock_acquire+0xe0/0x300
[   81.362940]  ? iov_iter_extract_pages+0x101/0xa30
[   81.362959]  blk_add_trace_bio+0x106/0x1e0
[   81.362968]  submit_bio_noacct_nocheck+0x24b/0x3a0
[   81.362979]  ? lockdep_init_map_type+0x58/0x260
[   81.362988]  submit_bio_wait+0x56/0x90
[   81.363009]  __blkdev_direct_IO_simple+0x16c/0x250
[   81.363026]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[   81.363038]  ? rcu_read_lock_any_held+0x73/0xa0
[   81.363051]  blkdev_read_iter+0xc1/0x140
[   81.363059]  vfs_read+0x20b/0x330
[   81.363083]  ksys_read+0x67/0xe0
[   81.363090]  do_syscall_64+0xbf/0xf00
[   81.363102]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   81.363106] RIP: 0033:0x7f281906029d
[   81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[   81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d
[   81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000
[   81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000
[   81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000
[   81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a
[   81.363142]  </TASK>

The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(),
and blk_add_trace_rq() paths as well.

The purpose of tracing_record_cmdline() is to cache the task->comm for
a given PID so that the trace can later resolve it. It is only
meaningful when a trace event is actually being recorded. Ring buffer
reservation via ring_buffer_lock_reserve() disables preemption, and
preemption remains disabled until the event is committed:

__blk_add_trace()
       	trace_buffer_lock_reserve()
       		__trace_buffer_lock_reserve()
       			ring_buffer_lock_reserve()
       				preempt_disable_notrace();  <---
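
The ordering argument can be modelled in userspace roughly as follows. This
is only a sketch: every sketch_* name is hypothetical, the counter stands in
for the kernel's preempt count, and it assumes the failed-reservation path
re-enables preemption before returning NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel state; none of these names exist in the kernel. */
int sketch_preempt_count;	/* > 0 means "preemption disabled" */
int sketch_cmdlines_saved;	/* how many times the comm was cached */

struct sketch_event {
	int payload;
};

static struct sketch_event sketch_slot;

/* Models ring_buffer_lock_reserve(): on success preemption stays disabled. */
struct sketch_event *sketch_reserve(bool have_space)
{
	sketch_preempt_count++;
	if (!have_space) {
		/* Assumed failure path: re-enable preemption, return NULL. */
		sketch_preempt_count--;
		return NULL;
	}
	return &sketch_slot;
}

/* Models tracing_record_cmdline(): only valid with preemption disabled. */
void sketch_record_cmdline(void)
{
	assert(sketch_preempt_count > 0);
	sketch_cmdlines_saved++;
}

/* Models committing the event, which re-enables preemption. */
void sketch_commit(struct sketch_event *ev)
{
	(void)ev;
	sketch_preempt_count--;
}

/* Ordering after the fix: reserve, bail out on failure, then record. */
bool sketch_add_trace(bool have_space)
{
	struct sketch_event *ev = sketch_reserve(have_space);

	if (!ev)
		return false;

	sketch_record_cmdline();
	ev->payload = 1;
	sketch_commit(ev);
	return true;
}
```

In this model, calling sketch_record_cmdline() before sketch_reserve() would
trip the assertion, which corresponds to the BUG above. Recording only after
a successful reservation also means no comm is cached for an event that was
never written.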

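Move the tracing_record_cmdline() call to after the ring buffer
reservation has succeeded, where preemption is already disabled.

With this fix, the blktrace group of blktests passes: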

  blktests (master) # ./check blktrace
  blktrace/001 (blktrace zone management command tracing)      [passed]
      runtime  3.650s  ...  3.647s
  blktrace/002 (blktrace ftrace corruption with sysfs trace)   [passed]
      runtime  0.411s  ...  0.384s

Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
v2:

1. Drop the preempt_disable_notrace()/preempt_enable_notrace() pair added
   in V1. Instead, move the tracing_record_cmdline() call to after ring
   buffer reservation, which already disables preemption.

---
 kernel/trace/blktrace.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 3b7c102a6eb3..ead03e0e0fbe 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -383,8 +383,6 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	cpu = raw_smp_processor_id();
 
 	if (blk_tracer) {
-		tracing_record_cmdline(current);
-
 		buffer = blk_tr->array_buffer.buffer;
 		trace_ctx = tracing_gen_ctx_flags(0);
 		switch (bt->version) {
@@ -419,6 +417,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		if (!event)
 			return;
 
+		tracing_record_cmdline(current);
 		switch (bt->version) {
 		case 1:
 			record_blktrace_event(ring_buffer_event_data(event),
-- 
2.39.5


^ permalink raw reply related

* Re: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run3 (2)
From: syzbot @ 2026-03-01  7:07 UTC (permalink / raw)
  To: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
	kpsingh, linux-kernel, linux-trace-kernel, martin.lau,
	mathieu.desnoyers, mattbobrowski, mhiramat, rostedt, sdf, song,
	syzkaller-bugs, wangqing7171, yonghong.song
In-Reply-To: <69a0544c.050a0220.3a55be.0004.GAE@google.com>

syzbot has found a reproducer for the following issue on:

HEAD commit:    2f9339c052bd Merge tag 'spi-fix-v7.0-rc1' of git://git.ker..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1610e3e6580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=56150637ffd942dd
dashboard link: https://syzkaller.appspot.com/bug?extid=9ea7c90be2b24e189592
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10b12d5a580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=153ce006580000

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-2f9339c0.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/03f5cd2356a3/vmlinux-2f9339c0.xz
kernel image: https://storage.googleapis.com/syzbot-assets/fd9f61ab139e/bzImage-2f9339c0.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+9ea7c90be2b24e189592@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: slab-use-after-free in __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline]
BUG: KASAN: slab-use-after-free in bpf_trace_run3+0xdd/0x850 kernel/trace/bpf_trace.c:2130
Read of size 8 at addr ffff88803828ab18 by task dhcpcd-run-hook/5487

CPU: 0 UID: 0 PID: 5487 Comm: dhcpcd-run-hook Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xba/0x230 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 __bpf_trace_run kernel/trace/bpf_trace.c:2075 [inline]
 bpf_trace_run3+0xdd/0x850 kernel/trace/bpf_trace.c:2130
 __traceiter_kmem_cache_free+0x38/0x60 include/trace/events/kmem.h:117
 __do_trace_kmem_cache_free include/trace/events/kmem.h:117 [inline]
 trace_kmem_cache_free include/trace/events/kmem.h:117 [inline]
 kmem_cache_free+0x5ac/0x630 mm/slub.c:6272
 anon_vma_chain_free mm/rmap.c:147 [inline]
 unlink_anon_vmas+0x69d/0x730 mm/rmap.c:532
 free_pgtables+0x836/0xb70 mm/memory.c:427
 exit_mmap+0x490/0xa10 mm/mmap.c:1314
 __mmput+0x118/0x430 kernel/fork.c:1174
 exec_mmap+0x3b4/0x440 fs/exec.c:893
 begin_new_exec+0x134a/0x24a0 fs/exec.c:1148
 load_elf_binary+0xa47/0x2980 fs/binfmt_elf.c:1010
 search_binary_handler fs/exec.c:1664 [inline]
 exec_binprm fs/exec.c:1696 [inline]
 bprm_execve+0x93d/0x1460 fs/exec.c:1748
 do_execveat_common+0x50d/0x690 fs/exec.c:1846
 __do_sys_execve fs/exec.c:1930 [inline]
 __se_sys_execve fs/exec.c:1924 [inline]
 __x64_sys_execve+0x97/0xc0 fs/exec.c:1924
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f4dd469a107
Code: Unable to access opcode bytes at 0x7f4dd469a0dd.
RSP: 002b:00007ffed1452c68 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 000055b1f84170c8 RCX: 00007f4dd469a107
RDX: 000055b1f84170e8 RSI: 000055b1f84170c8 RDI: 000055b1f8417170
RBP: 000055b1f8417170 R08: 00007ffed1456ea4 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000246 R12: 000055b1f84170e8
R13: 00007f4dd485fe8b R14: 000055b1f84170e8 R15: 0000000000000000
 </TASK>

Allocated by task 5486:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __kmalloc_cache_noprof+0x31c/0x660 mm/slub.c:5358
 kmalloc_noprof include/linux/slab.h:950 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 bpf_raw_tp_link_attach+0x278/0x700 kernel/bpf/syscall.c:4264
 bpf_raw_tracepoint_open+0x1b2/0x220 kernel/bpf/syscall.c:4312
 __sys_bpf+0x846/0x950 kernel/bpf/syscall.c:6270
 __do_sys_bpf kernel/bpf/syscall.c:6341 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:6339 [inline]
 __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6339
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 15:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2692 [inline]
 slab_free mm/slub.c:6143 [inline]
 kfree+0x1c1/0x630 mm/slub.c:6461
 rcu_do_batch kernel/rcu/tree.c:2617 [inline]
 rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
 handle_softirqs+0x22a/0x870 kernel/softirq.c:622
 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1063
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Last potentially related work creation:
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
 __call_rcu_common kernel/rcu/tree.c:3131 [inline]
 call_rcu+0xee/0x890 kernel/rcu/tree.c:3251
 bpf_link_put_direct kernel/bpf/syscall.c:3323 [inline]
 bpf_link_release+0x6b/0x80 kernel/bpf/syscall.c:3330
 __fput+0x44f/0xa70 fs/file_table.c:469
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0x69b/0x2320 kernel/exit.c:971
 do_group_exit+0x21b/0x2d0 kernel/exit.c:1112
 __do_sys_exit_group kernel/exit.c:1123 [inline]
 __se_sys_exit_group kernel/exit.c:1121 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121
 x64_sys_call+0x221a/0x2240 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88803828ab00
 which belongs to the cache kmalloc-192 of size 192
The buggy address is located 24 bytes inside of
 freed 192-byte region [ffff88803828ab00, ffff88803828abc0)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88803828a100 pfn:0x3828a
flags: 0x4fff00000000200(workingset|node=1|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 04fff00000000200 ffff88801ac413c0 ffff888030400288 ffffea0000e19390
raw: ffff88803828a100 000000080010000e 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2c00(GFP_NOIO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 22245651735, free_ts 22245107019
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x231/0x280 mm/page_alloc.c:1889
 prep_new_page mm/page_alloc.c:1897 [inline]
 get_page_from_freelist+0x24dc/0x2580 mm/page_alloc.c:3962
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5250
 alloc_slab_page mm/slub.c:3269 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3458
 new_slab mm/slub.c:3516 [inline]
 refill_objects+0x331/0x3c0 mm/slub.c:7153
 refill_sheaf mm/slub.c:2818 [inline]
 __pcs_replace_empty_main+0x2b9/0x620 mm/slub.c:4592
 alloc_from_pcs mm/slub.c:4695 [inline]
 slab_alloc_node mm/slub.c:4829 [inline]
 __do_kmalloc_node mm/slub.c:5237 [inline]
 __kmalloc_noprof+0x474/0x760 mm/slub.c:5250
 kmalloc_noprof include/linux/slab.h:954 [inline]
 usb_alloc_urb+0x46/0x150 drivers/usb/core/urb.c:75
 usb_internal_control_msg drivers/usb/core/message.c:96 [inline]
 usb_control_msg+0x118/0x3e0 drivers/usb/core/message.c:154
 usb_control_msg_send drivers/usb/core/message.c:214 [inline]
 usb_set_configuration+0x127a/0x2110 drivers/usb/core/message.c:2149
 usb_generic_driver_probe+0x8d/0x150 drivers/usb/core/generic.c:250
 usb_probe_device+0x1c4/0x3b0 drivers/usb/core/driver.c:291
 call_driver_probe drivers/base/dd.c:-1 [inline]
 really_probe+0x267/0xaf0 drivers/base/dd.c:661
 __driver_probe_device+0x18c/0x320 drivers/base/dd.c:803
 driver_probe_device+0x4f/0x240 drivers/base/dd.c:833
 __device_attach_driver+0x2d4/0x4c0 drivers/base/dd.c:961
page last free pid 30 tgid 30 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1433 [inline]
 __free_frozen_pages+0xc2b/0xdb0 mm/page_alloc.c:2978
 __kasan_populate_vmalloc_do mm/kasan/shadow.c:393 [inline]
 __kasan_populate_vmalloc+0x1b2/0x1d0 mm/kasan/shadow.c:424
 kasan_populate_vmalloc include/linux/kasan.h:580 [inline]
 alloc_vmap_area+0xd73/0x14b0 mm/vmalloc.c:2129
 __get_vm_area_node+0x1f8/0x300 mm/vmalloc.c:3232
 __vmalloc_node_range_noprof+0x372/0x1730 mm/vmalloc.c:4024
 __vmalloc_node_noprof+0xc2/0x100 mm/vmalloc.c:4124
 alloc_thread_stack_node kernel/fork.c:355 [inline]
 dup_task_struct+0x228/0x9a0 kernel/fork.c:924
 copy_process+0x508/0x3cf0 kernel/fork.c:2050
 kernel_clone+0x248/0x8e0 kernel/fork.c:2654
 user_mode_thread+0x110/0x180 kernel/fork.c:2730
 call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
 process_one_work kernel/workqueue.c:3275 [inline]
 process_scheduled_works+0xb02/0x1830 kernel/workqueue.c:3358
 worker_thread+0xa50/0xfc0 kernel/workqueue.c:3439
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff88803828aa00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88803828aa80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff88803828ab00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff88803828ab80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88803828ac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.


* FAILED: Patch "x86/uprobes: Fix XOL allocation failure for 32-bit tasks" failed to apply to 5.10-stable tree
From: Sasha Levin @ 2026-03-01  2:00 UTC (permalink / raw)
  To: stable, oleg
  Cc: Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users

The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From d55c571e4333fac71826e8db3b9753fadfbead6a Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Sun, 11 Jan 2026 16:00:37 +0100
Subject: [PATCH] x86/uprobes: Fix XOL allocation failure for 32-bit tasks

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs": the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0,
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() cannot
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, because this address
is already used by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, which leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() then returns a high address > TASK_SIZE, and
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn, hence the endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 7be8e361ca55b..619dddf54424e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1823,3 +1823,27 @@ bool is_uprobe_at_func_entry(struct pt_regs *regs)
 
 	return false;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ee3d36eda45dd..f548fea2adec8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
 extern void arch_uprobe_init_state(struct mm_struct *mm);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a7d7d83ca1d78..dfbce021fb027 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
 	.mremap = xol_mremap,
 };
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.51.0
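
The failure mode described in the commit message can be sketched as a toy
userspace model. Everything here is an assumption made for illustration:
the SK_* constants are plausible but unverified address-space limits,
sketch_get_xol_addr() is a hypothetical name, and the "top-down search" is
reduced to picking the highest page below the limit while skipping the one
page assumed to be occupied by the stack. It is not the kernel's algorithm,
only the shape of the "addr > TASK_SIZE - len" check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; assumptions, not taken from the patch. */
#define SK_PAGE_SIZE    0x1000ULL
#define SK_TASK_SIZE_32 0xFFFFE000ULL     /* assumed limit for a 32-bit task */
#define SK_TASK_SIZE_64 0x7FFFFFFFF000ULL /* assumed limit for a 64-bit task */
#define SK_ENOMEM       12

/*
 * Toy model of the XOL address lookup. Returns the chosen address, or
 * -SK_ENOMEM when the candidate lies above what the task can address.
 */
static int64_t sketch_get_xol_addr(bool task_is_32bit, bool in_32bit_syscall,
				   bool hint_is_free)
{
	uint64_t task_size = task_is_32bit ? SK_TASK_SIZE_32 : SK_TASK_SIZE_64;
	uint64_t hint = task_size - SK_PAGE_SIZE;
	uint64_t high_limit, addr;

	if (hint_is_free)
		return (int64_t)hint;

	/*
	 * Modeled problem: the search limit follows the "in a 32-bit
	 * syscall" state rather than the task's address-space size.
	 */
	high_limit = in_32bit_syscall ? SK_TASK_SIZE_32 : SK_TASK_SIZE_64;

	/* Highest page below the limit, skipping the occupied hint page. */
	addr = high_limit - SK_PAGE_SIZE;
	if (addr == hint)
		addr -= SK_PAGE_SIZE;

	/* Same shape as the "if (addr > TASK_SIZE - len)" check. */
	if (addr > task_size - SK_PAGE_SIZE)
		return -SK_ENOMEM;

	return (int64_t)addr;
}
```

With a 32-bit task, an occupied hint and in_32bit_syscall == false, the
model returns -SK_ENOMEM; forcing in_32bit_syscall to true, which is what
setting TS_COMPAT achieves in the patch, makes it return an address just
below the hint.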






* FAILED: Patch "tracing: Fix to set write permission to per-cpu buffer_size_kb" failed to apply to 5.15-stable tree
From: Sasha Levin @ 2026-03-01  1:55 UTC (permalink / raw)
  To: stable, mhiramat
  Cc: Mathieu Desnoyers, Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f844282deed7481cf2f813933229261e27306551 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Tue, 10 Feb 2026 17:43:36 +0900
Subject: [PATCH] tracing: Fix to set write permission to per-cpu
 buffer_size_kb

Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd7211 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 845b8a165daf3..fd470675809b3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8613,7 +8613,7 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
 	trace_create_cpu_file("stats", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_stats_fops);
 
-	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_WRITE, d_cpu,
 				tr, cpu, &tracing_entries_fops);
 
 	if (tr->range_addr_start)
-- 
2.51.0
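
The effect of switching the mode argument can be checked from userspace by
looking at the owner write bit of the file. The sketch below is a minimal
helper under stated assumptions: SK_TRACE_MODE_READ and SK_TRACE_MODE_WRITE
are assumed to mirror the kernel's TRACE_MODE_READ (0440) and
TRACE_MODE_WRITE (0640), whose values the patch does not show, and both
function names are hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Assumed values of the kernel's TRACE_MODE_* constants (not in the patch). */
#define SK_TRACE_MODE_READ  0440
#define SK_TRACE_MODE_WRITE 0640

/* Returns 1 if the owner write bit is set in mode, 0 otherwise. */
static int owner_can_write(mode_t mode)
{
	return (mode & S_IWUSR) != 0;
}

/* Returns 1 or 0 for an existing path, or -1 if stat() fails. */
static int path_owner_writable(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	return owner_can_write(st.st_mode);
}
```

On a system with tracefs mounted, path_owner_writable() could be pointed at
a per-cpu buffer_size_kb file; with the fix applied it would be expected to
return 1 for that file.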






* FAILED: Patch "x86/uprobes: Fix XOL allocation failure for 32-bit tasks" failed to apply to 5.15-stable tree
From: Sasha Levin @ 2026-03-01  1:50 UTC (permalink / raw)
  To: stable, oleg
  Cc: Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users

The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From d55c571e4333fac71826e8db3b9753fadfbead6a Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Sun, 11 Jan 2026 16:00:37 +0100
Subject: [PATCH] x86/uprobes: Fix XOL allocation failure for 32-bit tasks

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs": the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0,
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() cannot
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, because this address
is already used by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, which leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() then returns a high address > TASK_SIZE, and
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn, hence the endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 7be8e361ca55b..619dddf54424e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1823,3 +1823,27 @@ bool is_uprobe_at_func_entry(struct pt_regs *regs)
 
 	return false;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ee3d36eda45dd..f548fea2adec8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
 extern void arch_uprobe_init_state(struct mm_struct *mm);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a7d7d83ca1d78..dfbce021fb027 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
 	.mremap = xol_mremap,
 };
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.51.0






* FAILED: Patch "tracing: Fix to set write permission to per-cpu buffer_size_kb" failed to apply to 6.1-stable tree
From: Sasha Levin @ 2026-03-01  1:47 UTC (permalink / raw)
  To: stable, mhiramat
  Cc: Mathieu Desnoyers, Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f844282deed7481cf2f813933229261e27306551 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Tue, 10 Feb 2026 17:43:36 +0900
Subject: [PATCH] tracing: Fix to set write permission to per-cpu
 buffer_size_kb

Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd7211 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 845b8a165daf3..fd470675809b3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8613,7 +8613,7 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
 	trace_create_cpu_file("stats", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_stats_fops);
 
-	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_WRITE, d_cpu,
 				tr, cpu, &tracing_entries_fops);
 
 	if (tr->range_addr_start)
-- 
2.51.0






* FAILED: Patch "x86/uprobes: Fix XOL allocation failure for 32-bit tasks" failed to apply to 6.1-stable tree
From: Sasha Levin @ 2026-03-01  1:42 UTC (permalink / raw)
  To: stable, oleg
  Cc: Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users

The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From d55c571e4333fac71826e8db3b9753fadfbead6a Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Sun, 11 Jan 2026 16:00:37 +0100
Subject: [PATCH] x86/uprobes: Fix XOL allocation failure for 32-bit tasks

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs": the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0,
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() cannot
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, because this address
is already used by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, which leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() then returns a high address > TASK_SIZE, and
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn, hence the endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 7be8e361ca55b..619dddf54424e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1823,3 +1823,27 @@ bool is_uprobe_at_func_entry(struct pt_regs *regs)
 
 	return false;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ee3d36eda45dd..f548fea2adec8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
 extern void arch_uprobe_init_state(struct mm_struct *mm);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a7d7d83ca1d78..dfbce021fb027 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
 	.mremap = xol_mremap,
 };
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.51.0






* FAILED: Patch "tracing: Wake up poll waiters for hist files when removing an event" failed to apply to 6.6-stable tree
From: Sasha Levin @ 2026-03-01  1:39 UTC (permalink / raw)
  To: stable, petr.pavlu
  Cc: Mathieu Desnoyers, Tom Zanussi, Masami Hiramatsu (Google),
	Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From 9678e53179aa7e907360f5b5b275769008a69b80 Mon Sep 17 00:00:00 2001
From: Petr Pavlu <petr.pavlu@suse.com>
Date: Thu, 19 Feb 2026 17:27:02 +0100
Subject: [PATCH] tracing: Wake up poll waiters for hist files when removing an
 event

The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.

Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/trace_events.h | 5 +++++
 kernel/trace/trace_events.c  | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0a2b8229b999c..37eb2f0f3dd8e 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -683,6 +683,11 @@ static inline void hist_poll_wakeup(void)
 
 #define hist_poll_wait(file, wait)	\
 	poll_wait(file, &hist_poll_wq, wait)
+
+#else
+static inline void hist_poll_wakeup(void)
+{
+}
 #endif
 
 #define __TRACE_EVENT_FLAGS(name, value)				\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 61fe01dce7a6f..b659653dc03ac 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1311,6 +1311,9 @@ static void remove_event_file_dir(struct trace_event_file *file)
 	free_event_filter(file->filter);
 	file->flags |= EVENT_FILE_FL_FREED;
 	event_file_put(file);
+
+	/* Wake up hist poll waiters to notice the EVENT_FILE_FL_FREED flag. */
+	hist_poll_wakeup();
 }
 
 /*
-- 
2.51.0






* FAILED: Patch "tracing: Fix checking of freed trace_event_file for hist files" failed to apply to 6.6-stable tree
From: Sasha Levin @ 2026-03-01  1:39 UTC (permalink / raw)
  To: stable, petr.pavlu
  Cc: Mathieu Desnoyers, Tom Zanussi, Masami Hiramatsu (Google),
	Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f0a0da1f907e8488826d91c465f7967a56a95aca Mon Sep 17 00:00:00 2001
From: Petr Pavlu <petr.pavlu@suse.com>
Date: Thu, 19 Feb 2026 17:27:01 +0100
Subject: [PATCH] tracing: Fix checking of freed trace_event_file for hist
 files

The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.

Correct the access method to event_file_file() in both functions.
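
The difference between the two accessors can be sketched with a small
userspace model. This is not the kernel implementation: the acc_ names and
the flag value are hypothetical, and only the behaviour described in the
text above is modeled (a raw pointer read versus a read that also checks
the freed flag).

```c
#include <stddef.h>

/* Hypothetical stand-in for EVENT_FILE_FL_FREED. */
#define ACC_FL_FREED 0x1u

struct acc_event_file {
	unsigned int flags;
};

struct acc_inode {
	void *i_private;
};

/* Models event_file_data() as described: returns the stored pointer. */
struct acc_event_file *acc_file_data(const struct acc_inode *inode)
{
	return inode->i_private;
}

/* Models event_file_file() as described: NULL once the file is freed. */
struct acc_event_file *acc_file_file(const struct acc_inode *inode)
{
	struct acc_event_file *file = inode->i_private;

	if (!file || (file->flags & ACC_FL_FREED))
		return NULL;
	return file;
}
```

For a file whose flags contain ACC_FL_FREED, acc_file_data() still returns
a non-NULL pointer, so a NULL check on its result never notices the
removal, while acc_file_file() returns NULL.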

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_events_hist.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index e6f449f53afcc..768df987419e3 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5784,7 +5784,7 @@ static __poll_t event_hist_poll(struct file *file, struct poll_table_struct *wai
 
 	guard(mutex)(&event_mutex);
 
-	event_file = event_file_data(file);
+	event_file = event_file_file(file);
 	if (!event_file)
 		return EPOLLERR;
 
@@ -5822,7 +5822,7 @@ static int event_hist_open(struct inode *inode, struct file *file)
 
 	guard(mutex)(&event_mutex);
 
-	event_file = event_file_data(file);
+	event_file = event_file_file(file);
 	if (!event_file) {
 		ret = -ENODEV;
 		goto err;
-- 
2.51.0






* FAILED: Patch "tracing: Fix to set write permission to per-cpu buffer_size_kb" failed to apply to 6.6-stable tree
From: Sasha Levin @ 2026-03-01  1:38 UTC (permalink / raw)
  To: stable, mhiramat
  Cc: Mathieu Desnoyers, Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f844282deed7481cf2f813933229261e27306551 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Tue, 10 Feb 2026 17:43:36 +0900
Subject: [PATCH] tracing: Fix to set write permission to per-cpu
 buffer_size_kb

Since the per-cpu buffer_size_kb file is writable for changing
per-cpu ring buffer size, the file should have the write access
permission.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/177071301597.2293046.11683339475076917920.stgit@mhiramat.tok.corp.google.com
Fixes: 21ccc9cd7211 ("tracing: Disable "other" permission bits in the tracefs files")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 845b8a165daf3..fd470675809b3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8613,7 +8613,7 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu)
 	trace_create_cpu_file("stats", TRACE_MODE_READ, d_cpu,
 				tr, cpu, &tracing_stats_fops);
 
-	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_READ, d_cpu,
+	trace_create_cpu_file("buffer_size_kb", TRACE_MODE_WRITE, d_cpu,
 				tr, cpu, &tracing_entries_fops);
 
 	if (tr->range_addr_start)
-- 
2.51.0






* FAILED: Patch "x86/uprobes: Fix XOL allocation failure for 32-bit tasks" failed to apply to 6.6-stable tree
From: Sasha Levin @ 2026-03-01  1:32 UTC (permalink / raw)
  To: stable, oleg
  Cc: Paulo Andrade, Peter Zijlstra (Intel), linux-trace-kernel,
	linux-perf-users

The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From d55c571e4333fac71826e8db3b9753fadfbead6a Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Sun, 11 Jan 2026 16:00:37 +0100
Subject: [PATCH] x86/uprobes: Fix XOL allocation failure for 32-bit tasks

This script

	#!/usr/bin/bash

	echo 0 > /proc/sys/kernel/randomize_va_space

	echo 'void main(void) {}' > TEST.c

	# -fcf-protection to ensure that the 1st endbr32 insn can't be emulated
	gcc -m32 -fcf-protection=branch TEST.c -o test

	bpftrace -e 'uprobe:./test:main {}' -c ./test

"hangs", the probed ./test task enters an endless loop.

The problem is that with randomize_va_space == 0
get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not
just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used
by the stack vma.

arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and
in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE.
vm_unmapped_area() happily returns the high address > TASK_SIZE and then
get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)"
check.

handle_swbp() doesn't report this failure (probably it should) and silently
restarts the probed insn. Endless loop.

I think that the right fix should change the x86 get_unmapped_area() paths
to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if
CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case
because ->orig_ax = -1.

But we need a simple fix for -stable, so this patch just sets TS_COMPAT if
the probed task is 32-bit to make in_ia32_syscall() true.
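
The failure described above can be approximated with a heavily simplified
userspace model. Everything here is an assumption made for illustration:
the xol_model_ names, the address constants, and the one-step "skip below
the stack" search are hypothetical and do not mirror the real
vm_unmapped_area() logic. Only the final "addr > TASK_SIZE - len" check is
taken from the text.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* All constants are hypothetical, chosen only for illustration. */
#define XOL_MODEL_PAGE_SIZE   0x1000ULL
#define XOL_MODEL_TASK_SIZE   0xFFFFE000ULL      /* 32-bit task limit */
#define XOL_MODEL_LIMIT_64    0x7FFFFFFFF000ULL  /* limit when the task looks 64-bit */
#define XOL_MODEL_STACK_START (XOL_MODEL_TASK_SIZE - 4 * XOL_MODEL_PAGE_SIZE)

/*
 * Simplified top-down search for one free page. The stack occupies
 * [XOL_MODEL_STACK_START, XOL_MODEL_TASK_SIZE), so the hinted top page
 * is already taken. "compat" selects which upper limit the search uses.
 */
int64_t xol_model_get_area(bool compat)
{
	uint64_t len = XOL_MODEL_PAGE_SIZE;
	uint64_t high_limit = compat ? XOL_MODEL_TASK_SIZE : XOL_MODEL_LIMIT_64;
	uint64_t addr = high_limit - len;

	/* Move below the stack if the candidate page overlaps it. */
	if (addr >= XOL_MODEL_STACK_START && addr < XOL_MODEL_TASK_SIZE)
		addr = XOL_MODEL_STACK_START - len;

	/* Final range check, as quoted in the commit message. */
	if (addr > XOL_MODEL_TASK_SIZE - len)
		return -ENOMEM;

	return (int64_t)addr;
}
```

With compat == false the candidate lies far above the 32-bit limit and the
model returns -ENOMEM; with compat == true it returns a page just below the
modeled stack.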

Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: Paulo Andrade <pandrade@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 +++++++---
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 7be8e361ca55b..619dddf54424e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1823,3 +1823,27 @@ bool is_uprobe_at_func_entry(struct pt_regs *regs)
 
 	return false;
 }
+
+#ifdef CONFIG_IA32_EMULATION
+unsigned long arch_uprobe_get_xol_area(void)
+{
+	struct thread_info *ti = current_thread_info();
+	unsigned long vaddr;
+
+	/*
+	 * HACK: we are not in a syscall, but x86 get_unmapped_area() paths
+	 * ignore TIF_ADDR32 and rely on in_32bit_syscall() to calculate
+	 * vm_unmapped_area_info.high_limit.
+	 *
+	 * The #ifdef above doesn't cover the CONFIG_X86_X32_ABI=y case,
+	 * but in this case in_32bit_syscall() -> in_x32_syscall() always
+	 * (falsely) returns true because ->orig_ax == -1.
+	 */
+	if (test_thread_flag(TIF_ADDR32))
+		ti->status |= TS_COMPAT;
+	vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+	ti->status &= ~TS_COMPAT;
+
+	return vaddr;
+}
+#endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ee3d36eda45dd..f548fea2adec8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -242,6 +242,7 @@ extern void arch_uprobe_clear_state(struct mm_struct *mm);
 extern void arch_uprobe_init_state(struct mm_struct *mm);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
+extern unsigned long arch_uprobe_get_xol_area(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a7d7d83ca1d78..dfbce021fb027 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1694,6 +1694,12 @@ static const struct vm_special_mapping xol_mapping = {
 	.mremap = xol_mremap,
 };
 
+unsigned long __weak arch_uprobe_get_xol_area(void)
+{
+	/* Try to map as high as possible, this is only a hint. */
+	return get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, PAGE_SIZE, 0, 0);
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 {
@@ -1709,9 +1715,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	}
 
 	if (!area->vaddr) {
-		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-						PAGE_SIZE, 0, 0);
+		area->vaddr = arch_uprobe_get_xol_area();
 		if (IS_ERR_VALUE(area->vaddr)) {
 			ret = area->vaddr;
 			goto fail;
-- 
2.51.0






* FAILED: Patch "tracing: Wake up poll waiters for hist files when removing an event" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:29 UTC (permalink / raw)
  To: stable, petr.pavlu
  Cc: Mathieu Desnoyers, Tom Zanussi, Masami Hiramatsu (Google),
	Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From 9678e53179aa7e907360f5b5b275769008a69b80 Mon Sep 17 00:00:00 2001
From: Petr Pavlu <petr.pavlu@suse.com>
Date: Thu, 19 Feb 2026 17:27:02 +0100
Subject: [PATCH] tracing: Wake up poll waiters for hist files when removing an
 event

The event_hist_poll() function attempts to verify whether an event file is
being removed, but this check may not occur or could be unnecessarily
delayed. This happens because hist_poll_wakeup() is currently invoked only
from event_hist_trigger() when a hist command is triggered. If the event
file is being removed, no associated hist command will be triggered and a
waiter will be woken up only after an unrelated hist command is triggered.

Fix the issue by adding a call to hist_poll_wakeup() in
remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This
ensures that a task polling on a hist file is woken up and receives
EPOLLERR.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/trace_events.h | 5 +++++
 kernel/trace/trace_events.c  | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0a2b8229b999c..37eb2f0f3dd8e 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -683,6 +683,11 @@ static inline void hist_poll_wakeup(void)
 
 #define hist_poll_wait(file, wait)	\
 	poll_wait(file, &hist_poll_wq, wait)
+
+#else
+static inline void hist_poll_wakeup(void)
+{
+}
 #endif
 
 #define __TRACE_EVENT_FLAGS(name, value)				\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 61fe01dce7a6f..b659653dc03ac 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1311,6 +1311,9 @@ static void remove_event_file_dir(struct trace_event_file *file)
 	free_event_filter(file->filter);
 	file->flags |= EVENT_FILE_FL_FREED;
 	event_file_put(file);
+
+	/* Wake up hist poll waiters to notice the EVENT_FILE_FL_FREED flag. */
+	hist_poll_wakeup();
 }
 
 /*
-- 
2.51.0






* FAILED: Patch "tracing: ring-buffer: Fix to check event length before using" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:29 UTC (permalink / raw)
  To: stable, mhiramat
  Cc: Mathieu Desnoyers, Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From 912b0ee248c529a4f45d1e7f568dc1adddbf2a4a Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Date: Mon, 16 Feb 2026 18:30:15 +0900
Subject: [PATCH] tracing: ring-buffer: Fix to check event length before using

Check the event length before adding it to the index of the next event in
rb_read_data_buffer(). Since this function is used for validating
possibly broken ring buffers, the length of the event could be broken.
In that case, the next event (e + len) can point to a wrong address.
To avoid invalid memory access at boot, check whether the length of
each event is in the possible range before using it.
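
The same bounds check can be shown on a toy record format in userspace.
This is only a sketch: rbm_count_events() and the one-byte length prefix
are hypothetical and unrelated to the real ring buffer event layout. The
loop shape and the range test follow the patch.

```c
#include <stdint.h>

/*
 * Walk records stored back to back in data[0..tail). In this toy format
 * the first byte of each record holds the record's total length.
 * Returns the number of records, or -1 if a length is out of range.
 */
int rbm_count_events(const uint8_t *data, int tail)
{
	int events = 0;
	int len;
	int e;

	for (e = 0; e < tail; e += len) {
		len = data[e];
		/* Reject lengths that would stall or run past the tail. */
		if (len <= 0 || len > tail - e)
			return -1;
		events++;
	}

	return events;
}
```

A buffer holding a 2-byte and a 3-byte record with tail == 5 yields 2,
while a corrupted length of 0 or 200 yields -1 instead of an endless loop
or an out-of-bounds read.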

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index bdc8010d8f482..1e7a34a31851c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1849,6 +1849,7 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 	struct ring_buffer_event *event;
 	u64 ts, delta;
 	int events = 0;
+	int len;
 	int e;
 
 	*delta_ptr = 0;
@@ -1856,9 +1857,12 @@ static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu
 
 	ts = dpage->time_stamp;
 
-	for (e = 0; e < tail; e += rb_event_length(event)) {
+	for (e = 0; e < tail; e += len) {
 
 		event = (struct ring_buffer_event *)(dpage->data + e);
+		len = rb_event_length(event);
+		if (len <= 0 || len > tail - e)
+			return -1;
 
 		switch (event->type_len) {
 
-- 
2.51.0






* FAILED: Patch "tracing: Fix checking of freed trace_event_file for hist files" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:29 UTC (permalink / raw)
  To: stable, petr.pavlu
  Cc: Mathieu Desnoyers, Tom Zanussi, Masami Hiramatsu (Google),
	Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f0a0da1f907e8488826d91c465f7967a56a95aca Mon Sep 17 00:00:00 2001
From: Petr Pavlu <petr.pavlu@suse.com>
Date: Thu, 19 Feb 2026 17:27:01 +0100
Subject: [PATCH] tracing: Fix checking of freed trace_event_file for hist
 files

The event_hist_open() and event_hist_poll() functions currently retrieve
a trace_event_file pointer from a file struct by invoking
event_file_data(), which simply returns file->f_inode->i_private. The
functions then check if the pointer is NULL to determine whether the event
is still valid. This approach is flawed because i_private is assigned when
an eventfs inode is allocated and remains set throughout its lifetime.
Instead, the code should call event_file_file(), which checks for
EVENT_FILE_FL_FREED. Using the incorrect access function may result in the
code potentially opening a hist file for an event that is being removed or
becoming stuck while polling on this file.

Correct the access method to event_file_file() in both functions.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com
Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_events_hist.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index e6f449f53afcc..768df987419e3 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5784,7 +5784,7 @@ static __poll_t event_hist_poll(struct file *file, struct poll_table_struct *wai
 
 	guard(mutex)(&event_mutex);
 
-	event_file = event_file_data(file);
+	event_file = event_file_file(file);
 	if (!event_file)
 		return EPOLLERR;
 
@@ -5822,7 +5822,7 @@ static int event_hist_open(struct inode *inode, struct file *file)
 
 	guard(mutex)(&event_mutex);
 
-	event_file = event_file_data(file);
+	event_file = event_file_file(file);
 	if (!event_file) {
 		ret = -ENODEV;
 		goto err;
-- 
2.51.0






* FAILED: Patch "ring-buffer: Fix possible dereference of uninitialized pointer" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:29 UTC (permalink / raw)
  To: stable, d.dulov
  Cc: kernel test robot, Dan Carpenter, Masami Hiramatsu (Google),
	Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f1547779402c4cd67755c33616b7203baa88420b Mon Sep 17 00:00:00 2001
From: Daniil Dulov <d.dulov@aladdin.ru>
Date: Fri, 13 Feb 2026 13:01:30 +0300
Subject: [PATCH] ring-buffer: Fix possible dereference of uninitialized
 pointer

There is a pointer head_page in rb_meta_validate_events() which is not
initialized at the beginning of the function. This pointer can be
dereferenced if there is a failure during reader page validation. In this
case control is passed to the "invalid" label, where the pointer is
dereferenced in a loop.

To fix the issue, initialize orig_head and head_page before calling
rb_validate_buffer().
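
The control flow in question can be sketched in userspace as follows. The
pgm_ names and the list layout are hypothetical; only the pattern matters:
the pointers used under the "invalid" label are assigned before the first
check that can jump there.

```c
#include <stddef.h>

struct pgm_page {
	int entries;
	struct pgm_page *next;	/* circular list */
};

/*
 * Returns 0 if the (modeled) reader page validates, otherwise resets
 * every page in the circular list and returns how many were reset.
 */
int pgm_validate(struct pgm_page *head, int reader_ok)
{
	struct pgm_page *orig_head, *head_page;
	int reset = 0;

	/* Assigned before the first failure path that reaches "invalid". */
	orig_head = head_page = head;

	if (!reader_ok)
		goto invalid;

	return 0;

invalid:
	do {
		head_page->entries = 0;
		reset++;
		head_page = head_page->next;
	} while (head_page != orig_head);

	return reset;
}
```

If the assignment were placed after the reader check, the loop under
"invalid" would read head_page before it had a value.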

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Cc: stable@vger.kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru
Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/
Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d331034089552..bdc8010d8f482 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1919,6 +1919,8 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	if (!meta || !meta->head_buffer)
 		return;
 
+	orig_head = head_page = cpu_buffer->head_page;
+
 	/* Do the reader page first */
 	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
 	if (ret < 0) {
@@ -1929,7 +1931,6 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
 	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
 	local_set(&cpu_buffer->reader_page->entries, ret);
 
-	orig_head = head_page = cpu_buffer->head_page;
 	ts = head_page->page->time_stamp;
 
 	/*
-- 
2.51.0






* FAILED: Patch "fgraph: Do not call handlers direct when not using ftrace_ops" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:29 UTC (permalink / raw)
  To: stable, rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland,
	linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From f4ff9f646a4d373f9e895c2f0073305da288bc0a Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 18 Feb 2026 10:42:44 -0500
Subject: [PATCH] fgraph: Do not call handlers direct when not using ftrace_ops

The function graph tracer was modified to use the ftrace_ops of the
function tracer. This simplified the code as well as allowed more features
of the function graph tracer.

Not all architectures were converted over, as the conversion requires
HAVE_DYNAMIC_FTRACE_WITH_ARGS to be implemented. Those architectures still
do it the old way, where the function graph tracer handler is called by
the function tracer trampoline. The handler then has to check the hash to
see if the registered handlers want to be called by that function or not.

In order to speed up the function graph tracer that used ftrace_ops, if
only one callback was registered with function graph, it would call its
function directly via a static call.

Now, if the architecture does not support using ftrace_ops and still has
the ftrace function trampoline calling the function graph handler, then
doing a direct call removes the check against the handler's hash (the list
of functions it wants callbacks for), and it may call that handler for
functions that the handler did not request calls for.

On 32bit x86, which does not support the ftrace_ops use with function
graph tracer, it shows the issue:

 ~# trace-cmd start -p function_graph -l schedule
 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  2) * 11898.94 us |  schedule();
  3) # 1783.041 us |  schedule();
  1)               |  schedule() {
  ------------------------------------------
  1)   bash-8369    =>  kworker-7669
  ------------------------------------------
  1)               |        schedule() {
  ------------------------------------------
  1)  kworker-7669  =>   bash-8369
  ------------------------------------------
  1) + 97.004 us   |  }
  1)               |  schedule() {
 [..]

Now by starting the function tracer in another instance:

 ~# trace-cmd start -B foo -p function

This causes the function graph tracer to trace all functions (because the
function tracer calls the function graph tracer for each one, and the
function graph tracer is doing a direct call):

 ~# trace-cmd show
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  1)   1.669 us    |          } /* preempt_count_sub */
  1) + 10.443 us   |        } /* _raw_spin_unlock_irqrestore */
  1)               |        tick_program_event() {
  1)               |          clockevents_program_event() {
  1)   1.044 us    |            ktime_get();
  1)   6.481 us    |            lapic_next_event();
  1) + 10.114 us   |          }
  1) + 11.790 us   |        }
  1) ! 181.223 us  |      } /* hrtimer_interrupt */
  1) ! 184.624 us  |    } /* __sysvec_apic_timer_interrupt */
  1)               |    irq_exit_rcu() {
  1)   0.678 us    |      preempt_count_sub();

When it should still only be tracing the schedule() function.

To fix this, add a macro FGRAPH_NO_DIRECT that is set to 1 when the
architecture does not support function graph use of ftrace_ops, and to
0 otherwise. Then use this macro to decide whether the function graph
tracer may call the handlers directly.
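
A small userspace model may help to see the effect. It follows the
definitions in the ftrace.h hunk of this patch, where FGRAPH_NO_DIRECT is 1
for architectures that must always test the ops hash. The fgm_ names, the
string-compare "hash", and the hit counter are hypothetical simplifications.

```c
#include <stdbool.h>
#include <string.h>

/* 1 models an architecture without function graph use of ftrace_ops. */
#ifndef FGM_NO_DIRECT
#define FGM_NO_DIRECT 1
#endif

struct fgm_gops {
	const char *filter;	/* the single function this handler wants */
	int hits;		/* how often the handler was called */
};

/* Stand-in for the hash lookup against the handler's function list. */
bool fgm_hash_matches(const struct fgm_gops *gops, const char *func)
{
	return strcmp(gops->filter, func) == 0;
}

/* Called for every traced function, as from the function tracer. */
void fgm_graph_entry(struct fgm_gops *gops, const char *func,
		     bool direct_enabled)
{
	if (!FGM_NO_DIRECT && direct_enabled) {
		/* Direct call: the hash is not consulted. */
		gops->hits++;
		return;
	}

	if (fgm_hash_matches(gops, func))
		gops->hits++;
}
```

With FGM_NO_DIRECT set to 1, a handler filtering on "schedule" is hit only
for schedule() even when direct_enabled is true. Building the model with
-DFGM_NO_DIRECT=0 reproduces the over-tracing, since every function then
counts as a hit.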

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home
Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/ftrace.h | 13 ++++++++++---
 kernel/trace/fgraph.c  | 12 +++++++++++-
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1a4d36fc90852..c242fe49af4c9 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1092,10 +1092,17 @@ static inline bool is_ftrace_trampoline(unsigned long addr)
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 #ifndef ftrace_graph_func
-#define ftrace_graph_func ftrace_stub
-#define FTRACE_OPS_GRAPH_STUB FTRACE_OPS_FL_STUB
+# define ftrace_graph_func ftrace_stub
+# define FTRACE_OPS_GRAPH_STUB FTRACE_OPS_FL_STUB
+/*
+ * The function graph is called every time the function tracer is called.
+ * It must always test the ops hash and cannot just directly call
+ * the handler.
+ */
+# define FGRAPH_NO_DIRECT	1
 #else
-#define FTRACE_OPS_GRAPH_STUB 0
+# define FTRACE_OPS_GRAPH_STUB	0
+# define FGRAPH_NO_DIRECT	0
 #endif
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
 
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 4df766c690f92..40d373d65f9b9 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -539,7 +539,11 @@ static struct fgraph_ops fgraph_stub = {
 static struct fgraph_ops *fgraph_direct_gops = &fgraph_stub;
 DEFINE_STATIC_CALL(fgraph_func, ftrace_graph_entry_stub);
 DEFINE_STATIC_CALL(fgraph_retfunc, ftrace_graph_ret_stub);
+#if FGRAPH_NO_DIRECT
+static DEFINE_STATIC_KEY_FALSE(fgraph_do_direct);
+#else
 static DEFINE_STATIC_KEY_TRUE(fgraph_do_direct);
+#endif
 
 /**
  * ftrace_graph_stop - set to permanently disable function graph tracing
@@ -843,7 +847,7 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
 	bitmap = get_bitmap_bits(current, offset);
 
 #ifdef CONFIG_HAVE_STATIC_CALL
-	if (static_branch_likely(&fgraph_do_direct)) {
+	if (!FGRAPH_NO_DIRECT && static_branch_likely(&fgraph_do_direct)) {
 		if (test_bit(fgraph_direct_gops->idx, &bitmap))
 			static_call(fgraph_retfunc)(&trace, fgraph_direct_gops, fregs);
 	} else
@@ -1285,6 +1289,9 @@ static void ftrace_graph_enable_direct(bool enable_branch, struct fgraph_ops *go
 	trace_func_graph_ret_t retfunc = NULL;
 	int i;
 
+	if (FGRAPH_NO_DIRECT)
+		return;
+
 	if (gops) {
 		func = gops->entryfunc;
 		retfunc = gops->retfunc;
@@ -1308,6 +1315,9 @@ static void ftrace_graph_enable_direct(bool enable_branch, struct fgraph_ops *go
 
 static void ftrace_graph_disable_direct(bool disable_branch)
 {
+	if (FGRAPH_NO_DIRECT)
+		return;
+
 	if (disable_branch)
 		static_branch_disable(&fgraph_do_direct);
 	static_call_update(fgraph_func, ftrace_graph_entry_stub);
-- 
2.51.0






* FAILED: Patch "function_graph: Restore direct mode when callbacks drop to one" failed to apply to 6.12-stable tree
From: Sasha Levin @ 2026-03-01  1:28 UTC (permalink / raw)
  To: stable, hu.shengming; +Cc: Steven Rostedt (Google), linux-trace-kernel

The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

Thanks,
Sasha

------------------ original commit in Linus's tree ------------------

From 53b2fae90ff01fede6520ca744ed5e8e366497ba Mon Sep 17 00:00:00 2001
From: Shengming Hu <hu.shengming@zte.com.cn>
Date: Fri, 13 Feb 2026 14:29:32 +0800
Subject: [PATCH] function_graph: Restore direct mode when callbacks drop to
 one

When registering a second fgraph callback, the direct path is disabled and
the array loop is used instead.  When ftrace_graph_active falls back to
one, we try to re-enable direct mode via
ftrace_graph_enable_direct(true, ...). But ftrace_graph_enable_direct()
incorrectly disables the static key rather than enabling it.  This leaves
fgraph_do_direct permanently off after the first multi-callback
transition, so the direct fast mode is never restored.
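
The state transitions described above can be traced with a tiny userspace
model. The sk_ names are hypothetical, a plain bool stands in for the
static key, and the "buggy" parameter selects the behaviour before the fix.

```c
#include <stdbool.h>

struct sk_fgraph {
	bool do_direct;	/* stands in for the fgraph_do_direct static key */
	int active;	/* number of registered callbacks */
};

/* Models ftrace_graph_enable_direct(); the buggy variant turns it off. */
void sk_enable_direct(struct sk_fgraph *fg, bool enable_branch, bool buggy)
{
	if (enable_branch)
		fg->do_direct = !buggy;
}

/* A second callback switches from the direct path to the array loop. */
void sk_register(struct sk_fgraph *fg)
{
	fg->active++;
	if (fg->active == 2)
		fg->do_direct = false;
}

/* Dropping back to one callback tries to restore direct mode. */
void sk_unregister(struct sk_fgraph *fg, bool buggy)
{
	fg->active--;
	if (fg->active == 1)
		sk_enable_direct(fg, true, buggy);
}
```

Starting from one callback with do_direct == true, a register followed by
an unregister leaves do_direct false in the buggy variant and true in the
fixed one.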

Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn
Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/fgraph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index cc48d16be43e0..4df766c690f92 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1303,7 +1303,7 @@ static void ftrace_graph_enable_direct(bool enable_branch, struct fgraph_ops *go
 	static_call_update(fgraph_func, func);
 	static_call_update(fgraph_retfunc, retfunc);
 	if (enable_branch)
-		static_branch_disable(&fgraph_do_direct);
+		static_branch_enable(&fgraph_do_direct);
 }
 
 static void ftrace_graph_disable_direct(bool disable_branch)
-- 
2.51.0






