Linux Trace Kernel


* [BUG] tracing/kprobe: perf dynamic ustring sample can exceed PERF_MAX_TRACE_SIZE and WARN
From: Yifei Chu @ 2026-05-24 14:44 UTC
  To: linux-trace-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel

[-- Attachment #1.1: Type: text/plain, Size: 1776 bytes --]

Hello,

Short version: I can make a kprobe/kretprobe trace event with dynamic
ustring fetch args ask perf_trace_buf_alloc() for more than
PERF_MAX_TRACE_SIZE. That hits WARN_ONCE(), and with panic_on_warn=1 it
becomes a reproducible kernel panic.

The reproducers create a kprobe or kretprobe trace event with several
ustring args pointing at a 4095-byte userspace string, open the event
through perf_event_open(PERF_TYPE_TRACEPOINT), and trigger it. The
accumulated dynamic payload size is then passed to perf_trace_buf_alloc(),
which trips this check:

WARN_ONCE(size > PERF_MAX_TRACE_SIZE, …)

I reproduced this through both kprobe and kretprobe events.

Tested environment:

Linux version 7.0.9, x86_64 QEMU
gcc 12.3.0, GNU ld 2.38
Boot args included: panic_on_warn=1 nokaslr console=ttyS0

Kprobe result:

perf buffer not large enough, wanted 16420, have 8192
WARNING: kernel/trace/trace_event_perf.c:405 at
perf_trace_buf_alloc+0x111/0x160
Kernel panic - not syncing: kernel: panic_on_warn set …

Kretprobe result:

perf buffer not large enough, wanted 16428, have 8192
WARNING: kernel/trace/trace_event_perf.c:405 at
perf_trace_buf_alloc+0x111/0x160
kretprobe_perf_func+0x24b/0x750
Kernel panic - not syncing: kernel: panic_on_warn set …

I checked current mainline source and still see PERF_MAX_TRACE_SIZE as 8192
and the WARN_ONCE path in perf_trace_buf_alloc(). I have reproduced the
panic on the 7.0.9 QEMU build above; I have not yet runtime-tested current
mainline.

My expectation is that a user-defined dynamic trace payload that is too
large for the perf trace buffer should be rejected, capped, or dropped
without reaching WARN_ONCE().
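
To make that expectation concrete, here is a minimal userspace sketch of the
kind of size check I mean. It is not kernel code: sample_fits() and its
parameters are hypothetical names chosen for illustration, and only the 8192
limit comes from the log above. The reported "wanted 16420" would be
consistent with four 4096-byte strings plus a small fixed part, but that
breakdown is my guess.

```c
#include <stdbool.h>
#include <stddef.h>

/* Limit quoted in the log above ("have 8192"). */
#define PERF_MAX_TRACE_SIZE 8192

/*
 * Hypothetical helper, not a kernel function: decide whether a sample
 * made of a fixed part plus dynamic string data fits in the perf trace
 * buffer. A false return means the caller should drop or cap the
 * sample quietly instead of reaching a WARN.
 */
bool sample_fits(size_t fixed_size, const size_t *dyn_sizes, size_t nr_dyn)
{
	size_t total = fixed_size;

	for (size_t i = 0; i < nr_dyn; i++) {
		/* Refuse the sample rather than let the sum wrap around. */
		if (dyn_sizes[i] > (size_t)-1 - total)
			return false;
		total += dyn_sizes[i];
	}
	return total <= PERF_MAX_TRACE_SIZE;
}
```

With four 4096-byte strings the helper reports that the sample does not fit,
while a single such string does.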

The attached tarball has README files, both C reproducers, and the full
QEMU logs.

Thanks,
Chuyifei

[-- Attachment #2: trace_kprobe_kretprobe_perf_ustring_warn_panic.tar.gz --]
[-- Type: application/x-tar, Size: 26750 bytes --]


* [PATCH] tracing: fix CFI violation in probestub helper
From: Eva Kurchatova @ 2026-05-24 15:43 UTC
  To: mhiramat, rostedt
  Cc: linux-trace-kernel, linux-kernel, mathieu.desnoyers, peterz,
	jpoimboe, samitolvanen, eva.kurchatova

When multiple callbacks are registered on the same tracepoint, the
probestub is called indirectly via the traceiter helper.

The pointer to the probestub callback resides in the __tracepoints
section, which is excluded from ENDBR checks in objtool. Pointers to the
regfunc/unregfunc callbacks reside in the extended structure, however,
which is not affected.

On a CFI-enabled machine, registering multiple callbacks therefore
results in a #CP exception due to the missing ENDBR in the __probestub
helper.

Fix this by adding a CFI_NOSEAL annotation to the probestub declaration.

Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks")
Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
---
 include/linux/tracepoint.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 583d962abcc3..5a32a709759c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -19,6 +19,7 @@
 #include <linux/rcupdate.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
+#include <asm/cfi.h>

 struct module;
 struct tracepoint;
@@ -356,6 +357,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	void __probestub_##_name(void *__data, proto)			\
 	{								\
 	}								\
+	CFI_NOSEAL(__probestub_##_name);				\
 	DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);

 #define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args)		\
-- 
2.54.0


* [PATCH RESEND] rtla: Fix output files in source tree
From: Ben Hutchings @ 2026-05-24 16:24 UTC
  To: Steven Rostedt, Tomas Glozar; +Cc: linux-trace-kernel, Ian Rogers

[-- Attachment #1: Type: text/plain, Size: 4786 bytes --]

Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
currently generated in the source tree, preventing a fully out-of-tree
build.  To fix this:

- Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
  create subdirectories as needed
- Add $(OUTPUT)src to the include path
- Add ${OUTPUT} to the BPF object filename in tests/timerlat.t

Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
Signed-off-by: Ben Hutchings <benh@debian.org>
Reviewed-by: Ian Rogers <irogers@google.com>
---
 tools/tracing/rtla/Makefile         | 31 ++++++++++++++++++-----------
 tools/tracing/rtla/tests/timerlat.t |  4 ++--
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/tools/tracing/rtla/Makefile b/tools/tracing/rtla/Makefile
index 45690ee14544..f54da7be735d 100644
--- a/tools/tracing/rtla/Makefile
+++ b/tools/tracing/rtla/Makefile
@@ -66,30 +66,37 @@ ifeq ($(config),1)
   include Makefile.config
 endif
 
+INCLUDES	= -I$(OUTPUT)src
+
 CFLAGS		+= $(INCLUDES) $(LIB_INCLUDES)
 
 export CFLAGS OUTPUT srctree
 
 ifeq ($(BUILD_BPF_SKEL),1)
-src/timerlat.bpf.o: src/timerlat.bpf.c
+$(OUTPUT)src/timerlat.bpf.o: src/timerlat.bpf.c
+	mkdir -p $(@D)
 	$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
 
-src/timerlat.skel.h: src/timerlat.bpf.o
+$(OUTPUT)src/timerlat.skel.h: $(OUTPUT)src/timerlat.bpf.o
+	mkdir -p $(@D)
 	$(QUIET_GENSKEL)$(SYSTEM_BPFTOOL) gen skeleton $< > $@
 
-example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+	mkdir -p $(@D)
 	$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
 
-tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+	mkdir -p $(@D)
 	$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -c $(filter %.c,$^) -o $@
 else
-src/timerlat.skel.h:
-	$(Q)echo '/* BPF skeleton is disabled */' > src/timerlat.skel.h
+$(OUTPUT)src/timerlat.skel.h:
+	mkdir -p $(@D)
+	$(Q)echo '/* BPF skeleton is disabled */' > $@
 
-example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
+$(OUTPUT)example/timerlat_bpf_action.o: example/timerlat_bpf_action.c
 	$(Q)echo "BPF skeleton support is disabled, skipping example/timerlat_bpf_action.o"
 
-tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
+$(OUTPUT)tests/bpf/bpf_action_map.o: tests/bpf/bpf_action_map.c
 	$(Q)echo "BPF skeleton support is disabled, skipping tests/bpf/bpf_action_map.o"
 endif
 
@@ -103,7 +110,7 @@ static: $(RTLA_IN)
 rtla.%: fixdep FORCE
 	make -f $(srctree)/tools/build/Makefile.build dir=. $@
 
-$(RTLA_IN): fixdep FORCE src/timerlat.skel.h
+$(RTLA_IN): fixdep FORCE $(OUTPUT)src/timerlat.skel.h
 	make $(build)=rtla
 
 clean: doc_clean fixdep-clean
@@ -111,10 +118,10 @@ clean: doc_clean fixdep-clean
 	$(Q)find . -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
 	$(Q)rm -f rtla rtla-static fixdep FEATURE-DUMP rtla-*
 	$(Q)rm -rf feature
-	$(Q)rm -f src/timerlat.bpf.o src/timerlat.skel.h example/timerlat_bpf_action.o
+	$(Q)rm -f $(OUTPUT)src/timerlat.bpf.o $(OUTPUT)src/timerlat.skel.h $(OUTPUT)example/timerlat_bpf_action.o
 	$(Q)rm -f $(UNIT_TESTS)
 
-check: $(RTLA) tests/bpf/bpf_action_map.o
+check: $(RTLA) $(OUTPUT)tests/bpf/bpf_action_map.o
 	RTLA=$(RTLA) BPFTOOL=$(SYSTEM_BPFTOOL) prove -o -f -v tests/
-examples: example/timerlat_bpf_action.o
+examples: $(OUTPUT)example/timerlat_bpf_action.o
 .PHONY: FORCE clean check
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..e0f3fc4df655 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -74,12 +74,12 @@ then
 	# Test BPF action program properly in BPF mode
 	[ -z "$BPFTOOL" ] && BPFTOOL=bpftool
 	check "hist with BPF action program (BPF mode)" \
-		"timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
+		"timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o --on-threshold shell,command='$BPFTOOL map dump name rtla_test_map'" \
 		2 '"value": 42'
 else
 	# Test BPF action program failure in non-BPF mode
 	check "hist with BPF action program (non-BPF mode)" \
-		"timerlat hist -T 2 --bpf-action tests/bpf/bpf_action_map.o" \
+		"timerlat hist -T 2 --bpf-action ${OUTPUT}tests/bpf/bpf_action_map.o" \
 		1 "BPF actions are not supported in tracefs-only mode"
 fi
 done

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [PATCHv3 03/12] uprobes/x86: Allow to copy uprobe trampolines on fork
From: Jiri Olsa @ 2026-05-24 21:54 UTC
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4BzYo-8PAXFJt9MHoUn9ux1O2YVxJADC0tGSsacVu_R8Stw@mail.gmail.com>

On Fri, May 22, 2026 at 11:50:54AM -0700, Andrii Nakryiko wrote:
> On Thu, May 21, 2026 at 5:44 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > When we fork or clone without CLONE_VM, the new process won't
> > have the uprobe trampoline vma objects, but it will still have the
> > optimized code calling that trampoline, so it will crash.
> >
> > Fix this by allowing the uprobe trampoline vma objects to be copied
> > to the new process on fork.
> >
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 6824376e253d..11ec6b89b135 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -701,7 +701,7 @@ static struct vm_area_struct *get_uprobe_trampoline(unsigned long vaddr)
> >                 return ERR_PTR(vaddr);
> >
> >         return _install_special_mapping(current->mm, vaddr, PAGE_SIZE,
> > -                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_IO,
> 
> so on fork we'll get sys_uprobe invocations which will go into uprobe
> trampoline and syscall will just keep returning -EPROTO, is that
> right?

so the child gets the inherited optimized call path and now also the
trampoline, which executes the uprobe syscall, so if there's a related
uprobe it will be processed

-EPROTO is returned for uprobe syscall executed out of uprobe trampoline


> 
> what would happen in the similar situation for process with int3
> uprobe being forked/cloned? Will it inherit int3 as well, and then
> will keep hitting interrupts that would just do nothing?
> 
> is there a way to restore original memory page for clones? this
> behavior (unless I'm misunderstanding) seems suboptimal
> performance-wise

the standard uprobes seem to handle this situation by removing the breakpoint
from current->mm if the uprobe->handler returns UPROBE_HANDLER_REMOVE (for all
uprobe consumers) .. and the uprobe->handler does the current->mm filter first in
uprobe_perf_func and returns UPROBE_HANDLER_REMOVE in case it does not match

this works also for optimized uprobes on top of standard uprobes

but we don't do that for uprobe_multi and I think we could do something similar,
the code is there, perhaps we just need to add the filter
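
a rough userspace model of the filter-then-remove idea described above, just
to illustrate it. The struct and function names are made up for this sketch
and are not the kernel's; UPROBE_HANDLER_REMOVE is borrowed only as a name,
with an illustrative value, and the zero-consumer case is an assumption:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct mm_struct; only pointer identity matters here. */
struct mm_model {
	int unused;
};

/* Same name as the kernel's return code; the value here is illustrative. */
#define UPROBE_HANDLER_REMOVE 1

/* Hypothetical consumer that only cares about one address space. */
struct consumer_model {
	const struct mm_model *traced_mm;
};

/* Filter first: ask for removal when the hit comes from another mm. */
int consumer_handler(const struct consumer_model *c,
		     const struct mm_model *current_mm)
{
	if (c->traced_mm != current_mm)
		return UPROBE_HANDLER_REMOVE;
	/* ... a real handler would record the event here ... */
	return 0;
}

/* The breakpoint goes away only if every consumer asked for removal. */
bool should_remove_breakpoint(const struct consumer_model *consumers,
			      size_t nr, const struct mm_model *current_mm)
{
	if (nr == 0)
		return false; /* assumption: nothing to decide without consumers */
	for (size_t i = 0; i < nr; i++) {
		if (consumer_handler(&consumers[i], current_mm) !=
		    UPROBE_HANDLER_REMOVE)
			return false;
	}
	return true;
}
```

in this model a forked child that hits a probe whose only consumer traces the
parent gets the breakpoint removed, while a hit in the parent keeps it.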

jirka


* Re: [PATCHv3 02/12] uprobes/x86: Remove struct uprobe_trampoline object
From: Jiri Olsa @ 2026-05-24 22:13 UTC
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, bpf, linux-trace-kernel,
	ast, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <8861cc7af8e0dfff2674ca4ae6381ea2d82c242dec115774e8e11048bde358c4@mail.kernel.org>

On Thu, May 21, 2026 at 01:26:09PM +0000, bot+bpf-ci@kernel.org wrote:
> > commit 35d93b034b144060a547b507cfb4268692c328c2
> > Author: Jiri Olsa <jolsa@kernel.org>
> > Date:   Thu May 21 14:44:01 2026 +0200
> >
> >     uprobes/x86: Remove struct uprobe_trampoline object
> >
> >     Removing struct uprobe_trampoline object and its tracking code,
> >     because it's not needed. We can do same thing directly on top of
> >     struct vm_area_struct objects.
> >
> >     This makes the code simpler and allows easy propagation of the
> >     trampoline vma object into child process in following change.
> >
> >     Note the original code called destroy_uprobe_trampoline if the
> >     optimization failed, but it only freed the struct uprobe_trampoline
> >     object, not the vma.
> >
> >     Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> 
> This looks like a bug fix for a VMA resource leak. The commit message
> notes that when uprobe optimization failed, the original code only freed
> the struct uprobe_trampoline object but leaked the VMA created by
> _install_special_mapping. Should this include:
> 
>   Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")

no, the logic stays, we just removed one layer.. but we need separate
fix for the possible vma leak, will include it in the next version

jirka


* Re: [PATCH v8] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-24 22:39 UTC
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	john.g.garry, loberman, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260524014204.622699-1-atomlin@atomlin.com>

On Sat, May 23, 2026 at 09:42:04PM -0400, Aaron Tomlin wrote:
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..736e176f6d17 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,61 @@ DECLARE_EVENT_CLASS(block_rq,
>  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
>  );
>  
> +/**
> + * block_rq_tag_wait - triggered when a request is starved of a tag
> + * @q: request queue of the target device
> + * @hctx: hardware context of the request experiencing starvation
> + * @is_sched_tag: indicates whether the starved pool is the software scheduler
> + * @alloc_flags: allocation flags dictating the specific tag pool
> + *
> + * Called immediately before the submitting context is forced to block due
> + * to the exhaustion of available tags (i.e., physical hardware driver
> + * tags, software scheduler tags, or reserved tags). This trace point
> + * indicates that the context will be placed into an uninterruptible state
> + * via io_schedule() until an active request completes and relinquishes its
> + * assigned tag.
> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
> +		 bool is_sched_tag, unsigned int alloc_flags),
> +
> +	TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( u32,		hctx_id			)
> +		__field( u32,		nr_tags			)
> +		__field( bool,		is_sched_tag		)
> +		__field( bool,		is_reserved		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
> +		__entry->hctx_id	= hctx->queue_num;
> +		__entry->is_sched_tag	= is_sched_tag;
> +		__entry->is_reserved	= alloc_flags & BLK_MQ_REQ_RESERVED;
> +
> +		if (__entry->is_reserved) {
> +			__entry->nr_tags = is_sched_tag ?
> +					   hctx->sched_tags->nr_reserved_tags :
> +					   hctx->tags->nr_reserved_tags;
> +		} else {
> +			__entry->nr_tags = is_sched_tag ?
> +					   hctx->sched_tags->nr_tags :
> +					   hctx->tags->nr_tags;
> +		}
> +
> +	),
> +
> +	TP_printk("%d,%d hctx=%u starved on %s%s tags (depth=%u)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->hctx_id,
> +		  __entry->is_sched_tag ? "scheduler" : "hardware",
> +		  __entry->is_reserved ? " reserved" : "",
> +		  __entry->nr_tags)
> +);

This is wrong.

If __entry->is_reserved is false, the current logic incorrectly reports the
total capacity pool depth (i.e., both reserved and standard tags combined).

I have refactored the TP_fast_assign block to evaluate the reserved status
orthogonally, ensuring nr_reserved_tags is correctly reported for I/O
schedulers. Additionally, the unreserved pool calculation has been fixed to
accurately subtract nr_reserved_tags from nr_tags.
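
As a sanity check of the intended arithmetic, the depth selection can be
modelled in plain userspace C. This is only a sketch: struct tags_model and
starved_pool_depth() are hypothetical names, and the struct carries just the
two fields the calculation reads, with nr_tags taken to be the total depth
including the reserved tags.

```c
#include <stdbool.h>

/* Minimal stand-in for the two tag-set fields the tracepoint reads. */
struct tags_model {
	unsigned int nr_tags;          /* total depth, reserved tags included */
	unsigned int nr_reserved_tags; /* reserved part of that depth */
};

/*
 * Depth of the pool the caller is actually starved on: the reserved
 * pool for reserved allocations, otherwise the total minus the
 * reserved part.
 */
unsigned int starved_pool_depth(const struct tags_model *tags,
				bool is_reserved)
{
	if (is_reserved)
		return tags->nr_reserved_tags;
	return tags->nr_tags - tags->nr_reserved_tags;
}
```

The same function applies to either the scheduler or the driver tag set; the
caller picks which one to pass based on is_sched_tag.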

I will include these corrections in the next iteration. Given the extent of
the functional changes to the tracepoint assignment logic, I will drop the
existing "Reviewed-by:" tags.

-- 
Aaron Tomlin


* [PATCH v9] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-25  0:51 UTC
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	john.g.garry, loberman, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) sharing the same
blk_mq_tag_set are starved of hardware tags.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Hi Johannes, Damien, Chaitanya, Laurence,

I have dropped the earlier "Reviewed-by:" and "Tested-by:" tags because
of functional logic changes in the tracepoint assignment block. A fresh
review would be highly appreciated. Thank you.

Changes since v8 [1]:
 - Fixed the standard pool depth calculation in TP_fast_assign to
   accurately report the unreserved capacity by mathematically
   subtracting nr_reserved_tags from nr_tags

 - Removed "Reviewed-by:" and "Tested-by:" tags due to the functional
   logic updates in the tracepoint assignment block

Changes since v7 [2]:
 - Added an is_reserved boolean to the trace record to explicitly expose
   reserved pool starvation to userspace

 - Fixed TP_fast_assign to report the correct nr_reserved_tags depth
   when I/O schedulers utilise the reserved pool

Changes since v6 [3]:
 - Dropped Patch 2. Observability is now driven entirely by the tracepoint,
   with the commit message updated to demonstrate how userspace (e.g.,
   bpftrace) can safely replicate counting out-of-band (Jens Axboe)

 - Moved tracepoint call above sbitmap_prepare_to_wait(). This prevents
   inadvertently resetting the task state under PREEMPT_RT locks

 - Updated the tracepoint signature and TP_fast_assign block to evaluate
   the allocation flags. If the submitting context is starved of a reserved
   tag (BLK_MQ_REQ_RESERVED), the tracepoint now accurately reports the
   severely constrained nr_reserved_tags depth instead of the total nr_tags
   depth.

Changes since v5 [4]:
 - Replaced this_cpu_inc() with raw_cpu_inc() within
   blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
   triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
   preemptible context immediately prior to io_schedule(). This adjustment
   deliberately prioritises the reduction of execution overhead over
   absolute statistical precision for this diagnostic interface.

Changes since v4 [5]:
 - Prevented a NULL pointer dereference in the tracepoint fast-assign for
   disk-less request queues by safely checking q->disk before resolving the
   dev_t

 - Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
   the per-CPU counter allocation from the volatile debugfs lifecycle and
   tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
   and blk_mq_exit_hctx())

 - Fixed a potential compiler double-fetch bug by wrapping the per-CPU
   pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()

 - Passed the appropriate gfp_t flags down to the allocation routines to
   maintain the strict GFP_NOIO context

 - Updated kernel-doc descriptions to clarify that the NULL pointer
   checks guard against memory allocation failures under pressure, rather
   than initialisation race conditions

Changes since v3 [6]:
 - Transitioned tracking architecture from shared atomic_t variables to
   dynamically allocated per-CPU counters to resolve cache line bouncing
   (Bart Van Assche)

Changes since v2 [7]:
 - Added "Reviewed-by:" and "Tested-by:" tags for patch 1

 - Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)

 - Introduced atomic counters via debugfs

Changes since v1 [8]:
 - Improved the description of the trace point (Damien Le Moal)

 - Removed the redundant "active requests" (Laurence Oberman)

 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260524014204.622699-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260523200942.587199-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260517213614.350367-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[6]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[7]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[8]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
---
 block/blk-mq-tag.c           |  6 ++++
 include/trace/events/block.h | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..35deee5bbc73 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -181,6 +182,11 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		/* Log the starvation event before altering task state */
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS,
+					data->flags);
+
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..9c97a16850b9 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,65 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ * @alloc_flags: allocation flags dictating the specific tag pool
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver
+ * tags, software scheduler tags, or reserved tags). This trace point
+ * indicates that the context will be placed into an uninterruptible state
+ * via sbitmap_prepare_to_wait(). If a tag is not acquired in the final
+ * lockless retry, the context will yield the CPU via io_schedule() until
+ * an active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+		 bool is_sched_tag, unsigned int alloc_flags),
+
+	TP_ARGS(q, hctx, is_sched_tag, alloc_flags),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+		__field( bool,		is_reserved		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+		__entry->is_reserved	= alloc_flags & BLK_MQ_REQ_RESERVED;
+
+		if (__entry->is_reserved) {
+			__entry->nr_tags = is_sched_tag ?
+					   hctx->sched_tags->nr_reserved_tags :
+					   hctx->tags->nr_reserved_tags;
+		} else {
+			if (is_sched_tag)
+				__entry->nr_tags = hctx->sched_tags->nr_tags -
+						   hctx->sched_tags->nr_reserved_tags;
+			else
+				__entry->nr_tags = hctx->tags->nr_tags -
+						   hctx->tags->nr_reserved_tags;
+		}
+
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s%s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->is_reserved ? " reserved" : "",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request

base-commit: 6779b50faa562e6cca1aa6a4649a4d764c6c7e28
-- 
2.51.0



* Re: [BUG] tracing/uprobe: oversized dynamic ustring triggers WARN_ON_ONCE panic
From: Masami Hiramatsu @ 2026-05-25  0:56 UTC
  To: Yifei Chu
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel
In-Reply-To: <CAPJnbgJ3ayTqZvau6x4c5Y0=xsL_hUeDm1mVTKCYAhfGZCL6bg@mail.gmail.com>

On Sun, 24 May 2026 10:44:09 -0400
Yifei Chu <yifeichu24@gmail.com> wrote:

> Hello,
> 
> Short version: I can make trace_uprobe hit WARN_ON_ONCE() by creating an
> uprobe/uretprobe event with several dynamic ustring fetch args. With
> panic_on_warn=1, this becomes a reproducible panic.
> 
> The setup is pretty direct. The reproducers mount tracefs, create a trace
> event with several ustring arguments pointing at a 4095-byte userspace
> string, and then trigger the event. At probe hit time, the dynamic string
> sizes are accumulated and prepare_uprobe_buffer() sees a payload larger
> than MAX_UCB_BUFFER_SIZE/PAGE_SIZE:
> 
> WARN_ON_ONCE(ucb->dsize > MAX_UCB_BUFFER_SIZE)
> 
> I reproduced the same class through both uprobe and uretprobe events.

This should be fixed by [1]

[1] https://lore.kernel.org/all/20260428122302.706610ba@gandalf.local.home/

Thanks,

> 
> Tested environment:
> 
> Linux version 7.0.9, x86_64 QEMU
> gcc 12.3.0, GNU ld 2.38
> Boot args included: panic_on_warn=1 nokaslr console=ttyS0
> 
> Uprobe result:
> 
> WARNING: kernel/trace/trace_uprobe.c:982 at
> prepare_uprobe_buffer.part.0+0x458/0x5b0
> Kernel panic - not syncing: kernel: panic_on_warn set …
> 
> Uretprobe result:
> 
> triggering uretprobe oversized ustring buffer at offset 0x1db0
> WARNING: kernel/trace/trace_uprobe.c:982 at
> prepare_uprobe_buffer.part.0+0x458/0x5b0
> uretprobe_dispatcher+0x328/0x3e0
> Kernel panic - not syncing: kernel: panic_on_warn set …
> 
> I checked current mainline source and still see the runtime WARN path in
> kernel/trace/trace_uprobe.c. I have reproduced the panic on the 7.0.9 QEMU
> build above; I have not yet runtime-tested current mainline.
> 
> My expectation is that oversized user-controlled dynamic trace data should
> be rejected, capped, or dropped before it reaches a WARN invariant. A
> tracefs user should not be able to turn a long string fetch into a kernel
> warning/panic.
> 
> The attached tarball has README files, both C reproducers, and the full
> QEMU logs.
> 
> Thanks,
> Chuyifei


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [BUG] tracing/kprobe: perf dynamic ustring sample can exceed PERF_MAX_TRACE_SIZE and WARN
From: Masami Hiramatsu @ 2026-05-25  0:58 UTC
  To: Yifei Chu
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-kernel
In-Reply-To: <CAPJnbgKc7swb2MdOuRHcLUWNX5iApK=-RVN5r6kE74XF=nPgPg@mail.gmail.com>

Hi

On Sun, 24 May 2026 10:44:20 -0400
Yifei Chu <yifeichu24@gmail.com> wrote:

> Hello,
> 
> Short version: I can make a kprobe/kretprobe trace event with dynamic
> ustring fetch args ask perf_trace_buf_alloc() for more than
> PERF_MAX_TRACE_SIZE. That hits WARN_ONCE(), and with panic_on_warn=1 it
> becomes a reproducible kernel panic.
> 
> The reproducers create a kprobe or kretprobe trace event with several
> ustring args pointing at a 4095-byte userspace string, open the event
> through perf_event_open(PERF_TYPE_TRACEPOINT), and trigger it. The dynamic
> payload size is then passed to perf_trace_buf_alloc():
> 
> WARN_ONCE(size > PERF_MAX_TRACE_SIZE, …)
> 
> I reproduced this through both kprobe and kretprobe events.

This also should be fixed by [1]

[1] https://lore.kernel.org/all/20260428122302.706610ba@gandalf.local.home/

But thank you for reporting.

Thanks,

> 
> Tested environment:
> 
> Linux version 7.0.9, x86_64 QEMU
> gcc 12.3.0, GNU ld 2.38
> Boot args included: panic_on_warn=1 nokaslr console=ttyS0
> 
> Kprobe result:
> 
> perf buffer not large enough, wanted 16420, have 8192
> WARNING: kernel/trace/trace_event_perf.c:405 at
> perf_trace_buf_alloc+0x111/0x160
> Kernel panic - not syncing: kernel: panic_on_warn set …
> 
> Kretprobe result:
> 
> perf buffer not large enough, wanted 16428, have 8192
> WARNING: kernel/trace/trace_event_perf.c:405 at
> perf_trace_buf_alloc+0x111/0x160
> kretprobe_perf_func+0x24b/0x750
> Kernel panic - not syncing: kernel: panic_on_warn set …
> 
> I checked current mainline source and still see PERF_MAX_TRACE_SIZE as 8192
> and the WARN_ONCE path in perf_trace_buf_alloc(). I have reproduced the
> panic on the 7.0.9 QEMU build above; I have not yet runtime-tested current
> mainline.
> 
> My expectation is that a user-defined dynamic trace payload that is too
> large for the perf trace buffer should be rejected, capped, or dropped
> without reaching WARN_ONCE().
> 
> The attached tarball has README files, both C reproducers, and the full
> QEMU logs.
> 
> Thanks,
> Chuyifei


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-05-25  1:50 UTC
  To: Balbir Singh
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ag6XyvxR-NU5rGn-@parvat>

On Thu, May 21, 2026 at 04:23:28PM +1000, Balbir Singh wrote:
> On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> > Topic type: MM
> > 
> > Presenter: Gregory Price <gourry@gourry.net>
> > 
> > This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> > managed by the buddy allocator but excluded from normal allocations.
> > 
> > I present it with an end-to-end Compressed RAM service (mm/cram.c)
> > that would otherwise not be possible (or would be considerably more
> > difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> > 
> 
> Do we have updates/notes from the meeting?
> 

I have been on leave since LSF, but I do have some notes posted:

https://lore.kernel.org/linux-mm/af9i7dkNvGGxPHzu@gourry-fedora-PF4VCD3F/
https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/

I will be trying to post an updated set stripped down without the GFP
flag as a first pass w/o RFC tags and no UAPI implications so that
device folks can play with this upstream.

I'm debating on whether to include OPS_MEMPOLICY in the initial version
if only because it's not intuitive how it interacts with pagecache. That
needs more time to bake.

> > 
> > page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
> 
> Do we want to provide kernel level control over allocation of private
> pages, I assumed that only user space applications? I would assume
> node affinity would be the way to do so, unless we have multiple
> 

alloc_pages_node() is the kernel interface

> > 
> > /* Ok but I want to do something useful with it */
> > static const struct node_private_ops ops = {
> >         .migrate_to     = my_migrate_to,
> >         .folio_migrate  = my_folio_migrate,
> >         .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> > };
> > node_private_set_ops(nid, &ops);
> >
> 
> Could you explain this further? Why do OPS_MIGRATION
> and OPS_MEMPOLICY need to be set explicitly?
>

Both of these have been removed from the upcoming version, but in this
RFC version I was testing OPS_MIGRATION as an explicit flag that meant
"migrate.c can touch the folios" while OPS_MEMPOLICY meant "mempolicy.c
can touch the folios".

As it turns out, OPS_MIGRATION is not a useful filter, as it doesn't
actually filter anything (anything using OPS_MIGRATION would also need
its own filter flag, so better to just drop it and do per-server
opt-ins).

~Gregory

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-05-25  2:03 UTC (permalink / raw)
  To: Arun George/Arun George
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	longman, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, osalvador, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, ying.huang, apopple, axelrasmussen, yuanchu,
	weixugc, yury.norov, linux, mhiramat, mathieu.desnoyers, tj,
	hannes, mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <1891546521.01779449881859.JavaMail.epsvc@epcpadp2new>

On Fri, May 22, 2026 at 02:10:34PM +0530, Arun George/Arun George wrote:
> Thanks.
> 
> On 05-05-2026 01:15 pm, Gregory Price wrote:
> > In the scenario I'm talking about, a "write budget" is defined as a
> > number of pages that are allowed to be mapped writable in the page
> > tables at any given time.
> > Agree. I was also in the same context.
> 
> I am trying to bring the device perspective here, and would like to 
> discuss a few corner cases and possible solutions.
> 
> As I see, solving the compressed memory problem statement has these 
> aspects mainly:
> 
> 1) Allocation control: private/managed memory concept.
> 2) Write control: write-protected PTEs, write-controlled use cases like 
> ZSWAP
> 3) Proactive reclaims: optional methods to ease back-pressure using 
> memory shrinkers, ballooning, kswapd, promotion etc. These methods will 
> be triggered based on notifications/interrupts from the device.
> 
> May be they are not enough to cover some corner cases for cram!
> 
>   I believe that this thin-provisioned memory infra is susceptible to 

I'm not understanding the "thin provisioned" terminology you're using
here.  Can you help define what you mean by thin-provision in this case?

> 'writes-above-media-capacity corner cases' (because of not handling 
> device back-pressure notifications in time) whichever methods we use in 
> the kernel. Even if we use write-controlled methods like ZSWAP and 
> pro-active reclaims, there could be corner cases where the communication 
> with the device could be broken and the write path is not aware of it 
> immediately. Note that OCP spec [1] says the device should mark the 
> memory location as 'poisoned' in 'over-capacity' writes.
> 

The intent is to use the low-watermark to prevent new allocations from
occurring, and the write-controls prevent writing to the device without
interposition.

With a sufficient watermark such that the interrupt is delivered within
some number of microseconds, that should be perfectly fine to prevent
poison from ever occurring at all.

Since poison is only delivered *on read*, the system can go a long,
long time before poison is discovered. From the end-user perspective,
this poison is basically unacceptable.

So either we can prevent poison from ever occurring, or the hardware
is not viable to support in scaled production.

If you think a sufficiently conservative watermark + write-protection is
insufficient to defend against poison, then please let me know why.

> So I have the following proposals / options for this scenario.
> 
>     Option 1: Poisoned data management - This is about accepting that 
> poisoning of memory locations can happen in much more regular frequency 
> here than regular memories and we need to figure out potential recovery 
> mechanisms in host (not recovery of data; but recovery from the poison 
> situation). But I guess folks will not be okay with it in general, and I 
> am not aware of any workloads where data poisoning is tolerated (may be 
> caching workloads?).
> 

Given option 1, I would never put such a device into my production
environment.  The only reasonable action for handling poison is killing
the software, as the data is functionally corrupted.

>     Option 2 (preferred): Device assisted write budgeting - This is 
> about a device aware / assisted mechanism for the write-controlled 
> use-cases (Ex: ZSWAP) to know the 'safe number of  writes' that can be 
> performed to the device (Or allows to be mapped writable in the page 
> tables). This could be like a 'token bucket' algorithm, where the device 
> provides a 'budget / set of tokens' to the host. And it need to be 
> replenished periodically in the device communication code path; and if 
> the host does not find the token, writes cannot go ahead.
> 

When I say budgeting, I mean literally a budget of writable pages,
entirely controlled by software (mm/cram.c or zswap.c or whatever).

This has nothing to do with device operation / throttling / bandwidth
budgets etc.  It is simply a proposal of an optimization that allows the
user to say:  X out of Y possible pages may be mapped writable.

I don't think this would be part of an initial MVP for a compressed RAM
service (regardless of whether it's cram.c or zswap.c).
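
A minimal sketch of the software-only budget described above, assuming a
simple counter is enough to express "X out of Y pages may be mapped
writable". The names demo_write_budget, demo_budget_get() and
demo_budget_put() are hypothetical and do not come from the series, and
the sketch ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical software-only budget: at most `limit` pages writable at once. */
struct demo_write_budget {
	unsigned long limit;     /* X: pages allowed to be mapped writable */
	unsigned long writable;  /* pages currently mapped writable */
};

/* Take one slot before making a page writable; false means "keep it read-only". */
static bool demo_budget_get(struct demo_write_budget *b)
{
	if (b->writable >= b->limit)
		return false;
	b->writable++;
	return true;
}

/* Give the slot back once the page has been write-protected again. */
static void demo_budget_put(struct demo_write_budget *b)
{
	assert(b->writable > 0);
	b->writable--;
}
```

Nothing in this sketch talks to the device: the limit is a number chosen
by software, which is the distinction drawn above between a page budget
and device-side throttling.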

> In short, the communication with the device has to be maintained to make 
> pages mapped writable. For MVP, this could be a simple constraint of 
> checking actual device capacity periodically to replenish write-budget 
> for CRAM. For other users of private nodes (GPU memory?), this 
> constraint may not be needed at all.
> 
> We are planning to send an RFC code which will fit into your CRAM infra 
> to discuss this poison management approach further.
> 

I'll try to get a new version out this or next week, apologies for the
lag on this series, I've had a number of disruptions and major movements
on the patch set since I last updated it in February.

~Gregory

^ permalink raw reply

* [PATCH] tracing/probes: Point the error offset correctly for eprobe argument error
From: Masami Hiramatsu (Google) @ 2026-05-25  2:21 UTC (permalink / raw)
  To: Steven Rostedt, Shuah Khan
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-kselftest

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Fix the error offset reported for an eprobe argument error.
In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter
fetching code to common parser"), the error offset was set to 0 when a
non-existent formal parameter was specified for an eprobe. This was done
for backward compatibility with what the test expected, but that
compatibility was a mistake. Correct both the test and the implementation
so that the error points at the right position.

Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 kernel/trace/trace_probe.c                         |    2 --
 .../test.d/dynevent/eprobes_syntax_errors.tc       |    2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 39f040c863e8..695310571b08 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -957,8 +957,6 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 			code->op = FETCH_OP_COMM;
 			return 0;
 		}
-		/* backward compatibility */
-		ctx->offset = 0;
 		goto inval;
 	}
 
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
index 4f5e8c665156..2a680c086047 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
@@ -20,7 +20,7 @@ check_error 'e:foo/^123456789012345678901234567890123456789012345678901234567890
 check_error 'e:foo/^bar.1 syscalls/sys_enter_openat'	# BAD_EVENT_NAME
 
 check_error 'e:foo/bar syscalls/sys_enter_openat arg=^dfd'	# BAD_FETCH_ARG
-check_error 'e:foo/bar syscalls/sys_enter_openat ^arg=$foo'	# BAD_ATTACH_ARG
+check_error 'e:foo/bar syscalls/sys_enter_openat arg=^$foo'	# BAD_ATTACH_ARG
 
 if grep -q '<attached-group>\.<attached-event>.*\[if <filter>\]' README; then
   check_error 'e:foo/bar syscalls/sys_enter_openat if ^'	# NO_EP_FILTER


^ permalink raw reply related

* Re: [PATCH v3 2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
From: Mukesh Savaliya @ 2026-05-25  5:17 UTC (permalink / raw)
  To: Konrad Dybcio, Praveen Talari, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Brown
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	aniket.randive, chandana.chiluveru, jyothi.seerapu
In-Reply-To: <5c59952f-72c5-4aee-a1a6-4b11b0a542c3@oss.qualcomm.com>



On 5/19/2026 5:04 PM, Konrad Dybcio wrote:
> On 5/19/26 12:59 PM, Mukesh Savaliya wrote:
>> Hi Praveen, one question below.
>>
>> On 5/18/2026 10:30 PM, Praveen Talari wrote:
>>> Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
>>> These trace events enable runtime debugging and performance analysis
>>> of SPI operations.
>>>
>>> The trace events capture SPI clock configuration, setup parameters,
>>> transfer details, interrupt status.
>> Don't you need trace logs around PM operations ? ie. runtime and system PM ?
> 
> The PM core provides a couple, are those enough?
> 
Understood, should be enough. Thanks !
> Konrad


^ permalink raw reply

* Re: [PATCH v2] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Masami Hiramatsu @ 2026-05-25  5:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, kernel-team, sashiko-bot, sashiko-reviews
In-Reply-To: <8f8d0cd380f40b84557304d110f38ecefc994a35.camel@surriel.com>

On Fri, 22 May 2026 16:39:41 -0400
Rik van Riel <riel@surriel.com> wrote:

> On Wed, 2026-05-20 at 16:41 -0400, Steven Rostedt wrote:
> > 
> > Rik,
> > 
> > Can you reply to Sashiko here?
> > 
> > The below comes from Sashiko:
> > https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260513161916.04151502%40fangorn
> > 
> > Thanks,
> > 
> > -- Steve
> > 
> > 
> > > commit f90ac36f39b79a19c015522bb1caf7ef903caf02
> > > Author: Rik van Riel <riel@surriel.com>
> > > perf/ftrace: Fix WARNING in __unregister_ftrace_function
> > > This commit attempts to fix a WARN_ON triggered during
> > > perf_event_alloc() error
> > > cleanup by checking FTRACE_OPS_FL_ENABLED before unregistering the
> > > ftrace
> > > function. The warning occurs because the ftrace_ops registration
> > > failed or was
> > > torn down by perf_try_init_event() earlier, but the event->destroy
> > > callback is
> > > executed again during the error path.
> > > > diff --git a/kernel/trace/trace_event_perf.c
> > > > b/kernel/trace/trace_event_perf.c
> > > > index 61e3a2620fa3c..8d64f17135e8f 100644
> > > > --- a/kernel/trace/trace_event_perf.c
> > > > +++ b/kernel/trace/trace_event_perf.c
> > > > @@ -495,7 +495,11 @@ static int
> > > > perf_ftrace_function_register(struct perf_event *event)
> > > >  static int perf_ftrace_function_unregister(struct perf_event
> > > > *event)
> > > >  {
> > > >  	struct ftrace_ops *ops = &event->ftrace_ops;
> > > > -	int ret = unregister_ftrace_function(ops);
> > > > +	int ret = 0;
> > > > +
> > > > +	if (ops->flags & FTRACE_OPS_FL_ENABLED)
> > > > +		ret = unregister_ftrace_function(ops);
> > > > +
> > > >  	ftrace_free_filter(ops);
> > 
> > > Does calling ftrace_free_filter() unconditionally here mask a
> > > double-teardown
> > > regression while leaving the underlying double-free active?
> 
> I don't see how calling ftrace_free_filter() twice would
> cause issues, given that it sets the ->*_hash values to
> EMPTY_HASH:
> 
> void ftrace_free_filter(struct ftrace_ops *ops)
> {
>         ftrace_ops_init(ops);
>         if (WARN_ON(ops->flags & FTRACE_OPS_FL_ENABLED))
>                 return;
>         free_ftrace_hash(ops->func_hash->filter_hash);
>         free_ftrace_hash(ops->func_hash->notrace_hash);
>         ops->func_hash->filter_hash = EMPTY_HASH;
>         ops->func_hash->notrace_hash = EMPTY_HASH;
> }
> 
> void free_ftrace_hash(struct ftrace_hash *hash)
> {
>         if (!hash || hash == EMPTY_HASH)
>                 return;
> ..
> 

Yeah, confirmed.
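
For illustration, here is a minimal userspace sketch of the sentinel
pattern quoted above, assuming only what the quoted code shows: the free
helper ignores NULL and the shared empty hash, and the caller resets its
pointers to that sentinel. All demo_* names are hypothetical stand-ins,
not kernel symbols.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ftrace_hash. */
struct demo_hash {
	int nr_entries;
};

/* Shared sentinel playing the role of EMPTY_HASH; it is never freed. */
static struct demo_hash demo_empty_hash;
#define DEMO_EMPTY_HASH (&demo_empty_hash)

/* Counts real frees so the effect of a second call can be observed. */
static int demo_free_calls;

static void demo_free_hash(struct demo_hash *hash)
{
	if (!hash || hash == DEMO_EMPTY_HASH)
		return;		/* nothing owned, nothing to free */
	demo_free_calls++;
	free(hash);
}

struct demo_ops {
	struct demo_hash *filter_hash;
	struct demo_hash *notrace_hash;
};

/* Free both hashes, then point them at the sentinel. */
static void demo_free_filter(struct demo_ops *ops)
{
	demo_free_hash(ops->filter_hash);
	demo_free_hash(ops->notrace_hash);
	ops->filter_hash = DEMO_EMPTY_HASH;
	ops->notrace_hash = DEMO_EMPTY_HASH;
}

static struct demo_hash *demo_alloc_hash(void)
{
	struct demo_hash *hash = calloc(1, sizeof(*hash));

	assert(hash);
	return hash;
}
```

After the first demo_free_filter() call both pointers equal the
sentinel, so a second call returns early in demo_free_hash() and frees
nothing.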

> 
> > > In perf_try_init_event(), if a PMU event_init() succeeds but a
> > > subsequent
> > > capability check fails, it explicitly calls event->destroy(event)
> > > to roll back:
> > > kernel/events/core.c:perf_try_init_event() {
> > >     ...
> > > 		if (ret && event->destroy)
> > > 			event->destroy(event);
> > >     ...
> > > }
> 
> The error handling there all seems to "goto err_destroy"
> 
> err_destroy:
>         if (event->destroy) {
>                 event->destroy(event);
>                 event->destroy = NULL;
>         }
> 
> 
> > > However, it does not set event->destroy to NULL.
> 
> ... but it does?
> 
> I am not sure what code Sashiko is looking at,
> but it does not look like the code I just pulled.

Indeed.

> 
> Is there a different tree I should be looking at
> than upstream Linus?

You can see the baseline info if you expand the collapsed triangle.
Anyway, it said:

linux-trace/HEAD (70575e77839f4c5337ce2653b39b86bb365a870e)

So that is linux-trace/master.

commit 70575e77839f4c5337ce2653b39b86bb365a870e (linux-trace/master)
Merge: 7bc6e90d7aa4 a43ae8057cc1
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Sep 30 09:41:34 2022 -0700

    Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost


Hmm, this is too old. And linux-trace/master is not used anymore.

Reported to Sashiko.

https://github.com/sashiko-dev/sashiko/issues/218

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Masami Hiramatsu @ 2026-05-25  6:58 UTC (permalink / raw)
  To: Li Pengfei
  Cc: linux-trace-kernel, rostedt, mhiramat, linux-kernel, cmllamas,
	zhangbo56, lipengfei28, lkp
In-Reply-To: <20260522104017.1668638-1-lipengfei28@xiaomi.com>

Hi Pengfei,

On Fri, 22 May 2026 18:40:14 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:

> From: Pengfei Li <lipengfei28@xiaomi.com>
> 
> Hi Steven, all,
> 
> This is v2 of the ftrace stackmap series. It addresses the Sashiko
> review at [1] and incorporates the kernel test robot's toctree fix.
> 
> The series adds stack trace deduplication to ftrace. When the
> stacktrace option is enabled, the ring buffer stores a 4-byte
> stack_id instead of a full kernel stack trace, while the full
> stacks are exported via tracefs.

Sashiko still made some comments on the series. Please review it.

https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

And reply to the comment on this thread, so that we can discuss it
here.

Thanks,



-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-25  7:39 UTC (permalink / raw)
  To: mhiramat
  Cc: linux-trace-kernel, rostedt, linux-kernel, cmllamas, zhangbo56,
	lipengfei28, lkp
In-Reply-To: <20260525155841.b15adcd50d25485aab287043@kernel.org>

Hi Masami,

I went through the Sashiko comments on v2 [1]. Per-finding response
below; v3 will incorporate the fixes.

[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

Patch 1/3:

  - memset() torn reads against lockless readers: agreed, the
    reset path is not well serialized against tracefs readers.
    Will tighten slow-path synchronization in v3.

  - seq_next() not advancing *pos on EOF: agreed, will fix in v3.

  - atomic_read(&resetting) without acquire: agreed, will switch
    to atomic_read_acquire() in v3.

  - Plain reads of entry->key: agreed, will use READ_ONCE() in v3.

  - atomic64_inc() in NMI-safe hot path on 32-bit GENERIC_ATOMIC64:
    agreed, will move the counters off the hot path (local_t /
    per-CPU) in v3.

Patch 2/3:

  - TRACE_STACK_ID not in trace_valid_entry(): agreed, will add in v3.

  - "NULL from kzalloc" comment: wording bug, will correct in v3.

  - Reset memset synchronization: same fix as patch 1, finding 1.

Patch 3/3:

  - Selftest missing 'function:tracer' in '# requires:': agreed,
    will add in v3.

  - Selftest wiping the ring buffer via 'echo nop > current_tracer'
    before reading trace: agreed, will reorder in v3.

I'll send v3 once the changes are tested.

Pengfei

^ permalink raw reply

* Re: [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Masami Hiramatsu @ 2026-05-25  7:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yash Suthar, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel, skhan
In-Reply-To: <20260521181601.77ce38e6@gandalf.local.home>

On Thu, 21 May 2026 18:16:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 22 May 2026 00:58:46 +0530
> Yash Suthar <yashsuthar983@gmail.com> wrote:
> 
> > Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
> > lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
> > uprobe_buffer_disable().
> > 
> > BUG_ON() crashes the kernel, and mutex_is_locked() only checks
> > whether any task holds the lock, not whether the caller does.
> > lockdep_assert_held() checks that the current task holds the lock
> > and does not crash the kernel when the check fails.
> > 
> > Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
> 
> This looks good to me.
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Masami, do you want to take this?

Yeah, let me pick this.

Thanks!
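
The difference between the two checks can be shown with a toy lock that
records its owner. This is only a single-threaded illustration: the
demo_* names and the integer task ids are hypothetical, and the real
lockdep machinery works differently internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock; task ids are plain integers. */
struct demo_mutex {
	bool locked;
	int owner;	/* meaningful only while locked */
};

static void demo_lock(struct demo_mutex *m, int task)
{
	assert(!m->locked);
	m->locked = true;
	m->owner = task;
}

static void demo_unlock(struct demo_mutex *m, int task)
{
	assert(m->locked && m->owner == task);
	m->locked = false;
}

/* Analogue of mutex_is_locked(): true if any task holds the lock. */
static bool demo_is_locked(const struct demo_mutex *m)
{
	return m->locked;
}

/* Analogue of the lockdep_assert_held() condition: the caller is the holder. */
static bool demo_held_by(const struct demo_mutex *m, int task)
{
	return m->locked && m->owner == task;
}
```

With task 1 holding the lock, demo_is_locked() is true even when task 2
asks, so a check built on it passes for a caller that never took the
lock. demo_held_by(m, 2) is false in the same situation.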

> 
> -- Steve
> 
> > ---
> >  kernel/trace/trace_uprobe.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 2cabf8a23ec5..aee0960d0cf7 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
> >  {
> >  	int ret = 0;
> >  
> > -	BUG_ON(!mutex_is_locked(&event_mutex));
> > +	lockdep_assert_held(&event_mutex);
> >  
> >  	if (uprobe_buffer_refcnt++ == 0) {
> >  		ret = uprobe_buffer_init();
> > @@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
> >  {
> >  	int cpu;
> >  
> > -	BUG_ON(!mutex_is_locked(&event_mutex));
> > +	lockdep_assert_held(&event_mutex);
> >  
> >  	if (--uprobe_buffer_refcnt == 0) {
> >  		for_each_possible_cpu(cpu)
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH 07/13] rv: Simply hybrid automata monitors's clock variables
From: Gabriele Monaco @ 2026-05-25  8:03 UTC (permalink / raw)
  To: Nam Cao
  Cc: Steven Rostedt, Wander Lairson Costa, linux-trace-kernel,
	linux-kernel
In-Reply-To: <a779af6dc89721179e0dbab08623e42aa0191275.1777962130.git.namcao@linutronix.de>

On Tue, 2026-05-05 at 08:59 +0200, Nam Cao wrote:
> -static inline bool ha_check_invariant_ns(struct ha_monitor *ha_mon,
> -					 enum envs env, u64 time_ns)
> +static inline bool ha_check_invariant_ns(struct ha_monitor *ha_mon, enum envs
> env,
> +					 u64 time_ns, u64 expire_ns)
>  {
> -	return READ_ONCE(ha_mon->env_store[env]) >= time_ns;
> +	return time_ns - READ_ONCE(ha_mon->env_store[env]) <= expire_ns;
>  }

This function had the silent assumption that invalid/uninitialised
values (max u64) pass the check.

This is no longer working (see nomiss) but could be restored by doing:

  READ_ONCE(ha_mon->env_store[env]) >= time_ns - expire_ns

But.. Yeah, that's a weak assumption. We should probably refactor the
thing to use ha_reset_env() in ha_monitor_reset_all_stored(), then
variables are never going to be uninitialised. It needs a bit of
tinkering but it's definitely better than now.

I'll try and add that to my fixes series.
And I should add some nomiss and stall selftest..
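
To make the wraparound concrete, here is a small userspace sketch of the
two forms of the check, assuming as described above that an
uninitialised clock is stored as the maximum u64 value. The demo_*
helpers are hypothetical and only mirror the arithmetic, not the monitor
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker for "clock never reset" (max u64), per the discussion. */
#define DEMO_ENV_INVALID UINT64_MAX

/* Form used in the patch: time elapsed since the stored reset point. */
static bool demo_check_elapsed(uint64_t env, uint64_t now_ns, uint64_t expire_ns)
{
	return now_ns - env <= expire_ns;
}

/* Rearranged form: compare the stored value with the oldest allowed reset point. */
static bool demo_check_rearranged(uint64_t env, uint64_t now_ns, uint64_t expire_ns)
{
	return env >= now_ns - expire_ns;
}
```

With env equal to DEMO_ENV_INVALID, now_ns - env wraps to now_ns + 1, so
the first form fails as soon as now_ns + 1 exceeds expire_ns, while the
second form passes for any now_ns and expire_ns. Note that the two forms
also differ when now_ns is smaller than expire_ns, because the
subtraction in the second form then wraps.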

> -static inline bool ha_check_invariant_jiffy(struct ha_monitor *ha_mon,
> -					    enum envs env, u64 time_ns)
> +static inline bool ha_check_invariant_jiffy(struct ha_monitor *ha_mon, enum
> envs env,
> +					    u64 time_ns, u64 expire_jiffy)
>  {
> -	return time_after64(READ_ONCE(ha_mon->env_store[env]),
> get_jiffies_64());
> -
> +	return time_after64(READ_ONCE(ha_mon->env_store[env]) + expire_jiffy,
> get_jiffies_64());
>  }

I'd prefer if this was consistent with the above, as in (now - env <=
expire) or (env >= now - expire). Whichever you prefer, but let's keep
the two equivalent.
Or do you have a reason to rearrange it here?

Thanks,
Gabriele

^ permalink raw reply

* Re: [PATCH 2/3] rv/rtapp/sleep: Update nanosleep rule
From: Nam Cao @ 2026-05-25 11:38 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <376235e7f56f4d7fe21c68b39e12bacb9d3e3d19.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Was it intentional to remove this too? It seems to me now it's cleared
> only for existing user threads.
>
> Aren't we sure new kthreads aren't stopping?
...
> You missed removing a couple of these in the next branch, the monitor
> doesn't build.

These are just mistakes from me, sorry about that. Not sure what I was
smoking while sending these patches..

Nam

^ permalink raw reply

* Re: [PATCH 3/3] rv/rtapp: Add wakeup monitor
From: Nam Cao @ 2026-05-25 11:45 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <fb0c1a36b4e773c65591bde2ef31c48711d30163.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:
> Looks neat, so the idea here is that we are looking at the same events
> sleep would react for, just from a different perspective, so it does
> make sense to run them both together.
>
> You may want to set
>
>   depends on RV_PER_TASK_MONITORS >= 3
>
> in rtapp/Kconfig. Though if we plan to add more per-task monitors we may
> need to find a better solution.

Yeah, perhaps some sort of list instead of the current fixed-length array.

>> diff --git a/tools/verification/models/rtapp/wakeup.ltl
>> b/tools/verification/models/rtapp/wakeup.ltl
>> new file mode 100644
>> index 000000000000..a5d63ca0811a
>> --- /dev/null
>> +++ b/tools/verification/models/rtapp/wakeup.ltl
>> @@ -0,0 +1,5 @@
>> +RULE = always (((RT and USER_THREAD) imply
>> +		(not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or
>> ALLOWLIST))
>> +
>> +ALLOWLIST = BLOCK_ON_RT_MUTEX
>> +         or FUTEX_LOCK_PI
>
> So here the events and atoms are similar to the ones in sleep, but since
> we fail on the waking event, we are going to see it from the perspective
> of the waker task, right?
>
> But are those really equivalent? Why do we do RT and USER_THREAD here
> while there is a much more nuanced set of conditions in sleep?
>
> If I understand it correctly, sleep can monitor some kernel threads but
> this monitor does not, is there a reason for that? Are we just not
> interested in the waker for kernel threads?

Urgh, initially I had a patch which drops kernel threads in the sleep
monitor, but finally I decided to drop that patch. This is the leftover
of that.

You are right that we should be consistent and consider kernel threads
here as well.

> I'm not sure if there is any better terminology, but "waking" task makes
> me think of the task that is about to be woken, though it can mean also
> that task that is waking another (what you probably mean here).
>
> What about using the waker/wakee terminology?

waker/wakee would be clearer.

> I see the kernel (events/sched.h) uses waking as well, but it says
> waking context (which is a bit clearer to me than waking task).
> May be worth running it through an LLM which can produce more
> English-native unambiguous wording, or maybe I'm just flipping..
>
> Also please document it in Documentation/trace/rv/monitor_rtapp.rst

Right, thanks for the reminder.

Nam

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-25 13:08 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260521015510.k4p22m365q2wqkro@master>

On Wed, May 20, 2026 at 7:55 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Wed, May 20, 2026 at 06:05:31AM -0600, Nico Pache wrote:
> >On Tue, May 12, 2026 at 9:44 AM Wei Yang <richard.weiyang@gmail.com> wrote:
> >>
> >> On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >> >Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> >main scanning logic using a bitmap to track occupied pages and a stack
> >> >structure that allows us to find optimal collapse sizes.
> >> >
> >> >Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> >scanning phase (mmap_read_lock) that determines a potential PMD
> >> >collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> >phase (mmap_write_lock).
> >> >
> >> >To enable mTHP collapse we make the following changes:
> >> >
> >> >During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> >orders are enabled, we remove the restriction of max_ptes_none during the
> >> >scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> >have scanned the full PMD range and updated the bitmap to track occupied
> >> >pages, we use the bitmap to find the optimal mTHP size.
> >> >
> >> >Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >> >and determine the best eligible order for the collapse. A stack structure
> >> >is used instead of traditional recursion to manage the search, which
> >> >avoids deep recursion on the limited kernel stack. The algorithm
> >> >repeatedly splits the bitmap into smaller chunks to find the highest
> >> >order mTHPs that satisfy the collapse criteria. We start by attempting
> >> >the PMD order, then move on to consecutively lower orders (mTHP
> >> >collapse). The stack maintains a pair of variables (offset, order),
> >> >indicating the number of PTEs from the start of the PMD, and the order of
> >> >the potential collapse candidate.
> >> >
> >> >The algorithm for consuming the bitmap works as such:
> >> >    1) push (0, HPAGE_PMD_ORDER) onto the stack
> >> >    2) pop the stack
> >> >    3) check if the number of set bits in that (offset,order) pair
> >> >       satisfies the max_ptes_none threshold for that order
> >> >    4) if yes, attempt collapse
> >> >    5) if no (or collapse fails), push two new stack items representing
> >> >       the left and right halves of the current bitmap range, at the
> >> >       next lower order
> >> >    6) repeat at step (2) until stack is empty.
> >> >
> >> >Below is a diagram representing the algorithm and stack items:
> >> >
> >> >                            offset   mid_offset
> >> >                            |        |
> >> >                            |        |
> >> >                            v        v
> >> >          ____________________________________
> >> >         |          PTE Page Table            |
> >> >         --------------------------------------
> >> >                           <-------><------->
> >> >                             order-1  order-1
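
The numbered steps and diagram above can be sketched in userspace as
follows. This is a simplified illustration and not the code from the
patch: it uses a toy PMD order of 4, one max_ptes_none threshold for
every order (the patch computes it per order), treats every collapse
attempt as successful, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PMD_ORDER 4			/* toy value: 16 PTEs per "PMD" */
#define DEMO_PMD_NR    (1 << DEMO_PMD_ORDER)
#define DEMO_MIN_ORDER 2			/* lowest order considered */
#define DEMO_STACK_MAX 16

struct demo_range {
	int offset;	/* number of PTEs from the start of the PMD */
	int order;
};

/* Count occupied PTEs in [offset, offset + nr). */
static int demo_count(const bool *occupied, int offset, int nr)
{
	int i, n = 0;

	for (i = 0; i < nr; i++)
		n += occupied[offset + i];
	return n;
}

/*
 * Walk the bitmap with an explicit stack instead of recursion. Every range
 * that passes the threshold is stored in out[]; the return value is the
 * number of ranges stored. out[] needs room for DEMO_PMD_NR entries.
 */
static int demo_scan(const bool *occupied, int max_ptes_none,
		     struct demo_range *out)
{
	struct demo_range stack[DEMO_STACK_MAX];
	int top = 0, found = 0;

	stack[top++] = (struct demo_range){ 0, DEMO_PMD_ORDER };	/* step 1 */
	while (top) {
		struct demo_range r = stack[--top];			/* step 2 */
		int nr = 1 << r.order;
		int none = nr - demo_count(occupied, r.offset, nr);

		if (none <= max_ptes_none) {				/* steps 3-4 */
			out[found++] = r;
			continue;
		}
		if (r.order - 1 < DEMO_MIN_ORDER)
			continue;					/* cannot split further */
		assert(top + 2 <= DEMO_STACK_MAX);
		/* step 5: push the right half first so the left half is popped first */
		stack[top++] = (struct demo_range){ r.offset + nr / 2, r.order - 1 };
		stack[top++] = (struct demo_range){ r.offset, r.order - 1 };
	}
	return found;
}
```

With PTEs 0-7 and 12-15 occupied and max_ptes_none set to 0, the walk
rejects the full range, accepts (offset 0, order 3), splits the right
half, rejects (offset 8, order 2) and accepts (offset 12, order 2).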
> >> >
> >> >mTHP collapses reject regions containing swapped out or shared pages.
> >> >This is because adding new entries can lead to new none pages, and these
> >> >may lead to constant promotion into a higher order mTHP. A similar
> >> >issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> >introducing at least 2x the number of pages, and on a future scan will
> >> >satisfy the promotion condition once again. This issue is prevented via
> >> >the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> >restrictions above.
> >> >
> >> >We currently only support mTHP collapse for max_ptes_none values of 0
> >> >and HPAGE_PMD_NR - 1, resulting in the following behavior:
> >> >
> >> >    - max_ptes_none=0: Never introduce new empty pages during collapse
> >> >    - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >> >      available mTHP order
> >> >
> >> >Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >> >attempts. There should be no behavior change for PMD collapse.
> >> >
> >> >Once we determine which mTHP size fits best in that PMD range a collapse
> >> >is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> >order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >> >
> >> >Currently madv_collapse is not supported and will only attempt PMD
> >> >collapse.
> >> >
> >> >We can also remove the check for is_khugepaged inside the PMD scan as
> >> >the collapse_max_ptes_none() function handles this logic now.
> >> >
> >> >Signed-off-by: Nico Pache <npache@redhat.com>
> >>
> >> [...]
> >>
> >> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >> >+              int referenced, int unmapped, struct collapse_control *cc,
> >> >+              unsigned long enabled_orders)
> >> >+{
> >> >+      unsigned int nr_occupied_ptes, nr_ptes;
> >> >+      int max_ptes_none, collapsed = 0, stack_size = 0;
> >> >+      unsigned long collapse_address;
> >> >+      struct mthp_range range;
> >> >+      u16 offset;
> >> >+      u8 order;
> >> >+
> >> >+      collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >> >+
> >> >+      while (stack_size) {
> >> >+              range = collapse_mthp_stack_pop(cc, &stack_size);
> >> >+              order = range.order;
> >> >+              offset = range.offset;
> >> >+              nr_ptes = 1UL << order;
> >> >+
> >> >+              if (!test_bit(order, &enabled_orders))
> >> >+                      goto next_order;
> >> >+
> >> >+              max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >>
> >> I am thinking whether there is a behavioral change for userfaultfd_armed(vma).
> >>
> >> collapse_single_pmd()
> >>     collapse_scan_pmd
> >>         max_ptes_none = collapse_max_ptes_none(cc, vma)
> >>         max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT                --- (1)
> >>         mthp_collapse
> >>             max_ptes_none = collapse_max_ptes_none(cc, NULL)     --- (2)
> >>             collapse_huge_page(mm)
> >>                 hugepage_vma_revalidate(&vma)
> >>                 __collapse_huge_page_isolate(vma)
> >>                     max_ptes_none = collapse_max_ptes_none(cc, vma)
> >>
> >> Before mthp_collapse() was introduced, a userfaultfd_armed(vma) was skipped
> >> if there was any pte_none_or_zero() in collapse_scan_pmd().
> >>
> >> But now, max_ptes_none could be set to KHUGEPAGED_MAX_PTES_LIMIT at (1), so
> >> that we can scan all the pte to get the bitmap. This means
> >> userfaultfd_armed(vma) could continue even with pte_none_or_zero().
> >>
> >> Then in mthp_collapse(), collapse_max_ptes_none() at (2) ignores
> >> userfaultfd_armed(vma), which means it will continue to collapse a
> >> userfaultfd_armed(vma) when there is pte_none_or_zero().
> >>
> >> The good news is we will stop at __collapse_huge_page_isolate(), where we
> >> get collapse_max_ptes_none() with vma. But we already did a lot of work.
> >
> >Good catch!
> >
> >As you stated we eventually ensure we respect the uffd checks. So
> >there are no correctness issues, just the potential for wasted cycles.
> >
> >At (1) we only do this if mTHPs are enabled. If that is the case, the
> >only waste that can arise is at the PMD order, as that order respects
> >the max_ptes_none value.
> >
> >I think one approach is to gate (1) with the uffd check as well. That
> >way, if mTHPs are enabled and it's uffd-armed, max_ptes_none will stay
> >at 0, and we bail early on the scan if any none PTEs are hit.
> >
> >But then we lose the ability to collapse to mTHPs that are uffd-armed,
> >where the PMD has none/zero PTEs but the mTHP range itself has no
> >none/zero PTEs.
> >
> >i.e., assume a PMD covers 16 PTEs: [xxxxxxxx00000000],
> >where x is a populated PTE and 0 is not.
> >If we guard this scan (1), then we will never check if it's possible to
> >collapse to the smaller orders.
> >
> >Let me know if you see a flaw in my logic, I think it's best to keep it as is?
> >
>
> Yes, gating it at (1) is not the proper place.
>
> I am wondering whether we could pass vma to (2), so that we could respect
> uffd-armed?

OK, sorry I never replied, but I did implement it at (2). Sashiko
brought up a good point that this can result in a UAF; I verified
that. I'm going to send a fixup to v18 to undo the change.

I added this to my todo list, and will look into optimizing/finding a
solution for this in a future series. The good thing is, as you stated
earlier, this can result in some wasted work, but it is not logically
incorrect overall.

Cheers,
-- Nico

>
> >>
> >> Not sure if I missed something.
> >>
> >> >+
> >> >+              if (max_ptes_none < 0)
> >> >+                      return collapsed;
> >> >+
> >> >+              nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >> >+                                                             nr_ptes);
> >> >+
> >> >+              if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >> >+                      int ret;
> >> >+
> >> >+                      collapse_address = address + offset * PAGE_SIZE;
> >> >+                      ret = collapse_huge_page(mm, collapse_address, referenced,
> >> >+                                               unmapped, cc, order);
> >> >+                      if (ret == SCAN_SUCCEED) {
> >> >+                              collapsed += nr_ptes;
> >> >+                              continue;
> >> >+                      }
> >> >+              }
> >> >+
> >> >+next_order:
> >> >+              if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >> >+                      const u8 next_order = order - 1;
> >> >+                      const u16 mid_offset = offset + (nr_ptes / 2);
> >> >+
> >> >+                      collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> >> >+                                               next_order);
> >> >+                      collapse_mthp_stack_push(cc, &stack_size, offset,
> >> >+                                               next_order);
> >> >+              }
> >> >+      }
> >> >+      return collapsed;
> >> >+}
> >> >+
> >> > static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >> >               struct vm_area_struct *vma, unsigned long start_addr,
> >> >               bool *lock_dropped, struct collapse_control *cc)
> >> > {
> >> >-      const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >> >+      int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> >> >       const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> >> >       const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> >> >+      enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
> >> >       pmd_t *pmd;
> >> >-      pte_t *pte, *_pte;
> >> >-      int none_or_zero = 0, shared = 0, referenced = 0;
> >> >+      pte_t *pte, *_pte, pteval;
> >> >+      int i;
> >> >+      int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
> >> >       enum scan_result result = SCAN_FAIL;
> >> >       struct page *page = NULL;
> >> >       struct folio *folio = NULL;
> >> >       unsigned long addr;
> >> >+      unsigned long enabled_orders;
> >> >       spinlock_t *ptl;
> >> >       int node = NUMA_NO_NODE, unmapped = 0;
> >> >
> >> >@@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >> >               goto out;
> >> >       }
> >> >
> >> >+      bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >> >       memset(cc->node_load, 0, sizeof(cc->node_load));
> >> >       nodes_clear(cc->alloc_nmask);
> >> >+
> >> >+      enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>
> >> Would it be 0 at this point?
> >
> >If your question relates to the issue you brought up above, then yes,
> >max_ptes_none would be 0 if it's uffd-armed. We must recheck the
> >uffd-armed status before modifying it to 511.
> >
> >>
> >> >+
> >> >+      /*
> >> >+       * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >> >+       * scan all pages to populate the bitmap for mTHP collapse.
> >> >+       */
> >> >+      if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> >> >+              max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> >> >+
> >> >       pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >> >       if (!pte) {
> >> >               cc->progress++;
> >> >@@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >> >               goto out;
> >> >       }
> >> >
> >> >-      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >> >-           _pte++, addr += PAGE_SIZE) {
> >> >+      for (i = 0; i < HPAGE_PMD_NR; i++) {
> >> >+              _pte = pte + i;
> >> >+              addr = start_addr + i * PAGE_SIZE;
> >> >+              pteval = ptep_get(_pte);
> >> >+
> >> >               cc->progress++;
> >> >
> >> >-              pte_t pteval = ptep_get(_pte);
> >> >               if (pte_none_or_zero(pteval)) {
> >> >                       if (++none_or_zero > max_ptes_none) {
> >> >                               result = SCAN_EXCEED_NONE_PTE;
> >> >@@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >> >                       }
> >> >               }
> >> >
> >> >+              /* Set bit for occupied pages */
> >> >+              __set_bit(i, cc->mthp_bitmap);
> >> >               /*
> >> >                * Record which node the original page is from and save this
> >> >                * information to cc->node_load[].
> >> >@@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> >> >       if (result == SCAN_SUCCEED) {
> >> >               /* collapse_huge_page expects the lock to be dropped before calling */
> >> >               mmap_read_unlock(mm);
> >> >-              result = collapse_huge_page(mm, start_addr, referenced,
> >> >-                                          unmapped, cc, HPAGE_PMD_ORDER);
> >> >+              nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> >> >+                                            cc, enabled_orders);
> >> >               /* collapse_huge_page will return with the mmap_lock released */
> >>
> >> collapse_huge_page will return with mmap_lock released, but mthp_collapse()
> >> may not?
> >
> >We are now releasing the lock before calling mthp_collapse, which
> >subsequently calls collapse_huge_page. Even if `collapse_huge_page` is
> >never called-- say, because enabled_orders is 0 (which should not
> >happen) and all collapse orders are skipped (never calling
> >collapse_huge_page)-- we still return here with the lock dropped.
> >
> >I think this is sound. Let me know if you think differently.
> >
>
> You are right. I missed that the lock is released in the previous patch.
>
> >Cheers :)
> >-- Nico
> >
> >>
> >> >               *lock_dropped = true;
> >> >+              result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
> >> >       }
> >> > out:
> >> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> >> >--
> >> >2.54.0
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
> >>
>
> --
> Wei Yang
> Help you, Help me
>
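
For readers following the [xxxxxxxx00000000] example quoted above, the stack-driven search can be sketched in userspace C. This is only an illustration under stated assumptions: a toy 16-entry PTE table (order 4), a minimum order of 2, and a collapse that is modelled as always succeeding once the occupancy threshold is met, which the real collapse_huge_page() does not guarantee. The per-order scaling done by collapse_max_ptes_none() is not modelled, so the sketch is only meaningful for max_ptes_none = 0. All toy_* and TOY_* names are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

#define TOY_PMD_ORDER 4   /* toy table: 1 << 4 = 16 PTEs, not a real arch value */
#define TOY_MIN_ORDER 2   /* mirrors KHUGEPAGED_MIN_MTHP_ORDER from the patch */
#define TOY_STACK_SIZE (TOY_PMD_ORDER - TOY_MIN_ORDER + 1)

/* Hypothetical stand-in for struct mthp_range. */
struct toy_range {
	uint16_t offset;
	uint8_t order;
};

/* Count the set bits of bitmap in [offset, offset + nr). */
static unsigned int toy_count_present(uint16_t bitmap, unsigned int offset,
				      unsigned int nr)
{
	unsigned int i, count = 0;

	for (i = offset; i < offset + nr; i++)
		count += (bitmap >> i) & 1u;
	return count;
}

/* Returns the number of PTEs covered by (modelled) successful collapses. */
static unsigned int toy_collapse(uint16_t bitmap, unsigned int max_ptes_none)
{
	struct toy_range stack[TOY_STACK_SIZE];
	unsigned int collapsed = 0;
	int top = 0;

	stack[top++] = (struct toy_range){ .offset = 0, .order = TOY_PMD_ORDER };

	while (top) {
		struct toy_range r = stack[--top];
		unsigned int nr = 1u << r.order;

		/* Same test as nr_occupied_ptes >= nr_ptes - max_ptes_none. */
		if (toy_count_present(bitmap, r.offset, nr) + max_ptes_none >= nr) {
			collapsed += nr;
			continue;
		}

		if (r.order > TOY_MIN_ORDER) {
			/* Push the right half first so the left half is popped first. */
			stack[top++] = (struct toy_range){
				.offset = (uint16_t)(r.offset + nr / 2),
				.order = (uint8_t)(r.order - 1),
			};
			stack[top++] = (struct toy_range){
				.offset = r.offset,
				.order = (uint8_t)(r.order - 1),
			};
		}
	}
	return collapsed;
}
```

With bit i standing for PTE i, toy_collapse(0x00FF, 0) models the example: the order-4 attempt is rejected, the left order-3 half is collapsed, and the empty right half is split down to order 2 without any further collapse.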


^ permalink raw reply

* [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Tengda Wu @ 2026-05-25 13:22 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
	linux-kernel, Tengda Wu

When a task calls schedule() to yield the CPU, its state remains
TASK_RUNNING, but its stack is frozen and safe to walk.

Replace task_is_running(tsk) with tsk->on_cpu to avoid overly
conservative rejections.

Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
---
 kernel/trace/rethook.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5a8bdf88999a..bd5e5f455e85 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;
 
-	if (tsk != current && task_is_running(tsk))
+	if (tsk != current && tsk->on_cpu)
 		return 0;
 
 	do {
-- 
2.34.1
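
The effect of the one-line change can be sketched with a simplified userspace model. This is not kernel code: struct toy_task and both helpers are hypothetical, and the two booleans merely stand in for task_is_running(tsk) and tsk->on_cpu. The sketch assumes, as the commit message states, that a task which has been scheduled out while still TASK_RUNNING has a stable stack.

```c
#include <stdbool.h>

/* Simplified stand-in for the two pieces of task state involved. */
struct toy_task {
	bool state_running; /* models task_is_running(tsk) */
	bool on_cpu;        /* models tsk->on_cpu: executing on a CPU right now */
};

/* Check before the patch: refuse any other task whose state is TASK_RUNNING. */
static bool toy_reject_old(const struct toy_task *tsk, bool is_current)
{
	return !is_current && tsk->state_running;
}

/* Check after the patch: refuse only other tasks that are executing right now. */
static bool toy_reject_new(const struct toy_task *tsk, bool is_current)
{
	return !is_current && tsk->on_cpu;
}
```

The two checks differ only for a task that is runnable but not on a CPU: the old check refuses to walk it, the new one allows it.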


^ permalink raw reply related

* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-25 14:15 UTC (permalink / raw)
  To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
  Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-12-npache@redhat.com>


On 5/22/26 9:00 AM, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Prior to this patch, PMD collapse had 3 main phases: a lightweight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), and finally a heavier collapse
> phase (mmap_write_lock).
>
> To enable mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> avoids a traditional recursive approach, given that the kernel stack is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then move on to consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>      1) push (0, HPAGE_PMD_ORDER) onto the stack
>      2) pop the stack
>      3) check if the number of set bits in that (offset,order) pair
>         satisfies the max_ptes_none threshold for that order
>      4) if yes, attempt collapse
>      5) if no (or collapse fails), push two new stack items representing
>         the left and right halves of the current bitmap range, at the
>         next lower order
>      6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                              offset   mid_offset
>                              |        |
>                              |        |
>                              v        v
>            ____________________________________
>           |          PTE Page Table            |
>           --------------------------------------
> 			    <-------><------->
>                               order-1  order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1, resulting in the following behavior:
>
>      - max_ptes_none=0: Never introduce new empty pages during collapse
>      - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>        available mTHP order
>
> Any other max_ptes_none value will emit a warning and default mTHP
> collapse to max_ptes_none=0. There should be no behavior change for PMD
> collapse.
>
> Once we determine which mTHP size fits best in that PMD range, a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 181 +++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 172 insertions(+), 9 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 64ceebc9d8a7..d3d7db8be26c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -99,6 +99,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -110,6 +134,12 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>   };
>   
>   /**
> @@ -1411,20 +1441,137 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>   	return result;
>   }
>   
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, int referenced, int unmapped,
> +		struct collapse_control *cc, unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> +	int collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if ((BIT(order) - 1) & enabled_orders) {
> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}

Hi Andrew,

Can you please append the following fixup, which reverts one of the
changes requested in v17? The issue with the change is described
below.


commit 1e099144dfcdd28e3b3b50b32535798db53866aa
Author: Nico Pache <npache@redhat.com>
Date:   Mon May 25 07:38:59 2026 -0600

     fixup: fix potential use-after-free of vma in mthp_collapse()

     Between V17 and v18, one reviewer (Wei) brought up that we are not
     doing the uffd-armed check until deep in the collapse operation.
     While not functionally incorrect, it can lead to unnecessary work.

     We optimized this by passing the vma variable to mthp_collapse() and
     using the collapse_max_ptes_none() function to check the state of
     uffd-armed, preventing the wasted work later in the collapse.

     mthp_collapse() is called after mmap_read_unlock(), so the vma pointer
     can become stale. Remove the vma parameter and pass NULL to
     collapse_max_ptes_none() instead.

     Signed-off-by: Nico Pache <npache@redhat.com>

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d7db8be26c..a901db5c9201 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
   * If a collapse is permitted, we attempt to collapse the PTE range into a
   * mTHP.
   */
-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
-        unsigned long address, int referenced, int unmapped,
-        struct collapse_control *cc, unsigned long enabled_orders)
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+        int referenced, int unmapped, struct collapse_control *cc,
+        unsigned long enabled_orders)
  {
      unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
      int collapsed = 0, stack_size = 0;
@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
          if (!test_bit(order, &enabled_orders))
              goto next_order;

-        max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+        max_ptes_none = collapse_max_ptes_none(cc, NULL, order);

          nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
                                     nr_ptes);
@@ -1749,7 +1749,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
      if (result == SCAN_SUCCEED) {
          /* collapse_huge_page expects the lock to be dropped before calling */
          mmap_read_unlock(mm);
-        nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
+        nr_collapsed = mthp_collapse(mm, start_addr, referenced,
                           unmapped, cc, enabled_orders);
          /* mmap_lock was released above, set lock_dropped */
          *lock_dropped = true;


> +
>   static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		struct vm_area_struct *vma, unsigned long start_addr,
>   		bool *lock_dropped, struct collapse_control *cc)
>   {
> -	const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>   	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>   	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>   	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>   	enum scan_result result = SCAN_FAIL;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	unsigned long addr;
> +	unsigned long enabled_orders;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
>   
> @@ -1436,8 +1583,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>   	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>   	if (!pte) {
>   		cc->progress++;
> @@ -1445,11 +1603,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>   		cc->progress++;
>   
> -		pte_t pteval = ptep_get(_pte);
>   		if (pte_none_or_zero(pteval)) {
>   			if (++none_or_zero > max_ptes_none) {
>   				result = SCAN_EXCEED_NONE_PTE;
> @@ -1529,6 +1689,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   			}
>   		}
>   
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>   		/*
>   		 * Record which node the original page is from and save this
>   		 * information to cc->node_load[].
> @@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>   	if (result == SCAN_SUCCEED) {
>   		/* collapse_huge_page expects the lock to be dropped before calling */
>   		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> -		/* collapse_huge_page will return with the mmap_lock released */
> +		nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> +					     unmapped, cc, enabled_orders);
> +		/* mmap_lock was released above, set lock_dropped */
>   		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
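
The "height + 1" bound in the MTHP_STACK_SIZE comment can be checked with a small userspace sketch. It assumes the worst case, where every order down to the minimum is enabled and no range ever collapses, so each popped entry above the minimum order pushes two children. Only orders are tracked because offsets do not affect depth. toy_max_stack_depth() and TOY_MIN_ORDER are hypothetical names for illustration only.

```c
#include <stdint.h>

#define TOY_MIN_ORDER 2 /* mirrors KHUGEPAGED_MIN_MTHP_ORDER */

/*
 * Follow the same push-right, push-left, pop-left order as the loop in
 * the patch and return the deepest stack seen. pmd_order is assumed to be
 * in [TOY_MIN_ORDER, 15] so that the local array below is large enough.
 */
static int toy_max_stack_depth(int pmd_order)
{
	uint8_t stack[16];
	int top = 0, deepest;

	stack[top++] = (uint8_t)pmd_order;
	deepest = top;

	while (top) {
		uint8_t order = stack[--top];

		if (order > TOY_MIN_ORDER) {
			stack[top++] = (uint8_t)(order - 1); /* right half */
			stack[top++] = (uint8_t)(order - 1); /* left half */
			if (top > deepest)
				deepest = top;
		}
	}
	return deepest;
}
```

For a PMD order of 9, the height is 9 - 2 = 7 and the sketch reports a maximum depth of 8, matching the formula in the comment.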


^ permalink raw reply related

* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-25 14:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers, Jiri Olsa
In-Reply-To: <20260521225033.56458336@fedora>

On Thu, 21 May 2026 22:50:33 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> +			   struct fetch_insn *end,
> +			   struct traceprobe_parse_context *ctx)
> +{
> +	char *tmp;
> +	int ret;
> +
> +	/* Currently this only works for eprobes */
> +	if (!(ctx->flags & TPARG_FL_TEVENT)) {
> +		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> +		return -EINVAL;
> +	}
> +
> +	tmp = strchr(arg, ')');
> +	if (!tmp) {
> +		trace_probe_log_err(ctx->offset + strlen(arg),
> +				    DEREF_OPEN_BRACE);
> +		return -EINVAL;
> +	}
> +	*tmp = '\0';
> +	ret = query_btf_struct(arg + 1, ctx);
> +	*tmp = ')';

BTW, is there any reason to restore this? @arg is a copied
string; see traceprobe_parse_probe_arg_body().

Thanks,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCHv3 11/12] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: Jiri Olsa @ 2026-05-25 15:44 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, bpf, linux-trace-kernel,
	ast, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <7773c27d0070aa7544a5a7596aceb8abacb4fbdb20a3aab998f05c39de4f20dc@mail.kernel.org>

On Thu, May 21, 2026 at 01:26:07PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 969f4deba9fd..efff0c515184 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index fda3a298ccfc..8004c9568ffa 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
> > index f00cb52874e0..0ee78fb050a1 100644
> > --- a/tools/testing/selftests/bpf/progs/test_usdt.c
> > +++ b/tools/testing/selftests/bpf/progs/test_usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> > index b359b389f6c0..5e38f8605b02 100644
> > --- a/tools/testing/selftests/bpf/usdt_2.c
> > +++ b/tools/testing/selftests/bpf/usdt_2.c
> > @@ -13,4 +13,17 @@ void usdt_2(void)
> >  	USDT(optimized_attach, usdt_2);
> >  }
> >
> > +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> > +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> > +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> > +
> > +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> > +{
> > +	unsigned long a1 = usdt_red_zone_arg1;
> > +	unsigned long a2 = usdt_red_zone_arg2;
> > +	unsigned long a3 = usdt_red_zone_arg3;
> > +
> > +	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
> > +}
> 
> Following up on a review comment from sashiko-bot: is this test reliable
> across different compiler versions?

Andrii replied to this one in the original thread:
  https://lore.kernel.org/bpf/CAEf4BzZ-nASpoF+-vLYZ1D8eWRVd5JWAuhrwx8XZNrX7ZPY5_Q@mail.gmail.com/

jirka

> 
> The x86-64 USDT macro uses the "nor" inline assembly constraint, which allows
> the compiler to allocate these operands to CPU registers. Because there is no
> register pressure in usdt_red_zone_trigger(), modern compilers might allocate
> the local variables a1, a2, and a3 directly to registers rather than the
> stack's red zone.
> 
> If the operands are placed in registers, the uprobe CALL optimization
> clobbering [rsp-8] will not corrupt them. This could cause the test to pass
> even on buggy kernels, creating a false positive.
> 
> The author acknowledged this concern but kept the test as-is, noting that
> selftests compile with -O0 and there's also an assembly-based test.
> 
> Would it be safer to explicitly force these operands into the red zone using
> inline assembly constraints, rather than depending on the compiler's register
> allocator?
> 
> Reference:
> https://lore.kernel.org/bpf/20260509021238.AE98FC2BCB0@smtp.kernel.org/
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26227520868


