* Re: [PATCH] unwind: Add sframe_(un)register() system calls
From: Jens Remus @ 2026-05-22 9:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik
In-Reply-To: <20260521183532.7a145c8a@gandalf.local.home>
On 5/22/2026 12:35 AM, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.
Why two separate system calls? Can't that be one single stacktracectl?
Could they at least be non-sframe specific, e.g. stracktrace_register
and stracktrace_unregister, so that if one would implement e.g. unwind
user dwarf/eh_frame in the future one could pass ehframe_start and
ehframe_end in addition to sframe_start and sframe_end?
>
> Both system calls take a pointer to a new structure:
>
> struct sframe_setup {
> unsigned long sframe_start;
> unsigned long sframe_size;
> unsigned long text_start;
> unsigned long text_size;
> };
>
> and a size of the passed in structure. If the system call needs to be
> extended, then the structure could be changed and the size of that
> structure will tell the kernel that it is the new version. If the kernel
> does not recognize the structure size, it will return -EINVAL.
>
> sframe_start - The virtual address of the sframe section
> sframe_size - The length of the sframe section
> text_start - the text section the sframe represents
> test_size - the length of the section
>
> If other stack tracing functionality is added, it will require a new
> system call.
>
> The unregister only needs the sframe_start and requires all the rest of
> the fields to be 0. In the future, if more can be done, then user space
> can update the other values and check the return code to see if the kernel
> supports it.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>
> Based on top of Jens patches here:
>
> https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
>
> [ Note, I tested this with the same program from the RFC patch ]
>
> Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
>
> - Remove the ioctl() like system call for a unique system call for each
> functionality. Right now there's two functionalities:
> 1. register sframe section
> 2. unregister sframe sections
>
> - Added taking a lock around the mtree logic in __sframe_remove_section()
> as Sashiko mentioned that there could be races from user space
> registering and unregistering sframe sections at the same time.
Doesn't sframe_add_section() then also need likewise?
>
> - Removed [RFC] from subject as I believe this is more likely the way
> this system call will be done.
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> @@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
> asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
> u32 size, u32 flags);
> asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
> +asmlinkage long sys_sframe_register(void *data, unsigned int size);
> +asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SFRAME_H
> +#define _UAPI_LINUX_SFRAME_H
> +
> +struct sframe_setup {
> + unsigned long sframe_start;
> + unsigned long sframe_size;
> + unsigned long text_start;
> + unsigned long text_size;
> +};
> +
> +#endif /* _UAPI_LINUX_SFRAME_H */
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> @@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
> static int __sframe_remove_section(struct mm_struct *mm,
> struct sframe_section *sec)
> {
> - if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> - dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> - return -EINVAL;
> + scoped_guard(mmap_read_lock, mm) {
Why is a read lock sufficient? Doesn't that allow multiple readers?
How does that prevent a concurrent modification of the mm->sframe_mt?
> + if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> + dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> + return -EINVAL;
> + }
Is (or why not) likewise required in sframe_add_section() for the
mtree_insert_range()?
Wasn't the reported issue that while mt_for_each() in
sframe_remove_section() there could be concurrent mtree_erase() in
__sframe_remove_section() followed by mtree_insert_range() in
sframe_add_section(), so that the mt_for_each() could get confused?
> }
>
> call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
> @@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
>
> mtree_destroy(&mm->sframe_mt);
> }
> +
> +/**
> + * sys_sframe_register - register an address for user space stacktrace walking.
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + return sframe_add_section(sframe.sframe_start,
> + sframe.sframe_start + sframe.sframe_size,
> + sframe.text_start,
> + sframe.text_start + sframe.text_size);
> +}
> +
> +/**
> + * sys_sframe_unregister - unregister an sframe address
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * The data->sframe_start is the only value that is used. The rest must
> + * be zero.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + if (sframe.sframe_size || sframe.text_start || sframe.text_size)
> + return -EINVAL;
> +
> + return sframe_remove_section(sframe.sframe_start);
> +}
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply
* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: kernel test robot @ 2026-05-22 8:45 UTC (permalink / raw)
To: Rosen Penev
Cc: oe-lkp, lkp, linux-kernel, linux-trace-kernel, Steven Rostedt,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, oliver.sang
In-Reply-To: <20260520220030.16887-1-rosenp@gmail.com>
Hello,
kernel test robot noticed "BUG:KASAN:global-out-of-bounds_in_ftrace_find_rec_direct" on:
commit: 42ed22b1b4ae179c78e088477950eb1dd1dc9e90 ("[PATCH] ftrace: Use flexible array for hash buckets")
url: https://github.com/intel-lab-lkp/linux/commits/Rosen-Penev/ftrace-Use-flexible-array-for-hash-buckets/20260521-060312
base: https://git.kernel.org/cgit/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/all/20260520220030.16887-1-rosenp@gmail.com/
patch subject: [PATCH] ftrace: Use flexible array for hash buckets
in testcase: boot
config: x86_64-rhel-9.4-kunit
compiler: gcc-14
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202605221023.2bcc5f10-lkp@intel.com
[ 9.849798][ T1] BUG: KASAN: global-out-of-bounds in ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] Read of size 8 at addr ffffffff9675bd48 by task swapper/0/1
[ 9.849798][ T1]
[ 9.849798][ T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.1.0-rc3+ #1 PREEMPT(lazy)
[ 9.849798][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 9.849798][ T1] Call Trace:
[ 9.849798][ T1] <TASK>
[ 9.849798][ T1] dump_stack_lvl (dump_stack.c:94 dump_stack.c:120)
[ 9.849798][ T1] print_address_description+0x70/0x300
[ 9.849798][ T1] print_report (kasan/report.c:482)
[ 9.849798][ T1] ? __virt_addr_valid (linux/mmzone.h:2198 (discriminator 1) linux/mmzone.h:2280 (discriminator 1) x86/mm/physaddr.c:54 (discriminator 1))
[ 9.849798][ T1] ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] kasan_report (kasan/report.c:595)
[ 9.849798][ T1] ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] ? __pfx_trace_selftest_dynamic_test_func (??:?)
[ 9.849798][ T1] ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] register_ftrace_direct (trace/ftrace.c:6057)
[ 9.849798][ T1] ? __pfx_ftrace_stub_direct_tramp (x86/kernel/ftrace_64.S:318)
[ 9.849798][ T1] trace_selftest_startup_function_graph (trace/trace_selftest.c:1139)
[ 9.849798][ T1] ? __pfx_trace_selftest_startup_function_graph (trace/trace_selftest.c:754)
[ 9.849798][ T1] run_tracer_selftest (trace/trace.c:1335)
[ 9.849798][ T1] register_tracer (trace/trace.c:1375 trace/trace.c:1492)
[ 9.849798][ T1] ? __pfx_init_graph_trace (trace/trace_functions_graph.c:1811)
[ 9.849798][ T1] do_one_initcall (main.c:1347)
[ 9.849798][ T1] ? __pfx_do_one_initcall (trace/events/initcall.h:10)
[ 9.849798][ T1] ? asm_sysvec_apic_timer_interrupt (x86/include/asm/idtentry.h:569)
[ 9.849798][ T1] ? ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 9.849798][ T1] do_initcalls (main.c:1409 (discriminator 1) main.c:1425 (discriminator 1))
[ 9.849798][ T1] kernel_init_freeable (main.c:1445 main.c:1658)
[ 9.849798][ T1] ? __pfx_kernel_init_freeable (main.c:1626)
[ 9.849798][ T1] ? __pfx_schedule_timeout (??:?)
[ 9.849798][ T1] ? __pfx__raw_spin_lock_irq (locking/spinlock.c:183)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] kernel_init (main.c:1548)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] ret_from_fork (x86/kernel/process.c:158)
[ 9.849798][ T1] ? __pfx_ret_from_fork (x86/include/asm/entry-common.h:54)
[ 9.849798][ T1] ? switch_fpu (linux/instrumented.h:82 asm-generic/bitops/instrumented-non-atomic.h:141 linux/thread_info.h:133 linux/sched.h:2066 x86/include/asm/fpu/sched.h:34)
[ 9.849798][ T1] ? __switch_to (x86/kernel/process_64.c:619)
[ 9.849798][ T1] ? __switch_to_asm (x86/entry/entry_64.S:206)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 9.849798][ T1] </TASK>
[ 9.849798][ T1]
[ 9.849798][ T1] The buggy address belongs to the variable:
[ 9.849798][ T1] empty_hash+0x28/0xa0
[ 9.849798][ T1]
[ 9.849798][ T1] The buggy address belongs to the physical page:
[ 9.849798][ T1] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x77d95b
[ 9.849798][ T1] flags: 0x17ffffc0002000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
[ 9.849798][ T1] raw: 0017ffffc0002000 ffffea001df656c8 ffffea001df656c8 0000000000000000
[ 9.849798][ T1] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 9.849798][ T1] page dumped because: kasan: bad access detected
[ 9.849798][ T1]
[ 9.849798][ T1] Memory state around the buggy address:
[ 9.849798][ T1] ffffffff9675bc00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ffffffff9675bc80: 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9 00 00 00 00
[ 9.849798][ T1] >ffffffff9675bd00: f9 f9 f9 f9 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9
[ 9.849798][ T1] ^
[ 9.849798][ T1] ffffffff9675bd80: 00 02 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ffffffff9675be00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ==================================================================
[ 9.904835][ T1] Disabling lock debugging due to kernel taint
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260522/202605221023.2bcc5f10-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-22 6:45 UTC (permalink / raw)
To: Wen Yang; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <7917085b-fef1-4017-b289-eefe84332ff5@linux.dev>
Hi Wen,
On Fri, 2026-05-22 at 01:40 +0800, Wen Yang wrote:
> Hi Gabriele,
>
> No specific reason for REL_SOFT — not intentional, reverting to
> REL_HARD.
>
> Reproduced the stall on the same config (PREEMPT_RT +
> PROVE_LOCKING/PROVE_RCU).
>
> Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc,
> unrelated to REL_SOFT/REL_HARD:
>
>
> # original cleanup — wrong order
>
> echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" # (A)
>
> kill "$hog_pid" # (B)
>
>
> (A)
> triggers synchronize_srcu() in the kernel. But tlob_target is stuck
> mid-uprobe_notify_resume holding an SRCU read lock, preempted by the
> FIFO-99 hog -> so the reader never finishes and (B) is never reached.
> rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
great you found the issue and solution. Wonder why lockdep wasn't more
informative, but probably the issue was so frequent to hog that too.
> Fix: kill the hog first:
>
>
> kill "$hog_pid"; wait "$hog_pid"
>
> echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"
>
>
> On the PREEMPT_RT it is more reliably triggered there because rcuc/0
> runs as a preemptible kthread rather than in softirq, making it easier
> for the hog to monopolise the CPU long enough to hit the stall.
>
> Thank you for the thorough review and valuable suggestions. We are
> working through all of them and running the full test suite.
> We expect to post v3 within the next two days.
Alright, sounds good.
Thanks,
Gabriele
^ permalink raw reply
* [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-22 2:50 UTC (permalink / raw)
To: LKML, Linux trace kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
From: Steven Rostedt <rostedt@goodmis.org>
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.
But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118
And then use the raw numbers to dereference:
# echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.
# echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
# echo 1 > events/eprobes/xmit/enable
# cat trace
[..]
sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]
Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://patch.msgid.link/20260519130144.40e71a00@fedora
- Restructure parse_btf_arg() to only use typecast when TEVENT flag is set
(eprobes).
- Move found_type label to still test for (!type) in case ctx->last_struct
is NULL for some reason.
- Restructure the code to be more specific to when to use ctx->struct_btf
and use ctx->btf for places that the typecast will not be used.
- Move parse_trace_event() function outside of BTF #ifdef as it is used
outside of it as well.
- Move the code in the '(' switch statement into a helper function called
handle_typecast(). This will be a stub when BTF is not configured in.
- Have the ctx->last_struct and ctx->struct_btf only set within the
handle_typecast() function, and have that function reset it too.
- Add comments to the #else and #endif of the #ifdef
CONFIG_PROBE_EVENTS_BTF_ARGS to know what they are for.
Documentation/trace/eprobetrace.rst | 4 +
kernel/trace/trace_probe.c | 171 +++++++++++++++++++++++-----
kernel/trace/trace_probe.h | 7 +-
3 files changed, 152 insertions(+), 30 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and "bitfield" are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER. Note that when this is used, the FIELD name does not
+ need to be prefixed with a '$'.
Types
-----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..4a205cebda60 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,25 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
return -ENOENT;
}
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+ struct traceprobe_parse_context *ctx)
+{
+ int ret;
+
+ if (code->data)
+ return -EFAULT;
+ ret = parse_trace_event_arg(arg, code, ctx);
+ if (!ret)
+ return 0;
+ if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+ code->op = FETCH_OP_COMM;
+ return 0;
+ }
+ /* backward compatibility */
+ ctx->offset = 0;
+ return -EINVAL;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +395,17 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
&& BTF_INT_BITS(intdata) == 8;
}
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+ return ctx->flags & TPARG_FL_TYPECAST ?
+ ctx->struct_btf : ctx->btf;
+}
+
static int check_prepare_btf_string_fetch(char *typename,
struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
if (!btf || !ctx->last_type)
return 0;
@@ -554,22 +579,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
+ bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+ struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
s32 tid;
do {
- /* Outer loop for solving arrow operator ('->') */
- if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
- trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
- return -EINVAL;
- }
- /* Convert a struct pointer type to a struct type */
- type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
+ if (!is_struct) {
+ /* Outer loop for solving arrow operator ('->') */
+ if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ /* Convert a struct pointer type to a struct type */
+ type = btf_type_skip_modifiers(btf, type->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
}
+ /* Only the first type can skip being a pointer */
+ is_struct = false;
bitoffs = 0;
do {
@@ -580,7 +612,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return is_ptr;
anon_offs = 0;
- field = btf_find_struct_member(ctx->btf, type, fieldname,
+ field = btf_find_struct_member(btf, type, fieldname,
&anon_offs);
if (IS_ERR(field)) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +634,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
ctx->last_bitsize = 0;
}
- type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -627,6 +659,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return 0;
}
+
static int __store_entry_arg(struct trace_probe *tp, int argnum);
static int parse_btf_arg(char *varname,
@@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (WARN_ON_ONCE(!ctx->funcname))
+ if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +686,20 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (ctx->flags & TPARG_FL_TEVENT) {
+ int ret;
+
+ ret = parse_trace_event(varname, code, ctx);
+ if (ret < 0)
+ return ret;
+
+ if (ctx->flags & TPARG_FL_TYPECAST) {
+ type = ctx->last_struct;
+ goto found_type;
+ }
+ return 0;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void */
@@ -709,6 +756,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -727,7 +775,7 @@ static int parse_btf_arg(char *varname,
static const struct fetch_type *find_fetch_type_from_btf_type(
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
const char *typestr = NULL;
if (btf && ctx->last_type)
@@ -758,7 +806,70 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
return 0;
}
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+ int id;
+
+ if (!ctx->struct_btf) {
+ struct btf *btf;
+
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
+ } else {
+ id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
+ if (id < 0)
+ return id;
+ }
+
+ ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+ return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ char *tmp;
+ int ret;
+
+ /* Currently this only works for eprobes */
+ if (!(ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+ return -EINVAL;
+ }
+
+ tmp = strchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = query_btf_struct(arg + 1, ctx);
+ *tmp = ')';
+
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ ret = -EINVAL;
+ goto out_put;
+ }
+
+ ctx->flags |= TPARG_FL_TYPECAST;
+ tmp++;
+
+ ctx->offset += tmp - arg;
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->flags &= ~TPARG_FL_TYPECAST;
+ ctx->last_struct = NULL;
+out_put:
+ btf_put(ctx->struct_btf);
+ return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -794,7 +905,15 @@ static int check_prepare_btf_string_fetch(char *typename,
return 0;
}
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
@@ -953,18 +1072,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
int len;
if (ctx->flags & TPARG_FL_TEVENT) {
- if (code->data)
- return -EFAULT;
- ret = parse_trace_event_arg(arg, code, ctx);
- if (!ret)
- return 0;
- if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
- code->op = FETCH_OP_COMM;
- return 0;
- }
- /* backward compatibility */
- ctx->offset = 0;
- goto inval;
+ if (parse_trace_event(arg, code, ctx) < 0)
+ goto inval;
+ return 0;
}
if (str_has_prefix(arg, "retval")) {
@@ -1231,6 +1341,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
code->op = FETCH_OP_IMM;
}
break;
+ case '(':
+ ret = handle_typecast(arg, pcode, end, ctx);
+ break;
default:
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..952e3d7582b8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,6 +394,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
* TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
* TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
* TPARG_FL_KERNEL.
+ * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
*/
#define TPARG_FL_RETURN BIT(0)
#define TPARG_FL_KERNEL BIT(1)
@@ -402,6 +403,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
#define TPARG_FL_USER BIT(4)
#define TPARG_FL_FPROBE BIT(5)
#define TPARG_FL_TPOINT BIT(6)
+#define TPARG_FL_TYPECAST BIT(7)
#define TPARG_FL_LOC_MASK GENMASK(4, 0)
static inline bool tparg_is_function_entry(unsigned int flags)
@@ -422,7 +424,9 @@ struct traceprobe_parse_context {
const struct btf_param *params; /* Parameter of the function */
s32 nr_params; /* The number of the parameters */
struct btf *btf; /* The BTF to be used */
+ struct btf *struct_btf; /* The BTF to be used for structs */
const struct btf_type *last_type; /* Saved type */
+ const struct btf_type *last_struct; /* Saved structure */
u32 last_bitoffs; /* Saved bitoffs */
u32 last_bitsize; /* Saved bitsize */
struct trace_probe *tp;
@@ -563,7 +567,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
- C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
+ C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
#undef C
#define C(a, b) TP_ERR_##a
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Crystal Wood @ 2026-05-21 23:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
Wander Lairson Costa, sashiko-bot, sashiko-reviews
In-Reply-To: <20260520162810.3112a9d6@gandalf.local.home>
On Wed, 2026-05-20 at 16:28 -0400, Steven Rostedt wrote:
> [ Replying to Sashiko: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260511223035.1475676-1-crwood%40redhat.com ]
>
> > commit 859dc1eded9e36304f27e72643f28f30f327f078
> > Author: Crystal Wood <crwood@redhat.com>
> > tracing/osnoise: Array printk init and cleanup
> > This commit initializes the trace array buffer for osnoise instances on
> > creation. It also introduces the osnoise_print() helper function and switches
> > the tracer to use trace_array_printk() instead of trace_array_printk_buf().
> > > Use trace_array_printk() instead of trace_array_printk_buf(), as we're
> > > only writing to the main buffer (of a non-main instance) anyway -- and
>
> > Is this description accurate? Can the osnoise tracer also be enabled on the
> > global trace array, for example by writing to current_tracer?
> > If it is used on the global instance, trace_array_init_printk() explicitly
> > returns -EINVAL and refuses to allocate a buffer, and trace_array_printk()
> > will return 0 without printing.
>
> Hmm, I guess it would drop any of these prints if it's using the top level
> trace array. But I believe this is for use with RTLA which always creates a
> trace instance.
It doesn't *have* to be used with rtla, but you won't get the printks if
you use the main instance. The parenthetical was just meant to avoid
potential misunderstanding of the phrase "main buffer".
> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index 75678053b21c5..2be188768ab42 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -83,6 +83,22 @@ struct osnoise_instance {
> > >
> > > static struct list_head osnoise_instances;
> > >
> > > +static void osnoise_print(const char *fmt, ...)
> > > +{
> > > + struct osnoise_instance *inst;
> > > + struct trace_array *tr;
> > > + va_list ap;
> > > +
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > > + tr = inst->tr;
> > > + va_start(ap, fmt);
> > > + trace_array_vprintk(tr, _RET_IP_, fmt, ap);
>
> > Does this code create a use-after-free on the trace array if an instance is
> > removed concurrently?
If so, it was already an issue with osnoise_taint(),
osnoise_stop_tracing(), osnoise_stop_exception(), etc. Wouldn't be
surprising, as this file has a number of other synchronization issues as
well.
> > When a user deletes a trace instance via rmdir, the unregister function
> > removes the instance from the list using list_del_rcu(). However, the removal
> > routine does not appear to wait for an RCU grace period before freeing the
> > trace array itself.
> > Could a concurrent execution of this loop inside the rcu_read_lock() section
> > still access the unlinked instance, read the freed inst->tr, and pass it to
> > trace_array_vprintk()? This appears to be an existing issue, but it still
> > affects the loop here.
>
> Hmm, this is interesting. osnoise keeps track of its own instances via a
> osnoise_instances list. But it only use kfree_rcu() to free the list
> descriptor but doesn't take care of the tr being freed before hand!
>
> Something like this could work [not even compiled]
>
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 75678053b21c..bda1e0e0d2e1 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -476,8 +476,11 @@ static void print_osnoise_headers(struct seq_file *s)
> \
> rcu_read_lock(); \
> list_for_each_entry_rcu(inst, &osnoise_instances, list) { \
> + if (trace_array_get(inst->tr) < 0) \
> + continue; \
> buffer = inst->tr->array_buffer.buffer; \
> trace_array_printk_buf(buffer, _THIS_IP_, msg); \
> + trace_array_put(inst->tr); \
> } \
> rcu_read_unlock(); \
> osnoise_data.tainted = true; \
OK, I'll prepare a v3.
> > > @@ -1189,10 +1198,10 @@ static __always_inline void osnoise_stop_exception(char *msg, int cpu)
> > > rcu_read_lock();
> > > list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > > tr = inst->tr;
> > > - trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
> > > - "stop tracing hit on cpu %d due to exception: %s\n",
> > > - smp_processor_id(),
> > > - msg);
> > > + trace_array_printk(tr, _THIS_IP_,
> > > + "stop tracing hit on cpu %d due to exception: %s\n",
> > > + smp_processor_id(),
> > > + msg);
>
> > Does this switch to trace_array_printk() silently drop messages when tracing
> > on the global instance?
> > Looking at trace_array_printk(), it explicitly returns 0 if the
> > TRACE_ARRAY_FL_GLOBAL flag is set on the trace array:
> > kernel/trace/trace_printk.c:trace_array_printk() {
> > ...
> > if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> > return 0;
> > ...
> > }
> > This means stop and exception logs would be discarded if the tracer is
> > running on the global instance.
>
> Yep. But maybe that's not an issue?
I'd actually consider it a fix, if the policy is actually about not
allowing tracers to "spam" the main instance, rather than just avoiding
the percpu allocation. Especially for osnoise_stop_exception(), which
is called only one place, that already printed the same message with
osnoise_taint(). :-P
As I mentioned in the v1 patch, if trace_array_printk_buf() is going to
bypass the global instance check, should probably be internal to the
core trace code.
-Crystal
^ permalink raw reply
* [PATCH] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-21 22:35 UTC (permalink / raw)
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
unsigned long sframe_start;
unsigned long sframe_size;
unsigned long text_start;
unsigned long text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - the text section the sframe represents
test_size - the length of the section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Based on top of Jens patches here:
https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
[ Note, I tested this with the same program from the RFC patch ]
Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
- Remove the ioctl() like system call for a unique system call for each
functionality. Right now there's two functionalities:
1. register sframe section
2. unregister sframe sections
- Added taking a lock around the mtree logic in __sframe_remove_section()
as Sashiko mentioned that there could be races from user space
registering and unregistering sframe sections at the same time.
- Removed [RFC] from subject as I believe this is more likely the way
this system call will be done.
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 63 ++++++++++++++++++++-
scripts/syscall.tbl | 2 +
22 files changed, 118 insertions(+), 4 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register' sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..992ccc401c5e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(void *data, unsigned int size);
+asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..137a2ebf91f4
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ unsigned long sframe_start;
+ unsigned long sframe_size;
+ unsigned long text_start;
+ unsigned long text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..9956f1e3aba1 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_read_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
^ permalink raw reply related
* Re: [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Steven Rostedt @ 2026-05-21 22:16 UTC (permalink / raw)
To: Yash Suthar
Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
skhan
In-Reply-To: <20260521192846.8306-1-yashsuthar983@gmail.com>
On Fri, 22 May 2026 00:58:46 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:
> Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
> lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
> uprobe_buffer_disable().
>
> BUG_ON() will crash the kernel. mutex_is_locked() only checks
> if any task holds lock,but not the caller task. lockdep_assert_held()
> also check current task for lock and no crash on true condition.
>
> Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
This looks good to me.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Masami, do you want to take this?
-- Steve
> ---
> kernel/trace/trace_uprobe.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 2cabf8a23ec5..aee0960d0cf7 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
> {
> int ret = 0;
>
> - BUG_ON(!mutex_is_locked(&event_mutex));
> + lockdep_assert_held(&event_mutex);
>
> if (uprobe_buffer_refcnt++ == 0) {
> ret = uprobe_buffer_init();
> @@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
> {
> int cpu;
>
> - BUG_ON(!mutex_is_locked(&event_mutex));
> + lockdep_assert_held(&event_mutex);
>
> if (--uprobe_buffer_refcnt == 0) {
> for_each_possible_cpu(cpu)
^ permalink raw reply
* Re: [PATCH v6 21/43] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-05-21 21:27 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag8G7Wq5PbEdKloG@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Thu, May 21, 2026, Fuad Tabba wrote:
>> Hi,
>>
>> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Michael Roth <michael.roth@amd.com>
>> >
>> > For vm_memory_attributes=1, in-place conversion/population is not
>> > supported, so the initial contents necessarily must need to come
>> > from a separate src address, which is enforced by the current
>> > implementation. However, for vm_memory_attributes=0, it is possible for
>> > guest memory to be initialized directly from userspace by mmap()'ing the
>> > guest_memfd and writing to it while the corresponding GPA ranges are in
>> > a 'shared' state before converting them to the 'private' state expected
>> > by KVM_SEV_SNP_LAUNCH_UPDATE.
>> >
>> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>> > copy in data from a separate memory location. Continue to enforce
>> > non-NULL for the original vm_memory_attributes=1 case.
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> > [Added src_page check in error handling path when the firmware command fails]
>> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> I'm not very familiar with the SEV-SNP populate flows, but it looks
>> like Sashiko is on to something:
>> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=21
>>
>> - a potential read-only page overwrite, because src_page is acquired
>> via get_user_pages_fast() without the FOLL_WRITE flag, but is then
>> overwritten via memcpy
>
> Oof, yeah, that's bad. Adding FOLL_WRITE to kvm_gmem_populate() feels wrong, and
> could break uABI, but doing gup() in SNP code would reintroduce the AB-BA issue
> with filemap_invalidate_lock().
>
> Aha! Not if we use get_user_page_fast_only(). Ugh, but then we'd have to plumb
> the userspace address into the post-populated callback.
>
> Hrm. Given that no one has yelled about overwriting their CPUID page, and given
> that the CPUID page is likely dynamically created and thus is unlikely to be a
> read-only mapping (e.g. versus the initial image), maybe this?
>
Overwriting the CPUID page is by design, I think. IIUC if the SNP
firmware doesn't like something about the CPUID page, it can update
src_page and then return an error to userspace.
Userspace should then check if it agrees with the updated CPUID contents
and then retry if it agrees.
> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 37d4cfa5d980..c73c028d72c1 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -2456,6 +2456,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> sev_populate_args.type = params.type;
>
> count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
> + params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID,
I think this makes sense given that writing to src_page can only happen
when params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID (this is explicitly one
of the guards in sev_gmem_post_populate()):
/*
* If the firmware command failed handle the reclaim and cleanup of that
* PFN before reporting an error.
*
* Additionally, when invalid CPUID function entries are detected,
* firmware writes the expected values into the page and leaves it
* unencrypted so it can be used for debugging and error-reporting.
*
* Copy this page back into the source buffer so userspace can use this
* information to provide information on which CPUID leaves/fields
* failed CPUID validation.
*/
if (ret && !snp_page_reclaim(kvm, pfn) &&
sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
void *src_vaddr = kmap_local_page(src_page);
void *dst_vaddr = kmap_local_pfn(pfn);
memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);
kunmap_local(src_vaddr);
kunmap_local(dst_vaddr);
}
> sev_gmem_post_populate, &sev_populate_args);
> if (count < 0) {
> argp->error = sev_populate_args.fw_error;
> diff --git arch/x86/kvm/vmx/tdx.c arch/x86/kvm/vmx/tdx.c
> index f97bcf580e6d..33f35be4455b 100644
> --- arch/x86/kvm/vmx/tdx.c
> +++ arch/x86/kvm/vmx/tdx.c
> @@ -3188,7 +3188,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> };
> gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> u64_to_user_ptr(region.source_addr),
> - 1, tdx_gmem_post_populate, &arg);
> + 1, false, tdx_gmem_post_populate, &arg);
And TDX doesn't try to write src_page, so this is good too.
> if (gmem_ret < 0) {
> ret = gmem_ret;
> break;
> diff --git include/linux/kvm_host.h include/linux/kvm_host.h
> index 61a3430957f2..b83cda2870ba 100644
> --- include/linux/kvm_host.h
> +++ include/linux/kvm_host.h
> @@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> struct page *page, void *opaque);
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> + long npages, bool writable,
What do you think of need_writable_src instead of just writable for the
variable name?
> kvm_gmem_populate_cb post_populate, void *opaque);
> #endif
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index a35a55571a2d..6553d4e032ce 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> return ret;
> }
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> + long npages, bool writable,
> kvm_gmem_populate_cb post_populate, void *opaque)
> {
> struct kvm_memory_slot *slot;
> @@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>
> if (src) {
> unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
> + unsigned int flags = writable ? FOLL_WRITE : 0;
How about using FOLL_WRITE | FOLL_NOFAULT so if it weren't writable to
start with, don't CoW, just error out?
Like you said above the CPUID page provided as src_page would have been
written to before, so it should have been mapped as writable.
>
> - ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
> + ret = get_user_pages_fast(uaddr, 1, flags, &src_page);
If we stick with FOLL_WRITE, this also solves the case where a read-only
mapping or global zero page are provided as src_page, since
get_user_pages_fast() will do a copy-on-write if those were the inputs,
making it writable before the write happens (on failure) in
sev_gmem_post_populate().
> if (ret < 0)
> break;
> if (ret != 1) {
>
>> - an ordering violation with the kunmap_local() calls
>
> Yeesh, that's a new one for me. Thankfully this is 64-bit only, so it's not an
> issue.
>
>> These predate this patch series and are just being touched by the
>> 'src_page' addition, but if Sashiko's right, these should probably be
>> fixed sooner rather than later.
>
> Yeah, ditto with the offset wrapping case.
^ permalink raw reply
* [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Yash Suthar @ 2026-05-21 19:28 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel, skhan,
Yash Suthar
Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
uprobe_buffer_disable().
BUG_ON() will crash the kernel. mutex_is_locked() only checks
if any task holds lock,but not the caller task. lockdep_assert_held()
also check current task for lock and no crash on true condition.
Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
---
kernel/trace/trace_uprobe.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2cabf8a23ec5..aee0960d0cf7 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
{
int ret = 0;
- BUG_ON(!mutex_is_locked(&event_mutex));
+ lockdep_assert_held(&event_mutex);
if (uprobe_buffer_refcnt++ == 0) {
ret = uprobe_buffer_init();
@@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
{
int cpu;
- BUG_ON(!mutex_is_locked(&event_mutex));
+ lockdep_assert_held(&event_mutex);
if (--uprobe_buffer_refcnt == 0) {
for_each_possible_cpu(cpu)
--
2.43.0
^ permalink raw reply related
* Re: [PATCH 0/3] Fix out-of-tree build of some tools
From: Ian Rogers @ 2026-05-21 17:50 UTC (permalink / raw)
To: Ben Hutchings
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar, linux-perf-users,
linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
On Thu, May 21, 2026 at 7:57 AM Ben Hutchings <benh@debian.org> wrote:
>
> perf and rtla currently don't fully support out-of-tree builds, as
> they may still create files in their source directory. This series
> fixes all the instances of this problem that I have found.
>
> Ben.
Thanks for the cleanup! I ran Sashiko locally and it noted that we
should also fix the clean steps to remove these files (I'll copy the
full text below).
Reviewed-by: Ian Rogers <irogers@google.com>
```
commit cf212be834ce6b0df5cccdbc4ad977a917ff8e48
Author: Ben Hutchings <benh@debian.org>
perf tools: Put Python egg info in output directory
Installing the Python extension currently creates an egg-info
directory in the source tree, preventing a fully out-of-tree build.
Add the necessary runes to the setup.py comamnd line to relocate the
egg-info directory in an out-of-tree build.
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index fc92d6ceac5b..b9a9bdddb89a 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -905,7 +910,9 @@ install-bin: install-tools install-tests
> install: install-bin try-install-man
>
> install-python_ext:
> - $(PYTHON_WORD) util/setup.py --quiet install
--root='/$(DESTDIR_SQ)'
> + $(PYTHON_WORD) util/setup.py --quiet \
> + $(if $(OUTPUT),egg_info --egg-base $(OUTPUT)) \
> + install --root='/$(DESTDIR_SQ)'
This relocates the egg-info directory to the output directory, which
helps keep the source tree clean. However, does the `python-clean`
target need to be updated to clean up this directory?
Currently, `python-clean` is defined as:
python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
Should we also clean `$(OUTPUT)/perf.egg-info` (or `perf.egg-info` for
in-tree builds) to ensure `make clean` removes all generated python
metadata? For example:
python-clean := $(call QUIET_CLEAN, python) \
$(RM) -r $(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so \
$(OUTPUT)perf.egg-info perf.egg-info
commit dd754ea75966ec7ade41500afeb32d944f20a8f9
Author: Ben Hutchings <benh@debian.org>
perf tools: Put Python bytecode in output directory
The PMU events are processed into C sources by Python scripts,
which
normally results in writing bytecode for each module into the
source
tree. This prevents a fully out-of-tree build.
To fix this, set $PYTHONPYCACHEPREFIX to relocate the bytecode cache
directory in an out-of-tree build.
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index fc92d6ceac5b..b9a9bdddb89a 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -400,6 +400,11 @@ PYTHON_EXTBUILD_LIB :=
$(PYTHON_EXTBUILD)lib/
> PYTHON_EXTBUILD_TMP := $(PYTHON_EXTBUILD)tmp/
> export PYTHON_EXTBUILD_LIB PYTHON_EXTBUILD_TMP
>
> +# Put Python bytecode in output directory
> +ifdef OUTPUT
> +export PYTHONPYCACHEPREFIX := $(OUTPUT)/__pycache__
> +endif
> +
> python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
Similar to the egg-info relocation, does `python-clean` need to
clean up
the relocated `__pycache__` directory under `$(OUTPUT)`?
If `PYTHONPYCACHEPREFIX` is exported to `$(OUTPUT)/__pycache__`, running
`make clean` currently leaves this directory behind. Should we add it to
the `RM` list in `python-clean`? For example:
python-clean := $(call QUIET_CLEAN, python) \
$(RM) -r $(PYTHON_EXTBUILD)
$(OUTPUT)python/perf*.so \
$(if $(OUTPUT),$(OUTPUT)/__pycache__)
```
Thanks,
Ian
> Ben Hutchings (3):
> rtla: Fix output files in source tree
> perf tools: Put Python egg info in output directory
> perf tools: Put Python bytecode in output directory
>
> tools/perf/Makefile.perf | 9 ++++++++-
> tools/tracing/rtla/Makefile | 31 ++++++++++++++++++-----------
> tools/tracing/rtla/tests/timerlat.t | 4 ++--
> 3 files changed, 29 insertions(+), 15 deletions(-)
>
^ permalink raw reply
* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Wen Yang @ 2026-05-21 17:40 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <5183dc18d63b617ab0f19290e8a790fa6898f372.camel@redhat.com>
On 5/19/26 19:14, Gabriele Monaco wrote:
> Hi Wen,
>
> On Mon, 2026-05-18 at 01:13 +0800, Wen Yang wrote:
>>
>> Yes. The ftracetest check_requires logic calls `command -v <binary>` to
>> satisfy `requires: <name>:program` directives. Without the script's
>> directory in PATH those checks evaluate to exit_unsupported and test cases
>> are skipped rather than run. The make path avoids this only because make
>> sets OUTDIR and the runner appends it to PATH internally.
>>
>
> So you're overriding PATH so the selftest's binaries can be found from
> the test, right?
>
> Wouldn't it be simpler to just put the absolute paths in the tests and
> don't touch PATH.
>
> If the selftests are run via makefile, it ensures the required binaries
> are built and available, so there's no need to go through the `requires:
> <name>:program` infrastructure (that's more about what's installed on
> the system.
>
> Or if you don't want anything hardcoded, you could pass the $OUTDIR from
> the environment and use that in scripts, whatever looks cleaner.
>
> Does it make sense to you?
>
>>
>> -- Patch 04: pre-allocated storage pool
>>
>> > Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
>>
>> User-mode uprobe handlers run with preempt_count == 0, fully preemptible on
>> both PREEMPT_RT and non-PREEMPT_RT. The strongest evidence is in tlob
>> itself: tlob_start_task() takes a mutex and calls kmem_cache_zalloc(...,
>> GFP_KERNEL) from the uprobe entry handler; both are illegal in atomic
>> context and would trigger lockdep splats immediately.
>>
>> On PREEMPT_RT, spinlock_t becoming a sleeping lock in the uprobe handler
>> iscfine: both call sites (da_create_or_get_pool() from the handler and
>> da_pool_return_cb() from the rcuc kthread) are in sleepable task context.
>>
>
> Yeah exactly, the uprobe is fine with anything (even the automatic
> `kmalloc_nolock`), but sure preallocation at least guarantees the slots
> are there.
>
>> > We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
>> > DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
>> > DA_MON_POOL_SIZE to be defined (force that with an #error).
>> >
>> > Anyway, this way you probably wouldn't need to define a different init
>> > function and let everything handled more transparently.
>> >
>> > Also you don't need to call da_create_or_get() explicitly,
>> > da_handle_start_event() should do it for you.
>>
>> Agreed on all counts. We plan to implement this in v3 as follows.
>> The three strategies would be a compile-time selection in da_monitor.h:
>>
>> DA_ALLOC_AUTO (default) - lock-free kmalloc_nolock on the hot path;
>> unbounded capacity.
>>
>> DA_ALLOC_POOL - pre-allocated fixed-size pool;
>> DA_MON_POOL_SIZE
>> required, enforced with #error if missing.
>>
>> DA_ALLOC_MANUAL - caller pre-inserts storage via
>> da_create_empty_storage() before the first
>> da_handle_start_event(); the framework only
>> links the target field.
>>
>> da_monitor_init_prealloc() would be removed; da_monitor_init() would
>> select pool or kmalloc initialisation internally based on the strategy.
>>
>> da_handle_start_event() and da_handle_start_run_event() would both call
>> da_prepare_storage() at compile time:
>>
>> DA_ALLOC_AUTO -> da_create_storage() (kmalloc_nolock)
>> DA_ALLOC_POOL -> da_create_or_get_pool()
>> DA_ALLOC_MANUAL -> da_fill_empty_storage() (link target into pre-
>> allocated slot; no allocation on the hot path)
>>
>> No explicit da_create_or_get() call would be needed in any monitor.
>>
>> da_create_or_get_kmalloc() would be removed: as you noted, a caller that
>> uses kmalloc_nolock does so because locking is forbidden; a GFP_KERNEL
>> fallback is equally forbidden if the lockless attempt fails, so the
>> function has no viable use case.
>>
>> tlob would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
>> #define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
>>
>> nomiss would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
>>
>> and call da_create_empty_storage() from handle_sys_enter() (the
>> sched_setscheduler syscall path), which runs in safe task context;
>> da_fill_empty_storage() would then link the sched_dl_entity target on
>> the first da_handle_start_run_event() call in handle_sched_switch().
>
> Yeah good point, there's no need to make it a special path even if we
> have the target ready, da_handle_start_run_event() can do it just fine.
>
>>
>>
>> -- Patch 05: generic uprobe infrastructure
>>
>> Carried unchanged into v3 (as part of the 08-b split described below).
>>
>>
>> -- Patch 06: rvgen __init arrow reset
>>
>> Thanks, carried unchanged into v3.
>>
>
> Well, if you don't need reset() on the __init arrow we can drop this,
> right? Also it doesn't seem fully wired with the rest and requires a
> separate event to do handle_monitor_start(), which can be only just
> another handler for tlob, nothing general.
>
>>
>> > Why don't you make it a separate event (e.g. "start_tlob") [...] then
>> > you also wouldn't need to call reset() and start_timer() manually.
>>
>> Good suggestion. We plan to use a dedicated start_tlob event instead,
>> with a self-loop in tlob.dot:
>>
>> "running" -> "running" [ label = "start;reset(clk_elapsed)" ]
>>
>> da_handle_start_run_event(task->pid, ws, start_tlob) would put the
>> monitor into running and deliver start_tlob, which resets clk_elapsed
>> and arms the budget hrtimer via the generated ha_setup_invariants() —
>> no manual reset() or start_timer() calls needed.
>>
>> One guard would be added in tlob's ha_setup_invariants() to make the
>> self-loop work correctly:
>>
>> if (next_state == curr_state && event != start_tlob)
>> return;
>>
>> Without this, the start_tlob self-loop would be treated the same as any
>> repeated switch_in (already running) and ha_setup_invariants() would
>> return early, leaving the timer unarmed. Does this look right to you?
>>
>
> If you just add a separate event rvgen should take care of everything,
> you should be able to take ha_verify_constraint() and friends as-is from
> the generated code.
>
> But yeah, that's what it would end up doing.
>
>>
>> -- Patch 08: tlob monitor
>>
>> --- Patch structure ---
>>
>> > Could you have everything that isn't strictly tlob-related in another
>> > patch.
>>
>> Agreed. With the ioctl interface deferred (see below), v3 would keep
>> patch 08 as the tlob monitor only:
>>
>> 05-b: rv: extend uprobe API with three-phase detach helpers
>> (rv_uprobe.c, rv_uprobe.h, rv_uprobe_detach refactoring)
>> — extension of patch 05, independent of tlob
>>
>> 08: rv/tlob: add the tlob monitor itself
>> (tlob.c, tlob.h, tlob_trace.h, Kconfig/Makefile, Documentation,
>> rv_trace.h include; ha_monitor.h EVENT_NONE_LBL override
>> bundled here as it is only needed by tlob)
>>
>> The chardev infrastructure (rv_chardev.c, rv.h additions) and the UAPI
>> header (include/uapi/linux/rv.h) would move to a follow-up series
>> together with the ioctl self-instrumentation feature.
>>
>> --- ioctl interface design ---
>>
>> > I'm not particularly fond of ioctls, they aren't that flexible and in
>> > this way I don't really see an added value.
>> > [...] cannot the same thing be achieved using uprobes alone, e.g. by
>> > registering a function address or the current instruction pointer?
>> > [...] wouldn't a sysfs/tracefs file achieve a similar purpose without
>> > much of the boilerplate code?
>>
>> Fair point. We plan to ship v3 with the tracefs/uprobe interface only
>> and defer the ioctl (/dev/rv chardev) to a follow-up series once there
>> is a concrete in-tree user that requires it.
>>
>> The unique value of the ioctl is that TLOB_IOCTL_TRACE_STOP returns a
>> synchronous per-call result (-EOVERFLOW or 0) to the calling thread,
>> which neither uprobes nor tracefs writes can provide. We want to keep
>> that option open for later, but agree it should not block the initial
>> tlob submission.
>>
>> Does this approach work for you?
>>
>> What is your preference?
>
> Yeah looks good to me.
> Ioctls are cumbersome to set up also for the user, perhaps another sysfs
> file in the monitor directory would keep the control entirely in tlob.c
> and give you roughly the same value with easier setup.
>
> Heck we might even think of an RV reactor that does that: e.g. creates a
> file where reads sleep until the first reaction (-EOVERFLOW) and returns
> 0 in other scenarios. I'm gonna have a thought on that, but anyway I
> don't see why a sysfs file cannot do this.
>
> Let's defer it for now.
>
>>
>> --- Handler simplification ---
>>
>> > Perhaps keep the handler simpler by moving this reporting to a helper
>> > function and use guard(rcu)() there.
>>
>> Done. The accumulation logic is extracted into three inline helpers, each
>> using scoped_guard(rcu) and returning bool (true if the task is monitored):
>>
>> tlob_acc_running(prev) - accumulate running_ns on sched-out
>> tlob_acc_waiting(next) - accumulate waiting_ns on sched-in
>> tlob_acc_sleeping(task) - accumulate sleeping_ns on wakeup
>>
>> handle_sched_switch() and handle_sched_wakeup() become one-liners:
>>
>> static void handle_sched_switch(...)
>> {
>> bool prev_preempted = (prev_state == 0);
>>
>> if (tlob_acc_running(prev))
>> da_handle_event(prev->pid, NULL,
>> prev_preempted ? preempt_tlob : sleep_tlob);
>> if (tlob_acc_waiting(next))
>> da_handle_event(next->pid, NULL, switch_in_tlob);
>> }
>
> Yeah sounds good.
>
>>
>> > You probably don't need these. da_handle_event should skip tasks without
>> > a monitor.
>>
>> Agreed; the do_prev/do_next flags are gone. The helpers return false
>> for unmonitored tasks, and da_handle_event() skips them too — both paths
>> are no-ops for tasks with no pool entry.
>>
>> --- scoped_guard(rcu) ---
>>
>> > That should be a scoped_guard(rcu), definitely use guards if you have
>> > return paths, the compiler is going to clean up (unlock) for you.
>>
>> Applied to all RCU-protected sections in tlob_start_task() and
>> tlob_stop_task(). tlob_start_task() now uses guard(mutex) for the
>> serialised duplicate-check (replacing the explicit mutex_lock/unlock),
>> and tlob_stop_task() uses scoped_guard(rcu) for the atomic CAS section:
>>
>> scoped_guard(rcu) {
>> ws = da_get_target_by_id(task->pid);
>> if (!ws)
>> return -ESRCH;
>> ...
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return -EAGAIN;
>> }
>
> Perfect.
>
>>
>> --- tlob_stop_all removal ---
>>
>> > All this function does should be done by da_monitor_destroy. We could
>> > add a way to pass some additional deallocation for all the other cleanup
>> > you're doing on each storage. Something like a da_extra_cleanup() you
>> > can define as whatever you need and gets called in all per-obj
>> > destruction paths.
>>
>> Agreed. tlob_stop_all() (~50 lines) has been removed entirely.
>>
>> A da_extra_cleanup() hook macro is introduced in da_monitor.h: the default
>> is a no-op; a monitor may override it before including the header. tlob
>> defines:
>>
>> static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
>> {
>> struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>> struct tlob_task_state *ws = da_get_target(ha_mon);
>>
>> if (!ws)
>> return;
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return;
>
>> ha_cancel_timer_sync(ha_mon);
> After my patch making timer callbacks RCU read-side critical section,
> you won't need that, just let the usual reset asynchronously stop the
> timer and put everything that needs it stopped in your RCU callback.
>
> Of course make sure the timer was stopped before this extra cleanup, so
> put the macro accordingly.
>
> I don't think da_extra_cleanup in general should be expected to sleep
> and call_rcu should do the heavy lifting (it may run from any tracepoint).
>
> Anyway we can see it later after that's merged.
>
>> atomic_dec(&tlob_num_monitored);
>> put_task_struct(ws->task);
>> call_rcu(&ws->rcu, tlob_free_rcu);
>> }
>> #define da_extra_cleanup tlob_extra_cleanup
>>
>> da_monitor_destroy() iterates remaining entries via da_extra_cleanup +
>> hash_del_rcu + call_rcu, then waits for all callbacks via rcu_barrier().
>> tlob's disable path is now simply:
>>
>> static void __tlob_destroy_monitor(void)
>> {
>> da_monitor_destroy();
>> }
>
> Looks good, let's see the full picture.
>
>> --- EVENT_NONE_LBL ---
>>
>> > Why don't you just override EVENT_NONE_LBL (and if you prefer call it
>> > MONITOR_TIMER_EVENT_NAME) without the need for another function?
>>
>> Done. model_get_timer_event_name() has been removed from automata.h.
>> In ha_monitor.h, EVENT_NONE_LBL is now overridable:
>>
>> #ifndef EVENT_NONE_LBL
>> #define EVENT_NONE_LBL "none"
>> #endif
>>
>> tlob.c defines it before including the model header:
>>
>> #define EVENT_NONE_LBL "budget_exceeded"
>>
>> The two call sites in ha_monitor.h that previously called
>> model_get_timer_event_name() now use EVENT_NONE_LBL directly.
>>
>> --- KUnit config / tristate ---
>>
>> > Do you need to add this here? Since you have a patch adding KUnit tests
>> > to tlob, cannot you put everything kunit-related there?
>> > I couldn't build it as module.
>>
>> Agreed on moving the Kconfig entry to patch 09.
>>
>> The module build issue is fixed by exporting the symbols needed by the
>> test via EXPORT_SYMBOL_IF_KUNIT (EXPORTED_FOR_KUNIT_TESTING namespace);
>> tlob_kunit.c imports the namespace with MODULE_IMPORT_NS. We plan to
>> keep tristate rather than changing to bool. Does that work for you?
>
> Yeah it's good as long as it works as module too.
>
> I might have a look at making my patch module-ready, for now it just
> can't work but I wonder if we can do something nicer to allow it
> (like in your case a bunch of exports, a separate file and a standalone
> testcase, perhaps all wrapped in some helper).
>
>>
>> --- detail_env_tlob tracepoint ---
>>
>> > Since you are not documenting the detail_env_tlob tracepoint, is it
>> > something really required? I would at the very least document its usage.
>>
>> Fair point. detail_env_tlob emits (running_ns, waiting_ns, sleeping_ns)
>> so the user can see which phase consumed the budget: high sleeping_ns
>> indicates I/O latency, high waiting_ns indicates scheduler pressure, high
>> running_ns indicates a compute overrun. Without this breakdown the user
>> only knows the total elapsed time exceeded the threshold, not why.
>>
>
> Alright, then this can go into the docs.
>
>>
>> --- Documentation ---
>>
>> > This is standard tracepoints usage, there's nothing about tlob we should
>> > document here.
>> > Same here, standard RV [for enable/desc tracefs files].
>> > And this is duplicating what mentioned above about uprobes, isn't it?
>>
>> Agreed. The following have been removed:
>>
>> - "Violation events" section: generic trace-cmd examples and cat-trace
>> instructions (standard tracepoints usage).
>> - tracefs files: "enable (rw)" and "desc (ro)" entries (standard RV).
>> - tracefs files: "monitor (rw)" description condensed to one line with
>> a cross-reference to the uprobes section above.
>>
>> In their place, a new "Violation tracepoints" subsection documents both
>> tlob-specific tracepoints with fields and a worked example:
>>
>> error_env_tlob: id, state, event ("budget_exceeded"), env ("clk_elapsed")
>>
>> detail_env_tlob: id, threshold_us, running_ns, waiting_ns, sleeping_ns
>> Use sleeping_ns to diagnose I/O latency, waiting_ns for scheduler
>> pressure, running_ns for compute overruns.
>>
>> Example:
>> trace-cmd record -e error_env_tlob -e detail_env_tlob &
>> # ... run workload ...
>> trace-cmd report
>
> Yeah sounds good, also pointing out to enable the monitor.
> We might think of a general way to do this kind of thing in
> tools/rv, although detail_env_tlob is non-standard.
>
>> > Is kernel code going to use this API? RV monitors are meant to be
>> > enabled by userspace. What's the use-case here?
>>
>> Agreed. The uprobe interface is driven from userspace; tlob_start_task()
>> and tlob_stop_task() are the internal implementation functions it calls,
>> not a public API for external kernel modules. The hypothetical
>> kernel-module use case would be removed from the documentation; the
>> kernel-doc block is retained for code maintainers.
>>
>> > That's probably a bit too detailed for this page. If you really want
>> > this information somewhere couldn't it stay in the code?
>>
>> Agreed; moved to comments in handle_sched_switch() and
>> handle_sched_wakeup(). The "Limitations" subsection is retained.
>>
>> -- Patch 09: KUnit tests
>>
>> > What caught my eyes are tests enrolling tracepoints handlers. If you
>> > go there you're no longer doing unit testing, what's the advantage of
>> > testing the entire monitor here over doing that in selftests?
>>
>> Agreed. The three suites that register tracepoint handlers or create
>> kthreads (tlob_sched_integration, tlob_trace_output, tlob_violation_react)
>> have been removed from KUnit and will be added to selftests in v3.
>>
>> Two pure unit test suites remain in KUnit:
>>
>> tlob_task_api:
>> Tests tlob_start_task / tlob_stop_task return values (-ENODEV,
>> -EALREADY, -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE) via direct calls
>> (these functions are the internal implementation used by both the
>> uprobe and, in future, the ioctl interface).
>> No tracepoints, no scheduling.
>>
>> tlob_uprobe_format:
>> Tests the uprobe line parser (tlob_parse_uprobe_line,
>> tlob_parse_remove_line) against valid and invalid input strings.
>> Pure string parsing; no scheduling, no tracepoints.
>>
>> This also resolves the tristate-vs-bool issue: with only pure unit tests
>> there is no dependency on sched_setscheduler_nocheck, so bool is correct.
>>
>
> Yeah looks good.
>
>>
>> -- Patch 10: selftests
>>
>> --- PREEMPT_RT RCU stall ---
>>
>> > I run it on a VM and have it hanging at step 9 [...] rcu_preempt stall.
>> > Did you see that? Am I doing something wrong?
>>
>> Thanks for reporting. The patch changed ha_monitor.h from
>> HRTIMER_MODE_REL_HARD (the existing upstream value) to REL_SOFT; the
>> stall appeared on PREEMPT_RT after that change. We have not fully
>> confirmed whether REL_SOFT is the root cause — REL_SOFT defers the
>> callback to the ktimers kthread, which could starve rcu_preempt under
>> certain PREEMPT_RT configurations, but other factors may be involved.
>>
>> We plan to revert to HRTIMER_MODE_REL_HARD at both sites in ha_monitor.h
>> as the conservative choice:
>>
>> ha_setup_timer(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>> ha_start_timer_ns(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>>
>> Do you have more insight into the stall, or does REL_HARD resolve it on
>> your setup?
>
> Right, good point, any specific reason why you wanted REL_SOFT?
>
> I indeed always test under PREEMPT_RT but I still see the same splat
> also after reverting REL_HARD..
> Could you reproduce it on your setup?
>
> My config is nothing special: what vng gives you adding
> PREEMPT_RT/RCU_PREEMPT and lockdep (PROVE_LOCKING/PROVE_RCU).
>
Hi Gabriele,
No specific reason for REL_SOFT — not intentional, reverting to
REL_HARD.
Reproduced the stall on the same config (PREEMPT_RT +
PROVE_LOCKING/PROVE_RCU).
Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc,
unrelated to REL_SOFT/REL_HARD:
# original cleanup — wrong order
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" # (A)
kill "$hog_pid" # (B)
(A)
triggers synchronize_srcu() in the kernel. But tlob_target is stuck
mid-uprobe_notify_resume holding an SRCU read lock, preempted by the
FIFO-99 hog -> so the reader never finishes and (B) is never reached.
rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
Fix: kill the hog first:
kill "$hog_pid"; wait "$hog_pid"
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"
On the PREEMPT_RT it is more reliably triggered there because rcuc/0
runs as a preemptible kthread rather than in softirq, making it easier
for the hog to monopolise the CPU long enough to hit the stall.
Thank you for the thorough review and valuable suggestions. We are
working through all of them and running the full test suite.
We expect to post v3 within the next two days.
--
Best wishes,
Wen
>>
>> --- Selftest structure ---
>>
>> > This should be tested together with the other monitors (enable/disable),
>> > we could at most expand those with the check_requires.
>> > Let's focus on tlob-only features in this patch.
>>
>> Agreed. In v3 we plan to drop tracefs.tc (covered by the generic
>> rv_monitor_enable_disable.tc) and keep only the six uprobe-specific
>> test cases under test.d/tlob/
>>
>> ioctl.tc is deferred with the ioctl interface to the follow-up series.
>> The KUnit integration tests (sched_switch accounting, budget-expiry
>> tracepoint) would be moved to selftests as additional test cases.
>>
>
> Thanks,
> Gabriele
>
^ permalink raw reply
* Re: [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Steven Rostedt @ 2026-05-21 15:23 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, mhiramat, linux-kernel, cmllamas, zhangbo56,
lipengfei28
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
On Thu, 14 May 2026 11:49:13 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
Hi Pengfei,
Can you address the Sashiko reviews:
https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
It has a way to copy the comments. Just reply to this series with a past of
Sashiko's review and reply to them to explain why the comments may not be
an issue, or submit a new version with fixes if they are issues.
Thanks,
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Steven Rostedt @ 2026-05-21 15:16 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel
In-Reply-To: <177937543666.2596845.9748178606108139386.stgit@mhiramat.tok.corp.google.com>
On Thu, 21 May 2026 23:57:16 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> @@ -4804,6 +4806,7 @@ struct trace_mod_entry {
> struct trace_scratch {
> unsigned int clock_id;
> unsigned long text_addr;
> + u8 boot_id[UUID_SIZE];
> unsigned long nr_entries;
> struct trace_mod_entry entries[];
> };
I just don't like wasting scratch space if boot_id isn't defined. But I
can't figure out a way to optionally have it there without wasting space
anyway.
If the get_boot_id() is accepted by the random folks, then I'm fine with
this change.
-- Steve
^ permalink raw reply
* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-21 15:09 UTC (permalink / raw)
To: Lance Yang
Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <f76b79d3-080a-4931-873e-99d4b3e1020f@linux.dev>
On Sat, May 16, 2026 at 12:06:14PM +0800, Lance Yang wrote:
>
>
> On 2026/5/15 21:13, Breno Leitao wrote:
> [...]
> > >
> > > Wonder if it would be simpler to just do a positive check near the top
> > > of get_any_page() instead. Something like:
> > >
> > > static bool hwpoison_unrecoverable_kernel_page(struct page *page,
> > > unsigned long flags)
> >
> > Ack. We probably want to call it something like HWPoisonKernelOwned() to
> > follow the same naming sematics of these helpers, such as HWPoisonHandlable()
> >
> > By the way, I will re-include the self test back to this patch series,
> > In case they are not useful, we do not merge it.
> >
>
> Sounds good :)
>
> Can you also test the relevant page types if possible, especially
> the ones the new helper is supposed to classify?
Ack. I will expand the test to cover different page types as well!
Thanks,
--breno
^ permalink raw reply
* Re: [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Mathieu Desnoyers @ 2026-05-21 14:59 UTC (permalink / raw)
To: Masami Hiramatsu (Google), Theodore Ts'o, Jason A . Donenfeld,
Steven Rostedt
Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
On 2026-05-21 10:56, Masami Hiramatsu (Google) wrote:
> Hi,
>
> Here is an RFC series to expose the boot ID (random UUID for each
> boot) in the last_boot_info of the persistent ring buffer instance.
>
> The persistent ring buffer can hold trace data beyond reboot/crashes.
> This means the recorded data does not always come from the last boot.
> Currently we just assume that the data comes from the last boot.
>
> On the other hand, the kernel provides a random generated UUID for
> each boot time, called "boot ID". If you record the logs with the
> boot ID, it is easy to do cross-referencing it with other logs.
>
> Similarly, recording the Boot ID for persistent ring buffer
> instances would make it easier to determine which boot the read
> data came from.
>
> For example:
>
> # cat /proc/sys/kernel/random/boot_id
> df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
>
> (enable tracing on persistent instance and reboot)
>
> # cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
> # boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
> ffffffff81000000 [kernel]
>
FWIW, we've used boot id to uniquely identify traces belonging
to a given kernel execution and allow validation that traces
can indeed be correlated across CPU and across kernel vs userspace
for years in LTTng.
Good to see this approach proposed for Ftrace as well.
Thanks,
Mathieu
>
> Thank you,
>
> ---
>
> Masami Hiramatsu (Google) (2):
> random: Expose boot ID to other subsystems
> tracing: Record and show boot ID in last_boot_info
>
>
> drivers/char/random.c | 27 +++++++++++++++++++++------
> include/linux/random.h | 9 +++++++++
> kernel/trace/trace.c | 14 ++++++++++++--
> 3 files changed, 42 insertions(+), 8 deletions(-)
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-05-21 14:59 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), Wei Yang, Lance Yang, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcCNT51jeXh6Kwg1QN9e+AJB-1hg21kmeY6fTTKr2GACug@mail.gmail.com>
On Tue, May 19, 2026 at 01:05:13PM -0600, Nico Pache wrote:
> On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > > > For me, I would vote for fallback to 0.
> > >
> > > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > > It's just rather awkward to return an error deep down in collapse code for a
> > > configuration problem.
> > >
> > > For mthp collapse, we only support max_ptes_none==0 and
> > > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> > >
> > > If another value is specified while collapsing mTHP, print a warning and treat
> > > it as 0 (save value, no creep, no memory waste).
> > >
> > > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > > warning, because we would issue a warning with the default settings).
> > >
> > > @Lorenzo, fine with you?
> >
> > Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> > keep our lives simple(-ish) please :)
>
> Ok thank you im glad we finally came to consensus on this! phew!
>
It happens sometimes ;)
Cheers, Lorenzo
^ permalink raw reply
* [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Record the boot ID (random UUID for each boot) when tracing on the
persistent ring buffer and show it in the last_boot_info file.
User can use this boot ID when cross-referencing it with other logs.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thus, if user saves the other logs with this boot_id,
user can easily find the appropriate logs.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/trace.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..c56694bb5e0c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -34,10 +34,12 @@
#include <linux/percpu.h>
#include <linux/splice.h>
#include <linux/kdebug.h>
+#include <linux/random.h>
#include <linux/string.h>
#include <linux/mount.h>
#include <linux/rwsem.h>
#include <linux/slab.h>
+#include <linux/uuid.h>
#include <linux/ctype.h>
#include <linux/init.h>
#include <linux/panic_notifier.h>
@@ -4804,6 +4806,7 @@ struct trace_mod_entry {
struct trace_scratch {
unsigned int clock_id;
unsigned long text_addr;
+ u8 boot_id[UUID_SIZE];
unsigned long nr_entries;
struct trace_mod_entry entries[];
};
@@ -4921,6 +4924,7 @@ static void update_last_data(struct trace_array *tr)
/* Reset the module list and reload them */
if (tr->scratch) {
struct trace_scratch *tscratch = tr->scratch;
+ const u8 *boot_id = get_boot_id();
tscratch->clock_id = tr->clock_id;
memset(tscratch->entries, 0,
@@ -4929,6 +4933,11 @@ static void update_last_data(struct trace_array *tr)
guard(mutex)(&scratch_mutex);
module_for_each_mod(save_mod, tr);
+ /* Also, update boot ID if exists */
+ if (boot_id)
+ memcpy(tscratch->boot_id, boot_id, sizeof(tscratch->boot_id));
+ else
+ memset(tscratch->boot_id, 0, sizeof(tscratch->boot_id));
}
/*
@@ -5822,9 +5831,10 @@ static void show_last_boot_header(struct seq_file *m, struct trace_array *tr)
* Otherwise it shows the KASLR address from the previous boot which
* should not be the same as the current boot.
*/
- if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT))
+ if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT)) {
+ seq_printf(m, "# boot_id: %pUb\n", tscratch->boot_id);
seq_printf(m, "%lx\t[kernel]\n", tscratch->text_addr);
- else
+ } else
seq_puts(m, "# Current\n");
}
^ permalink raw reply related
* [RFC PATCH 1/2] random: Expose boot ID to other subsystems
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add get_boot_id() to expose current boot ID to other kernel subsystems.
Note that since this is only meaningful if user can access it via sysctl,
it returns NULL if CONFIG_SYSCTL=n.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b4da1fb976c1..96a5a165627a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1615,6 +1615,25 @@ static int sysctl_random_write_wakeup_bits = POOL_READY_BITS;
static int sysctl_poolsize = POOL_BITS;
static u8 sysctl_bootid[UUID_SIZE];
+/**
+ * get_boot_id - return the boot ID UUID
+ *
+ * This function returns a pointer to the boot ID UUID, which is generated on
+ * demand the first time this function is called. The boot ID is a UUID that
+ * is unique to each boot of the system.
+ */
+const u8 *get_boot_id(void)
+{
+ static DEFINE_SPINLOCK(bootid_spinlock);
+
+ spin_lock(&bootid_spinlock);
+ if (!sysctl_bootid[8])
+ generate_random_uuid(sysctl_bootid);
+ spin_unlock(&bootid_spinlock);
+
+ return sysctl_bootid;
+}
+
/*
* This function is used to return both the bootid UUID, and random
* UUID. The difference is in whether table->data is NULL; if it is,
@@ -1638,12 +1657,8 @@ static int proc_do_uuid(const struct ctl_table *table, int write, void *buf,
uuid = tmp_uuid;
generate_random_uuid(uuid);
} else {
- static DEFINE_SPINLOCK(bootid_spinlock);
-
- spin_lock(&bootid_spinlock);
- if (!uuid[8])
- generate_random_uuid(uuid);
- spin_unlock(&bootid_spinlock);
+ /* Ensure that the boot ID is initialized. */
+ get_boot_id();
}
snprintf(uuid_string, sizeof(uuid_string), "%pU", uuid);
diff --git a/include/linux/random.h b/include/linux/random.h
index 8a8064dc3970..aaf630f14931 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -130,6 +130,15 @@ static inline int get_random_bytes_wait(void *buf, size_t nbytes)
return ret;
}
+#ifdef CONFIG_SYSCTL
+const u8 *get_boot_id(void);
+#else
+static inline const u8 *get_boot_id(void)
+{
+ return NULL;
+}
+#endif
+
#ifdef CONFIG_SMP
int random_prepare_cpu(unsigned int cpu);
int random_online_cpu(unsigned int cpu);
^ permalink raw reply related
* [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Masami Hiramatsu (Google) @ 2026-05-21 14:56 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
Hi,
Here is an RFC series to expose the boot ID (random UUID for each
boot) in the last_boot_info of the persistent ring buffer instance.
The persistent ring buffer can hold trace data beyond reboot/crashes.
This means the recorded data does not always come from the last boot.
Currently we just assume that the data comes from the last boot.
On the other hand, the kernel provides a random generated UUID for
each boot time, called "boot ID". If you record the logs with the
boot ID, it is easy to do cross-referencing it with other logs.
Similarly, recording the Boot ID for persistent ring buffer
instances would make it easier to determine which boot the read
data came from.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thank you,
---
Masami Hiramatsu (Google) (2):
random: Expose boot ID to other subsystems
tracing: Record and show boot ID in last_boot_info
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
kernel/trace/trace.c | 14 ++++++++++++--
3 files changed, 42 insertions(+), 8 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 1/3] rtla: Fix output files in source tree
From: Steven Rostedt @ 2026-05-21 14:48 UTC (permalink / raw)
To: Ben Hutchings
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Tomas Glozar, linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8YLYKZTKOq0t4x@decadent.org.uk>
On Thu, 21 May 2026 16:35:25 +0200
Ben Hutchings <benh@debian.org> wrote:
> Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
> example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
> currently generated in the source tree, preventing a fully out-of-tree
> build. To fix this:
>
> - Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
> create subdirectories as needed
> - Add $(OUTPUT)src to the include path
> - Add ${OUTPUT} to the BPF object filename in tests/timerlat.t
>
> Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
> Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
> Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
> Signed-off-by: Ben Hutchings <benh@debian.org>
Hi Ben,
Can you send this as a separate patch. The rtla code is handled via a
different tree than the perf code. So these patches will not be going
together.
Thanks,
-- Steve
^ permalink raw reply
* Re: [PATCH v4 2/2] tracing: Bound histogram expression strings with seq_buf
From: Steven Rostedt @ 2026-05-21 14:44 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260521022817.38453-2-pengpeng@iscas.ac.cn>
On Thu, 21 May 2026 10:28:17 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> @@ -1778,47 +1783,56 @@ static char *expr_str(struct hist_field *field, unsigned int level)
> if (!expr)
> return ERR_PTR(-ENOMEM);
>
> + seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
> +
> if (!field->operands[0]) {
> - expr_field_str(field, expr);
> + if (!expr_field_str(field, &s))
> + return ERR_PTR(-E2BIG);
> +
> return_ptr(expr);
> }
>
> if (field->operator == FIELD_OP_UNARY_MINUS) {
> - char *subexpr;
> + char *subexpr __free(kfree) = NULL;
>
> - strcat(expr, "-(");
> + seq_buf_puts(&s, "-(");
> subexpr = expr_str(field->operands[0], ++level);
> if (IS_ERR(subexpr))
> return subexpr;
>
> - strcat(expr, subexpr);
> - strcat(expr, ")");
> + seq_buf_puts(&s, subexpr);
> + seq_buf_putc(&s, ')');
> + seq_buf_str(&s);
>
> - kfree(subexpr);
> + if (seq_buf_has_overflowed(&s))
> + return ERR_PTR(-E2BIG);
>
> return_ptr(expr);
> }
Wouldn't the above if statement be a lot nicer as:
if (field->operator == FIELD_OP_UNARY_MINUS) {
char *subexpr;
subexpr = expr_str(field->operands[0], ++level);
if (IS_ERR(subexpr))
return subexpr;
seq_buf_printf(&s, "-(%s)", subexpr);
seq_buf_str(&s);
kfree(subexpr);
if (seq_buf_has_overflowed(&s))
return ERR_PTR(-E2BIG);
return_ptr(expr);
}
In fact, the above is so simple, you don't even need to use the __free()
guard on subexpr.
BTW, because currently seq_buf_printf() does add a '\0' to the string, I
may update the API for seq_buf to state that you don't need to terminate
after calling that function. So you can leave out the sub_buf_str() after
calling seq_buf_printf().
-- Steve
^ permalink raw reply
* Re: [PATCH v6 16/43] KVM: guest_memfd: Use actual size for invalidation in kvm_gmem_release()
From: Ackerley Tng @ 2026-05-21 14:40 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag8BmtzxTlcuA_zy@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -640,9 +640,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> }
>
> int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> - unsigned int fd, loff_t offset)
> + unsigned int fd, u64 offset)
> {
> - loff_t size = slot->npages << PAGE_SHIFT;
> + u64 size = slot->npages << PAGE_SHIFT;
> unsigned long start, end;
> struct gmem_file *f;
> struct inode *inode;
>
My mental model was:
+ offsets => loff_t
+ indices => pgoff_t
+ sizes => size_t
But looks like loff_t is more suitable for places where return values
(possibly negative) matter.
Good to go with u64!
> [...snip...]
>
^ permalink raw reply
* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Ackerley Tng @ 2026-05-21 14:36 UTC (permalink / raw)
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTyaBpTYsJRRyP09YggoHbi6s-ZgDoWoFgDRxO5k_BkoBw@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
>
> [...snip...]
>
>> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> + size_t nr_pages, pgoff_t *err_index)
>> +{
>> + struct address_space *mapping = inode->i_mapping;
>> + const int filemap_get_folios_refcount = 1;
>> + pgoff_t last = start + nr_pages - 1;
>> + struct folio_batch fbatch;
>> + bool safe = true;
>> + int i;
>> +
>> + folio_batch_init(&fbatch);
>> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
>> +
>> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> + struct folio *folio = fbatch.folios[i];
>> +
>> + if (folio_ref_count(folio) !=
>> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
>> + safe = false;
>> + *err_index = folio->index;
>> + break;
>
> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=11
>
Sashiko's first issue on lru is addressed in a separate patch later. :)
> Sashiko raised a few issues here, but I think this one might be
> genuine. Can you look into it please?
>
> If that's right, when huge page support lands, if start falls in the
> middle of a large folio, returning folio->index as the err_index will
> return an offset strictly less than the requested start. A naive
> userspace retry loop resuming from error_offset would step backwards
> and corrupt attributes on memory it didn't intend to convert.
> err_index should be clamped to max(start, folio->index).
>
For these ones, I was thinking to defer all the huge-page related issues
to be fixed when huge pages land, since there are probably quite a few
places to update.
On second thought, this isn't a huge change, I'll fix this in the next
revision.
> Cheers,
> /fuad
>
>> + }
>> + }
>> +
>>
>> [...snip...]
>>
^ permalink raw reply
* [PATCH 3/3] perf tools: Put Python bytecode in output directory
From: Ben Hutchings @ 2026-05-21 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]
The PMU events are processed into C sources by Python scripts, which
normally results in writing bytecode for each module into the source
tree. This prevents a fully out-of-tree build.
To fix this, set $PYTHONPYCACHEPREFIX to relocate the bytecode cache
directory in an out-of-tree build.
Signed-off-by: Ben Hutchings <benh@debian.org>
---
tools/perf/Makefile.perf | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 899a4249a42f..c35b65f9fdda 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -400,6 +400,11 @@ PYTHON_EXTBUILD_LIB := $(PYTHON_EXTBUILD)lib/
PYTHON_EXTBUILD_TMP := $(PYTHON_EXTBUILD)tmp/
export PYTHON_EXTBUILD_LIB PYTHON_EXTBUILD_TMP
+# Put Python bytecode in output directory
+ifdef OUTPUT
+export PYTHONPYCACHEPREFIX := $(OUTPUT)/__pycache__
+endif
+
python-clean := $(call QUIET_CLEAN, python) $(RM) -r $(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
# Use the detected configuration
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related
* [PATCH 2/3] perf tools: Put Python egg info in output directory
From: Ben Hutchings @ 2026-05-21 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar
Cc: linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
[-- Attachment #1: Type: text/plain, Size: 990 bytes --]
Installing the Python extension currently creates an egg-info
directory in the source tree, preventing a fully out-of-tree build.
Add the necessary runes to the setup.py comamnd line to relocate the
egg-info directory in an out-of-tree build.
Signed-off-by: Ben Hutchings <benh@debian.org>
---
tools/perf/Makefile.perf | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index cee19c923c06..899a4249a42f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1152,7 +1152,9 @@ install-bin: install-tools install-tests
install: install-bin try-install-man
install-python_ext:
- $(PYTHON_WORD) util/setup.py --quiet install --root='/$(DESTDIR_SQ)'
+ $(PYTHON_WORD) util/setup.py --quiet \
+ $(if $(OUTPUT),egg_info --egg-base $(OUTPUT)) \
+ install --root='/$(DESTDIR_SQ)'
# 'make install-doc' should call 'make -C Documentation install'
$(INSTALL_DOC_TARGETS):
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox