* Re: [PATCH v4 2/2] tracing: Bound histogram expression strings with seq_buf
From: Steven Rostedt @ 2026-05-21 14:44 UTC (permalink / raw)
To: Pengpeng Hou
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260521022817.38453-2-pengpeng@iscas.ac.cn>
On Thu, 21 May 2026 10:28:17 +0800
Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> @@ -1778,47 +1783,56 @@ static char *expr_str(struct hist_field *field, unsigned int level)
> if (!expr)
> return ERR_PTR(-ENOMEM);
>
> + seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
> +
> if (!field->operands[0]) {
> - expr_field_str(field, expr);
> + if (!expr_field_str(field, &s))
> + return ERR_PTR(-E2BIG);
> +
> return_ptr(expr);
> }
>
> if (field->operator == FIELD_OP_UNARY_MINUS) {
> - char *subexpr;
> + char *subexpr __free(kfree) = NULL;
>
> - strcat(expr, "-(");
> + seq_buf_puts(&s, "-(");
> subexpr = expr_str(field->operands[0], ++level);
> if (IS_ERR(subexpr))
> return subexpr;
>
> - strcat(expr, subexpr);
> - strcat(expr, ")");
> + seq_buf_puts(&s, subexpr);
> + seq_buf_putc(&s, ')');
> + seq_buf_str(&s);
>
> - kfree(subexpr);
> + if (seq_buf_has_overflowed(&s))
> + return ERR_PTR(-E2BIG);
>
> return_ptr(expr);
> }
Wouldn't the above if statement be a lot nicer as:
if (field->operator == FIELD_OP_UNARY_MINUS) {
char *subexpr;
subexpr = expr_str(field->operands[0], ++level);
if (IS_ERR(subexpr))
return subexpr;
seq_buf_printf(&s, "-(%s)", subexpr);
seq_buf_str(&s);
kfree(subexpr);
if (seq_buf_has_overflowed(&s))
return ERR_PTR(-E2BIG);
return_ptr(expr);
}
In fact, the above is so simple, you don't even need to use the __free()
guard on subexpr.
BTW, because currently seq_buf_printf() does add a '\0' to the string, I
may update the API for seq_buf to state that you don't need to terminate
after calling that function. So you can leave out the sub_buf_str() after
calling seq_buf_printf().
-- Steve
^ permalink raw reply
* Re: [PATCH 1/3] rtla: Fix output files in source tree
From: Steven Rostedt @ 2026-05-21 14:48 UTC (permalink / raw)
To: Ben Hutchings
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Tomas Glozar, linux-perf-users, linux-trace-kernel
In-Reply-To: <ag8YLYKZTKOq0t4x@decadent.org.uk>
On Thu, 21 May 2026 16:35:25 +0200
Ben Hutchings <benh@debian.org> wrote:
> Some output files (src/timerlat.bpf.o, src/timerlat.skel.h,
> example/timerlat_bpf_action.o, tests/bpf/bpf_action_map.o) are
> currently generated in the source tree, preventing a fully out-of-tree
> build. To fix this:
>
> - Add $(OUTPUT) to their filenames in the relevant Makefile rules, and
> create subdirectories as needed
> - Add $(OUTPUT)src to the include path
> - Add ${OUTPUT} to the BPF object filename in tests/timerlat.t
>
> Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples")
> Fixes: 0304a3b7ec9a ("rtla/timerlat: Add example for BPF action program")
> Fixes: 5525aebd4e0c ("rtla/tests: Test BPF action program")
> Signed-off-by: Ben Hutchings <benh@debian.org>
Hi Ben,
Can you send this as a separate patch. The rtla code is handled via a
different tree than the perf code. So these patches will not be going
together.
Thanks,
-- Steve
^ permalink raw reply
* [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Masami Hiramatsu (Google) @ 2026-05-21 14:56 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
Hi,
Here is an RFC series to expose the boot ID (random UUID for each
boot) in the last_boot_info of the persistent ring buffer instance.
The persistent ring buffer can hold trace data beyond reboot/crashes.
This means the recorded data does not always come from the last boot.
Currently we just assume that the data comes from the last boot.
On the other hand, the kernel provides a random generated UUID for
each boot time, called "boot ID". If you record the logs with the
boot ID, it is easy to do cross-referencing it with other logs.
Similarly, recording the Boot ID for persistent ring buffer
instances would make it easier to determine which boot the read
data came from.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thank you,
---
Masami Hiramatsu (Google) (2):
random: Expose boot ID to other subsystems
tracing: Record and show boot ID in last_boot_info
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
kernel/trace/trace.c | 14 ++++++++++++--
3 files changed, 42 insertions(+), 8 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [RFC PATCH 1/2] random: Expose boot ID to other subsystems
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add get_boot_id() to expose current boot ID to other kernel subsystems.
Note that since this is only meaningful if user can access it via sysctl,
it returns NULL if CONFIG_SYSCTL=n.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
drivers/char/random.c | 27 +++++++++++++++++++++------
include/linux/random.h | 9 +++++++++
2 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index b4da1fb976c1..96a5a165627a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1615,6 +1615,25 @@ static int sysctl_random_write_wakeup_bits = POOL_READY_BITS;
static int sysctl_poolsize = POOL_BITS;
static u8 sysctl_bootid[UUID_SIZE];
+/**
+ * get_boot_id - return the boot ID UUID
+ *
+ * This function returns a pointer to the boot ID UUID, which is generated on
+ * demand the first time this function is called. The boot ID is a UUID that
+ * is unique to each boot of the system.
+ */
+const u8 *get_boot_id(void)
+{
+ static DEFINE_SPINLOCK(bootid_spinlock);
+
+ spin_lock(&bootid_spinlock);
+ if (!sysctl_bootid[8])
+ generate_random_uuid(sysctl_bootid);
+ spin_unlock(&bootid_spinlock);
+
+ return sysctl_bootid;
+}
+
/*
* This function is used to return both the bootid UUID, and random
* UUID. The difference is in whether table->data is NULL; if it is,
@@ -1638,12 +1657,8 @@ static int proc_do_uuid(const struct ctl_table *table, int write, void *buf,
uuid = tmp_uuid;
generate_random_uuid(uuid);
} else {
- static DEFINE_SPINLOCK(bootid_spinlock);
-
- spin_lock(&bootid_spinlock);
- if (!uuid[8])
- generate_random_uuid(uuid);
- spin_unlock(&bootid_spinlock);
+ /* Ensure that the boot ID is initialized. */
+ get_boot_id();
}
snprintf(uuid_string, sizeof(uuid_string), "%pU", uuid);
diff --git a/include/linux/random.h b/include/linux/random.h
index 8a8064dc3970..aaf630f14931 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -130,6 +130,15 @@ static inline int get_random_bytes_wait(void *buf, size_t nbytes)
return ret;
}
+#ifdef CONFIG_SYSCTL
+const u8 *get_boot_id(void);
+#else
+static inline const u8 *get_boot_id(void)
+{
+ return NULL;
+}
+#endif
+
#ifdef CONFIG_SMP
int random_prepare_cpu(unsigned int cpu);
int random_online_cpu(unsigned int cpu);
^ permalink raw reply related
* [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Masami Hiramatsu (Google) @ 2026-05-21 14:57 UTC (permalink / raw)
To: Theodore Ts'o, Jason A . Donenfeld, Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Record the boot ID (random UUID for each boot) when tracing on the
persistent ring buffer and show it in the last_boot_info file.
User can use this boot ID when cross-referencing it with other logs.
For example:
# cat /proc/sys/kernel/random/boot_id
df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
(enable tracing on persistent instance and reboot)
# cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
# boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
ffffffff81000000 [kernel]
Thus, if user saves the other logs with this boot_id,
user can easily find the appropriate logs.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/trace.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..c56694bb5e0c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -34,10 +34,12 @@
#include <linux/percpu.h>
#include <linux/splice.h>
#include <linux/kdebug.h>
+#include <linux/random.h>
#include <linux/string.h>
#include <linux/mount.h>
#include <linux/rwsem.h>
#include <linux/slab.h>
+#include <linux/uuid.h>
#include <linux/ctype.h>
#include <linux/init.h>
#include <linux/panic_notifier.h>
@@ -4804,6 +4806,7 @@ struct trace_mod_entry {
struct trace_scratch {
unsigned int clock_id;
unsigned long text_addr;
+ u8 boot_id[UUID_SIZE];
unsigned long nr_entries;
struct trace_mod_entry entries[];
};
@@ -4921,6 +4924,7 @@ static void update_last_data(struct trace_array *tr)
/* Reset the module list and reload them */
if (tr->scratch) {
struct trace_scratch *tscratch = tr->scratch;
+ const u8 *boot_id = get_boot_id();
tscratch->clock_id = tr->clock_id;
memset(tscratch->entries, 0,
@@ -4929,6 +4933,11 @@ static void update_last_data(struct trace_array *tr)
guard(mutex)(&scratch_mutex);
module_for_each_mod(save_mod, tr);
+ /* Also, update boot ID if exists */
+ if (boot_id)
+ memcpy(tscratch->boot_id, boot_id, sizeof(tscratch->boot_id));
+ else
+ memset(tscratch->boot_id, 0, sizeof(tscratch->boot_id));
}
/*
@@ -5822,9 +5831,10 @@ static void show_last_boot_header(struct seq_file *m, struct trace_array *tr)
* Otherwise it shows the KASLR address from the previous boot which
* should not be the same as the current boot.
*/
- if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT))
+ if (tscratch && (tr->flags & TRACE_ARRAY_FL_LAST_BOOT)) {
+ seq_printf(m, "# boot_id: %pUb\n", tscratch->boot_id);
seq_printf(m, "%lx\t[kernel]\n", tscratch->text_addr);
- else
+ } else
seq_puts(m, "# Current\n");
}
^ permalink raw reply related
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Lorenzo Stoakes @ 2026-05-21 14:59 UTC (permalink / raw)
To: Nico Pache
Cc: David Hildenbrand (Arm), Wei Yang, Lance Yang, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange, akpm,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <CAA1CXcCNT51jeXh6Kwg1QN9e+AJB-1hg21kmeY6fTTKr2GACug@mail.gmail.com>
On Tue, May 19, 2026 at 01:05:13PM -0600, Nico Pache wrote:
> On Mon, May 18, 2026 at 1:33 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Mon, May 18, 2026 at 03:16:11PM +0200, David Hildenbrand (Arm) wrote:
> > > > For me, I would vote for fallback to 0.
> > >
> > > At this point I'll prefer to not return errors from collapse_max_ptes_none().
> > > It's just rather awkward to return an error deep down in collapse code for a
> > > configuration problem.
> > >
> > > For mthp collapse, we only support max_ptes_none==0 and
> > > max_ptes_none=="HPAGE_PMD_NR - 1" (default).
> > >
> > > If another value is specified while collapsing mTHP, print a warning and treat
> > > it as 0 (save value, no creep, no memory waste).
> > >
> > > In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
> > > for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
> > > warning, because we would issue a warning with the default settings).
> > >
> > > @Lorenzo, fine with you?
> >
> > Yes 100%, this sounds sensible both in terms of the error and the default. Let's
> > keep our lives simple(-ish) please :)
>
> Ok thank you im glad we finally came to consensus on this! phew!
>
It happens sometimes ;)
Cheers, Lorenzo
^ permalink raw reply
* Re: [RFC PATCH 0/2] random: tracing: Expose last boot ID on persistent instance
From: Mathieu Desnoyers @ 2026-05-21 14:59 UTC (permalink / raw)
To: Masami Hiramatsu (Google), Theodore Ts'o, Jason A . Donenfeld,
Steven Rostedt
Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177937541909.2596845.17729857441826694760.stgit@mhiramat.tok.corp.google.com>
On 2026-05-21 10:56, Masami Hiramatsu (Google) wrote:
> Hi,
>
> Here is an RFC series to expose the boot ID (random UUID for each
> boot) in the last_boot_info of the persistent ring buffer instance.
>
> The persistent ring buffer can hold trace data beyond reboot/crashes.
> This means the recorded data does not always come from the last boot.
> Currently we just assume that the data comes from the last boot.
>
> On the other hand, the kernel provides a random generated UUID for
> each boot time, called "boot ID". If you record the logs with the
> boot ID, it is easy to do cross-referencing it with other logs.
>
> Similarly, recording the Boot ID for persistent ring buffer
> instances would make it easier to determine which boot the read
> data came from.
>
> For example:
>
> # cat /proc/sys/kernel/random/boot_id
> df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
>
> (enable tracing on persistent instance and reboot)
>
> # cat /sys/kernel/tracing/instances/ptracingtest/last_boot_info
> # boot_id: df152e7a-c0a7-4d32-9f0b-7f5c39fb7b68
> ffffffff81000000 [kernel]
>
FWIW, we've used boot id to uniquely identify traces belonging
to a given kernel execution and allow validation that traces
can indeed be correlated across CPU and across kernel vs userspace
for years in LTTng.
Good to see this approach proposed for Ftrace as well.
Thanks,
Mathieu
>
> Thank you,
>
> ---
>
> Masami Hiramatsu (Google) (2):
> random: Expose boot ID to other subsystems
> tracing: Record and show boot ID in last_boot_info
>
>
> drivers/char/random.c | 27 +++++++++++++++++++++------
> include/linux/random.h | 9 +++++++++
> kernel/trace/trace.c | 14 ++++++++++++--
> 3 files changed, 42 insertions(+), 8 deletions(-)
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply
* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-21 15:09 UTC (permalink / raw)
To: Lance Yang
Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <f76b79d3-080a-4931-873e-99d4b3e1020f@linux.dev>
On Sat, May 16, 2026 at 12:06:14PM +0800, Lance Yang wrote:
>
>
> On 2026/5/15 21:13, Breno Leitao wrote:
> [...]
> > >
> > > Wonder if it would be simpler to just do a positive check near the top
> > > of get_any_page() instead. Something like:
> > >
> > > static bool hwpoison_unrecoverable_kernel_page(struct page *page,
> > > unsigned long flags)
> >
> > Ack. We probably want to call it something like HWPoisonKernelOwned() to
> > follow the same naming sematics of these helpers, such as HWPoisonHandlable()
> >
> > By the way, I will re-include the self test back to this patch series,
> > In case they are not useful, we do not merge it.
> >
>
> Sounds good :)
>
> Can you also test the relevant page types if possible, especially
> the ones the new helper is supposed to classify?
Ack. I will expand the test to cover different page types as well!
Thanks,
--breno
^ permalink raw reply
* Re: [RFC PATCH 2/2] tracing: Record and show boot ID in last_boot_info
From: Steven Rostedt @ 2026-05-21 15:16 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Theodore Ts'o, Jason A . Donenfeld, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel
In-Reply-To: <177937543666.2596845.9748178606108139386.stgit@mhiramat.tok.corp.google.com>
On Thu, 21 May 2026 23:57:16 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> @@ -4804,6 +4806,7 @@ struct trace_mod_entry {
> struct trace_scratch {
> unsigned int clock_id;
> unsigned long text_addr;
> + u8 boot_id[UUID_SIZE];
> unsigned long nr_entries;
> struct trace_mod_entry entries[];
> };
I just don't like wasting scratch space if boot_id isn't defined. But I
can't figure out a way to optionally have it there without wasting space
anyway.
If the get_boot_id() is accepted by the random folks, then I'm fine with
this change.
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Steven Rostedt @ 2026-05-21 15:23 UTC (permalink / raw)
To: Li Pengfei
Cc: linux-trace-kernel, mhiramat, linux-kernel, cmllamas, zhangbo56,
lipengfei28
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
On Thu, 14 May 2026 11:49:13 +0800
Li Pengfei <ljdlns1987@gmail.com> wrote:
> From: Pengfei Li <lipengfei28@xiaomi.com>
>
> Hi Steven, all,
>
Hi Pengfei,
Can you address the Sashiko reviews:
https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
It has a way to copy the comments. Just reply to this series with a past of
Sashiko's review and reply to them to explain why the comments may not be
an issue, or submit a new version with fixes if they are issues.
Thanks,
-- Steve
^ permalink raw reply
* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Wen Yang @ 2026-05-21 17:40 UTC (permalink / raw)
To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <5183dc18d63b617ab0f19290e8a790fa6898f372.camel@redhat.com>
On 5/19/26 19:14, Gabriele Monaco wrote:
> Hi Wen,
>
> On Mon, 2026-05-18 at 01:13 +0800, Wen Yang wrote:
>>
>> Yes. The ftracetest check_requires logic calls `command -v <binary>` to
>> satisfy `requires: <name>:program` directives. Without the script's
>> directory in PATH those checks evaluate to exit_unsupported and test cases
>> are skipped rather than run. The make path avoids this only because make
>> sets OUTDIR and the runner appends it to PATH internally.
>>
>
> So you're overriding PATH so the selftest's binaries can be found from
> the test, right?
>
> Wouldn't it be simpler to just put the absolute paths in the tests and
> don't touch PATH.
>
> If the selftests are run via makefile, it ensures the required binaries
> are built and available, so there's no need to go through the `requires:
> <name>:program` infrastructure (that's more about what's installed on
> the system.
>
> Or if you don't want anything hardcoded, you could pass the $OUTDIR from
> the environment and use that in scripts, whatever looks cleaner.
>
> Does it make sense to you?
>
>>
>> -- Patch 04: pre-allocated storage pool
>>
>> > Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
>>
>> User-mode uprobe handlers run with preempt_count == 0, fully preemptible on
>> both PREEMPT_RT and non-PREEMPT_RT. The strongest evidence is in tlob
>> itself: tlob_start_task() takes a mutex and calls kmem_cache_zalloc(...,
>> GFP_KERNEL) from the uprobe entry handler; both are illegal in atomic
>> context and would trigger lockdep splats immediately.
>>
>> On PREEMPT_RT, spinlock_t becoming a sleeping lock in the uprobe handler
>> iscfine: both call sites (da_create_or_get_pool() from the handler and
>> da_pool_return_cb() from the rcuc kthread) are in sleepable task context.
>>
>
> Yeah exactly, the uprobe is fine with anything (even the automatic
> `kmalloc_nolock`), but sure preallocation at least guarantees the slots
> are there.
>
>> > We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
>> > DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
>> > DA_MON_POOL_SIZE to be defined (force that with an #error).
>> >
>> > Anyway, this way you probably wouldn't need to define a different init
>> > function and let everything handled more transparently.
>> >
>> > Also you don't need to call da_create_or_get() explicitly,
>> > da_handle_start_event() should do it for you.
>>
>> Agreed on all counts. We plan to implement this in v3 as follows.
>> The three strategies would be a compile-time selection in da_monitor.h:
>>
>> DA_ALLOC_AUTO (default) - lock-free kmalloc_nolock on the hot path;
>> unbounded capacity.
>>
>> DA_ALLOC_POOL - pre-allocated fixed-size pool;
>> DA_MON_POOL_SIZE
>> required, enforced with #error if missing.
>>
>> DA_ALLOC_MANUAL - caller pre-inserts storage via
>> da_create_empty_storage() before the first
>> da_handle_start_event(); the framework only
>> links the target field.
>>
>> da_monitor_init_prealloc() would be removed; da_monitor_init() would
>> select pool or kmalloc initialisation internally based on the strategy.
>>
>> da_handle_start_event() and da_handle_start_run_event() would both call
>> da_prepare_storage() at compile time:
>>
>> DA_ALLOC_AUTO -> da_create_storage() (kmalloc_nolock)
>> DA_ALLOC_POOL -> da_create_or_get_pool()
>> DA_ALLOC_MANUAL -> da_fill_empty_storage() (link target into pre-
>> allocated slot; no allocation on the hot path)
>>
>> No explicit da_create_or_get() call would be needed in any monitor.
>>
>> da_create_or_get_kmalloc() would be removed: as you noted, a caller that
>> uses kmalloc_nolock does so because locking is forbidden; a GFP_KERNEL
>> fallback is equally forbidden if the lockless attempt fails, so the
>> function has no viable use case.
>>
>> tlob would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
>> #define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
>>
>> nomiss would define:
>> #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
>>
>> and call da_create_empty_storage() from handle_sys_enter() (the
>> sched_setscheduler syscall path), which runs in safe task context;
>> da_fill_empty_storage() would then link the sched_dl_entity target on
>> the first da_handle_start_run_event() call in handle_sched_switch().
>
> Yeah good point, there's no need to make it a special path even if we
> have the target ready, da_handle_start_run_event() can do it just fine.
>
>>
>>
>> -- Patch 05: generic uprobe infrastructure
>>
>> Carried unchanged into v3 (as part of the 08-b split described below).
>>
>>
>> -- Patch 06: rvgen __init arrow reset
>>
>> Thanks, carried unchanged into v3.
>>
>
> Well, if you don't need reset() on the __init arrow we can drop this,
> right? Also it doesn't seem fully wired with the rest and requires a
> separate event to do handle_monitor_start(), which can be only just
> another handler for tlob, nothing general.
>
>>
>> > Why don't you make it a separate event (e.g. "start_tlob") [...] then
>> > you also wouldn't need to call reset() and start_timer() manually.
>>
>> Good suggestion. We plan to use a dedicated start_tlob event instead,
>> with a self-loop in tlob.dot:
>>
>> "running" -> "running" [ label = "start;reset(clk_elapsed)" ]
>>
>> da_handle_start_run_event(task->pid, ws, start_tlob) would put the
>> monitor into running and deliver start_tlob, which resets clk_elapsed
>> and arms the budget hrtimer via the generated ha_setup_invariants() —
>> no manual reset() or start_timer() calls needed.
>>
>> One guard would be added in tlob's ha_setup_invariants() to make the
>> self-loop work correctly:
>>
>> if (next_state == curr_state && event != start_tlob)
>> return;
>>
>> Without this, the start_tlob self-loop would be treated the same as any
>> repeated switch_in (already running) and ha_setup_invariants() would
>> return early, leaving the timer unarmed. Does this look right to you?
>>
>
> If you just add a separate event rvgen should take care of everything,
> you should be able to take ha_verify_constraint() and friends as-is from
> the generated code.
>
> But yeah, that's what it would end up doing.
>
>>
>> -- Patch 08: tlob monitor
>>
>> --- Patch structure ---
>>
>> > Could you have everything that isn't strictly tlob-related in another
>> > patch.
>>
>> Agreed. With the ioctl interface deferred (see below), v3 would keep
>> patch 08 as the tlob monitor only:
>>
>> 05-b: rv: extend uprobe API with three-phase detach helpers
>> (rv_uprobe.c, rv_uprobe.h, rv_uprobe_detach refactoring)
>> — extension of patch 05, independent of tlob
>>
>> 08: rv/tlob: add the tlob monitor itself
>> (tlob.c, tlob.h, tlob_trace.h, Kconfig/Makefile, Documentation,
>> rv_trace.h include; ha_monitor.h EVENT_NONE_LBL override
>> bundled here as it is only needed by tlob)
>>
>> The chardev infrastructure (rv_chardev.c, rv.h additions) and the UAPI
>> header (include/uapi/linux/rv.h) would move to a follow-up series
>> together with the ioctl self-instrumentation feature.
>>
>> --- ioctl interface design ---
>>
>> > I'm not particularly fond of ioctls, they aren't that flexible and in
>> > this way I don't really see an added value.
>> > [...] cannot the same thing be achieved using uprobes alone, e.g. by
>> > registering a function address or the current instruction pointer?
>> > [...] wouldn't a sysfs/tracefs file achieve a similar purpose without
>> > much of the boilerplate code?
>>
>> Fair point. We plan to ship v3 with the tracefs/uprobe interface only
>> and defer the ioctl (/dev/rv chardev) to a follow-up series once there
>> is a concrete in-tree user that requires it.
>>
>> The unique value of the ioctl is that TLOB_IOCTL_TRACE_STOP returns a
>> synchronous per-call result (-EOVERFLOW or 0) to the calling thread,
>> which neither uprobes nor tracefs writes can provide. We want to keep
>> that option open for later, but agree it should not block the initial
>> tlob submission.
>>
>> Does this approach work for you?
>>
>> What is your preference?
>
> Yeah looks good to me.
> Ioctls are cumbersome to set up also for the user, perhaps another sysfs
> file in the monitor directory would keep the control entirely in tlob.c
> and give you roughly the same value with easier setup.
>
> Heck we might even think of an RV reactor that does that: e.g. creates a
> file where reads sleep until the first reaction (-EOVERFLOW) and returns
> 0 in other scenarios. I'm gonna have a thought on that, but anyway I
> don't see why a sysfs file cannot do this.
>
> Let's defer it for now.
>
>>
>> --- Handler simplification ---
>>
>> > Perhaps keep the handler simpler by moving this reporting to a helper
>> > function and use guard(rcu)() there.
>>
>> Done. The accumulation logic is extracted into three inline helpers, each
>> using scoped_guard(rcu) and returning bool (true if the task is monitored):
>>
>> tlob_acc_running(prev) - accumulate running_ns on sched-out
>> tlob_acc_waiting(next) - accumulate waiting_ns on sched-in
>> tlob_acc_sleeping(task) - accumulate sleeping_ns on wakeup
>>
>> handle_sched_switch() and handle_sched_wakeup() become one-liners:
>>
>> static void handle_sched_switch(...)
>> {
>> bool prev_preempted = (prev_state == 0);
>>
>> if (tlob_acc_running(prev))
>> da_handle_event(prev->pid, NULL,
>> prev_preempted ? preempt_tlob : sleep_tlob);
>> if (tlob_acc_waiting(next))
>> da_handle_event(next->pid, NULL, switch_in_tlob);
>> }
>
> Yeah sounds good.
>
>>
>> > You probably don't need these. da_handle_event should skip tasks without
>> > a monitor.
>>
>> Agreed; the do_prev/do_next flags are gone. The helpers return false
>> for unmonitored tasks, and da_handle_event() skips them too — both paths
>> are no-ops for tasks with no pool entry.
>>
>> --- scoped_guard(rcu) ---
>>
>> > That should be a scoped_guard(rcu), definitely use guards if you have
>> > return paths, the compiler is going to clean up (unlock) for you.
>>
>> Applied to all RCU-protected sections in tlob_start_task() and
>> tlob_stop_task(). tlob_start_task() now uses guard(mutex) for the
>> serialised duplicate-check (replacing the explicit mutex_lock/unlock),
>> and tlob_stop_task() uses scoped_guard(rcu) for the atomic CAS section:
>>
>> scoped_guard(rcu) {
>> ws = da_get_target_by_id(task->pid);
>> if (!ws)
>> return -ESRCH;
>> ...
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return -EAGAIN;
>> }
>
> Perfect.
>
>>
>> --- tlob_stop_all removal ---
>>
>> > All this function does should be done by da_monitor_destroy. We could
>> > add a way to pass some additional deallocation for all the other cleanup
>> > you're doing on each storage. Something like a da_extra_cleanup() you
>> > can define as whatever you need and gets called in all per-obj
>> > destruction paths.
>>
>> Agreed. tlob_stop_all() (~50 lines) has been removed entirely.
>>
>> A da_extra_cleanup() hook macro is introduced in da_monitor.h: the default
>> is a no-op; a monitor may override it before including the header. tlob
>> defines:
>>
>> static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
>> {
>> struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>> struct tlob_task_state *ws = da_get_target(ha_mon);
>>
>> if (!ws)
>> return;
>> if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>> return;
>
>> ha_cancel_timer_sync(ha_mon);
> After my patch making timer callbacks RCU read-side critical section,
> you won't need that, just let the usual reset asynchronously stop the
> timer and put everything that needs it stopped in your RCU callback.
>
> Of course make sure the timer was stopped before this extra cleanup, so
> put the macro accordingly.
>
> I don't think da_extra_cleanup in general should be expected to sleep
> and call_rcu should do the heavy lifting (it may run from any tracepoint).
>
> Anyway we can see it later after that's merged.
>
>> atomic_dec(&tlob_num_monitored);
>> put_task_struct(ws->task);
>> call_rcu(&ws->rcu, tlob_free_rcu);
>> }
>> #define da_extra_cleanup tlob_extra_cleanup
>>
>> da_monitor_destroy() iterates remaining entries via da_extra_cleanup +
>> hash_del_rcu + call_rcu, then waits for all callbacks via rcu_barrier().
>> tlob's disable path is now simply:
>>
>> static void __tlob_destroy_monitor(void)
>> {
>> da_monitor_destroy();
>> }
>
> Looks good, let's see the full picture.
>
>> --- EVENT_NONE_LBL ---
>>
>> > Why don't you just override EVENT_NONE_LBL (and if you prefer call it
>> > MONITOR_TIMER_EVENT_NAME) without the need for another function?
>>
>> Done. model_get_timer_event_name() has been removed from automata.h.
>> In ha_monitor.h, EVENT_NONE_LBL is now overridable:
>>
>> #ifndef EVENT_NONE_LBL
>> #define EVENT_NONE_LBL "none"
>> #endif
>>
>> tlob.c defines it before including the model header:
>>
>> #define EVENT_NONE_LBL "budget_exceeded"
>>
>> The two call sites in ha_monitor.h that previously called
>> model_get_timer_event_name() now use EVENT_NONE_LBL directly.
>>
>> --- KUnit config / tristate ---
>>
>> > Do you need to add this here? Since you have a patch adding KUnit tests
>> > to tlob, cannot you put everything kunit-related there?
>> > I couldn't build it as module.
>>
>> Agreed on moving the Kconfig entry to patch 09.
>>
>> The module build issue is fixed by exporting the symbols needed by the
>> test via EXPORT_SYMBOL_IF_KUNIT (EXPORTED_FOR_KUNIT_TESTING namespace);
>> tlob_kunit.c imports the namespace with MODULE_IMPORT_NS. We plan to
>> keep tristate rather than changing to bool. Does that work for you?
>
> Yeah it's good as long as it works as module too.
>
> I might have a look at making my patch module-ready, for now it just
> can't work but I wonder if we can do something nicer to allow it
> (like in your case a bunch of exports, a separate file and a standalone
> testcase, perhaps all wrapped in some helper).
>
>>
>> --- detail_env_tlob tracepoint ---
>>
>> > Since you are not documenting the detail_env_tlob tracepoint, is it
>> > something really required? I would at the very least document its usage.
>>
>> Fair point. detail_env_tlob emits (running_ns, waiting_ns, sleeping_ns)
>> so the user can see which phase consumed the budget: high sleeping_ns
>> indicates I/O latency, high waiting_ns indicates scheduler pressure, high
>> running_ns indicates a compute overrun. Without this breakdown the user
>> only knows the total elapsed time exceeded the threshold, not why.
>>
>
> Alright, then this can go into the docs.
>
>>
>> --- Documentation ---
>>
>> > This is standard tracepoints usage, there's nothing about tlob we should
>> > document here.
>> > Same here, standard RV [for enable/desc tracefs files].
>> > And this is duplicating what mentioned above about uprobes, isn't it?
>>
>> Agreed. The following have been removed:
>>
>> - "Violation events" section: generic trace-cmd examples and cat-trace
>> instructions (standard tracepoints usage).
>> - tracefs files: "enable (rw)" and "desc (ro)" entries (standard RV).
>> - tracefs files: "monitor (rw)" description condensed to one line with
>> a cross-reference to the uprobes section above.
>>
>> In their place, a new "Violation tracepoints" subsection documents both
>> tlob-specific tracepoints with fields and a worked example:
>>
>> error_env_tlob: id, state, event ("budget_exceeded"), env ("clk_elapsed")
>>
>> detail_env_tlob: id, threshold_us, running_ns, waiting_ns, sleeping_ns
>> Use sleeping_ns to diagnose I/O latency, waiting_ns for scheduler
>> pressure, running_ns for compute overruns.
>>
>> Example:
>> trace-cmd record -e error_env_tlob -e detail_env_tlob &
>> # ... run workload ...
>> trace-cmd report
>
> Yeah sounds good, also pointing out to enable the monitor.
> We might think of a general way to do this kind of thing in
> tools/rv, although detail_env_tlob is non-standard.
>
>> > Is kernel code going to use this API? RV monitors are meant to be
>> > enabled by userspace. What's the use-case here?
>>
>> Agreed. The uprobe interface is driven from userspace; tlob_start_task()
>> and tlob_stop_task() are the internal implementation functions it calls,
>> not a public API for external kernel modules. The hypothetical
>> kernel-module use case would be removed from the documentation; the
>> kernel-doc block is retained for code maintainers.
>>
>> > That's probably a bit too detailed for this page. If you really want
>> > this information somewhere couldn't it stay in the code?
>>
>> Agreed; moved to comments in handle_sched_switch() and
>> handle_sched_wakeup(). The "Limitations" subsection is retained.
>>
>> -- Patch 09: KUnit tests
>>
>> > What caught my eyes are tests enrolling tracepoints handlers. If you
>> > go there you're no longer doing unit testing, what's the advantage of
>> > testing the entire monitor here over doing that in selftests?
>>
>> Agreed. The three suites that register tracepoint handlers or create
>> kthreads (tlob_sched_integration, tlob_trace_output, tlob_violation_react)
>> have been removed from KUnit and will be added to selftests in v3.
>>
>> Two pure unit test suites remain in KUnit:
>>
>> tlob_task_api:
>> Tests tlob_start_task / tlob_stop_task return values (-ENODEV,
>> -EALREADY, -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE) via direct calls
>> (these functions are the internal implementation used by both the
>> uprobe and, in future, the ioctl interface).
>> No tracepoints, no scheduling.
>>
>> tlob_uprobe_format:
>> Tests the uprobe line parser (tlob_parse_uprobe_line,
>> tlob_parse_remove_line) against valid and invalid input strings.
>> Pure string parsing; no scheduling, no tracepoints.
>>
>> This also resolves the tristate-vs-bool issue: with only pure unit tests
>> there is no dependency on sched_setscheduler_nocheck, so bool is correct.
>>
>
> Yeah looks good.
>
>>
>> -- Patch 10: selftests
>>
>> --- PREEMPT_RT RCU stall ---
>>
>> > I run it on a VM and have it hanging at step 9 [...] rcu_preempt stall.
>> > Did you see that? Am I doing something wrong?
>>
>> Thanks for reporting. The patch changed ha_monitor.h from
>> HRTIMER_MODE_REL_HARD (the existing upstream value) to REL_SOFT; the
>> stall appeared on PREEMPT_RT after that change. We have not fully
>> confirmed whether REL_SOFT is the root cause — REL_SOFT defers the
>> callback to the ktimers kthread, which could starve rcu_preempt under
>> certain PREEMPT_RT configurations, but other factors may be involved.
>>
>> We plan to revert to HRTIMER_MODE_REL_HARD at both sites in ha_monitor.h
>> as the conservative choice:
>>
>> ha_setup_timer(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>> ha_start_timer_ns(): HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>>
>> Do you have more insight into the stall, or does REL_HARD resolve it on
>> your setup?
>
> Right, good point, any specific reason why you wanted REL_SOFT?
>
> I indeed always test under PREEMPT_RT but I still see the same splat
> also after reverting REL_HARD..
> Could you reproduce it on your setup?
>
> My config is nothing special: what vng gives you adding
> PREEMPT_RT/RCU_PREEMPT and lockdep (PROVE_LOCKING/PROVE_RCU).
>
Hi Gabriele,
No specific reason for REL_SOFT — not intentional, reverting to
REL_HARD.
Reproduced the stall on the same config (PREEMPT_RT +
PROVE_LOCKING/PROVE_RCU).
Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc,
unrelated to REL_SOFT/REL_HARD:
# original cleanup — wrong order
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" # (A)
kill "$hog_pid" # (B)
(A)
triggers synchronize_srcu() in the kernel. But tlob_target is stuck
mid-uprobe_notify_resume holding an SRCU read lock, preempted by the
FIFO-99 hog -> so the reader never finishes and (B) is never reached.
rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
Fix: kill the hog first:
kill "$hog_pid"; wait "$hog_pid"
echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"
On the PREEMPT_RT it is more reliably triggered there because rcuc/0
runs as a preemptible kthread rather than in softirq, making it easier
for the hog to monopolise the CPU long enough to hit the stall.
Thank you for the thorough review and valuable suggestions. We are
working through all of them and running the full test suite.
We expect to post v3 within the next two days.
--
Best wishes,
Wen
>>
>> --- Selftest structure ---
>>
>> > This should be tested together with the other monitors (enable/disable),
>> > we could at most expand those with the check_requires.
>> > Let's focus on tlob-only features in this patch.
>>
>> Agreed. In v3 we plan to drop tracefs.tc (covered by the generic
>> rv_monitor_enable_disable.tc) and keep only the six uprobe-specific
>> test cases under test.d/tlob/
>>
>> ioctl.tc is deferred with the ioctl interface to the follow-up series.
>> The KUnit integration tests (sched_switch accounting, budget-expiry
>> tracepoint) would be moved to selftests as additional test cases.
>>
>
> Thanks,
> Gabriele
>
^ permalink raw reply
* Re: [PATCH 0/3] Fix out-of-tree build of some tools
From: Ian Rogers @ 2026-05-21 17:50 UTC (permalink / raw)
To: Ben Hutchings
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Steven Rostedt, Tomas Glozar, linux-perf-users,
linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>
On Thu, May 21, 2026 at 7:57 AM Ben Hutchings <benh@debian.org> wrote:
>
> perf and rtla currently don't fully support out-of-tree builds, as
> they may still create files in their source directory. This series
> fixes all the instances of this problem that I have found.
>
> Ben.
Thanks for the cleanup! I ran Sashiko locally and it noted that we
should also fix the clean steps to remove these files (I'll copy the
full text below).
Reviewed-by: Ian Rogers <irogers@google.com>
```
commit cf212be834ce6b0df5cccdbc4ad977a917ff8e48
Author: Ben Hutchings <benh@debian.org>
perf tools: Put Python egg info in output directory
Installing the Python extension currently creates an egg-info
directory in the source tree, preventing a fully out-of-tree build.
Add the necessary runes to the setup.py comamnd line to relocate the
egg-info directory in an out-of-tree build.
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index fc92d6ceac5b..b9a9bdddb89a 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -905,7 +910,9 @@ install-bin: install-tools install-tests
> install: install-bin try-install-man
>
> install-python_ext:
> - $(PYTHON_WORD) util/setup.py --quiet install
--root='/$(DESTDIR_SQ)'
> + $(PYTHON_WORD) util/setup.py --quiet \
> + $(if $(OUTPUT),egg_info --egg-base $(OUTPUT)) \
> + install --root='/$(DESTDIR_SQ)'
This relocates the egg-info directory to the output directory, which
helps keep the source tree clean. However, does the `python-clean`
target need to be updated to clean up this directory?
Currently, `python-clean` is defined as:
python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
Should we also clean `$(OUTPUT)/perf.egg-info` (or `perf.egg-info` for
in-tree builds) to ensure `make clean` removes all generated python
metadata? For example:
python-clean := $(call QUIET_CLEAN, python) \
$(RM) -r $(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so \
$(OUTPUT)perf.egg-info perf.egg-info
commit dd754ea75966ec7ade41500afeb32d944f20a8f9
Author: Ben Hutchings <benh@debian.org>
perf tools: Put Python bytecode in output directory
The PMU events are processed into C sources by Python scripts,
which
normally results in writing bytecode for each module into the
source
tree. This prevents a fully out-of-tree build.
To fix this, set $PYTHONPYCACHEPREFIX to relocate the bytecode cache
directory in an out-of-tree build.
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index fc92d6ceac5b..b9a9bdddb89a 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -400,6 +400,11 @@ PYTHON_EXTBUILD_LIB :=
$(PYTHON_EXTBUILD)lib/
> PYTHON_EXTBUILD_TMP := $(PYTHON_EXTBUILD)tmp/
> export PYTHON_EXTBUILD_LIB PYTHON_EXTBUILD_TMP
>
> +# Put Python bytecode in output directory
> +ifdef OUTPUT
> +export PYTHONPYCACHEPREFIX := $(OUTPUT)/__pycache__
> +endif
> +
> python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so
Similar to the egg-info relocation, does `python-clean` need to
clean up
the relocated `__pycache__` directory under `$(OUTPUT)`?
If `PYTHONPYCACHEPREFIX` is exported to `$(OUTPUT)/__pycache__`, running
`make clean` currently leaves this directory behind. Should we add it to
the `RM` list in `python-clean`? For example:
python-clean := $(call QUIET_CLEAN, python) \
$(RM) -r $(PYTHON_EXTBUILD)
$(OUTPUT)python/perf*.so \
$(if $(OUTPUT),$(OUTPUT)/__pycache__)
```
Thanks,
Ian
> Ben Hutchings (3):
> rtla: Fix output files in source tree
> perf tools: Put Python egg info in output directory
> perf tools: Put Python bytecode in output directory
>
> tools/perf/Makefile.perf | 9 ++++++++-
> tools/tracing/rtla/Makefile | 31 ++++++++++++++++++-----------
> tools/tracing/rtla/tests/timerlat.t | 4 ++--
> 3 files changed, 29 insertions(+), 15 deletions(-)
>
^ permalink raw reply
* [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Yash Suthar @ 2026-05-21 19:28 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel, skhan,
Yash Suthar
Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
uprobe_buffer_disable().
BUG_ON() will crash the kernel. mutex_is_locked() only checks
if any task holds lock,but not the caller task. lockdep_assert_held()
also check current task for lock and no crash on true condition.
Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
---
kernel/trace/trace_uprobe.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2cabf8a23ec5..aee0960d0cf7 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
{
int ret = 0;
- BUG_ON(!mutex_is_locked(&event_mutex));
+ lockdep_assert_held(&event_mutex);
if (uprobe_buffer_refcnt++ == 0) {
ret = uprobe_buffer_init();
@@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
{
int cpu;
- BUG_ON(!mutex_is_locked(&event_mutex));
+ lockdep_assert_held(&event_mutex);
if (--uprobe_buffer_refcnt == 0) {
for_each_possible_cpu(cpu)
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v6 21/43] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-05-21 21:27 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag8G7Wq5PbEdKloG@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Thu, May 21, 2026, Fuad Tabba wrote:
>> Hi,
>>
>> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Michael Roth <michael.roth@amd.com>
>> >
>> > For vm_memory_attributes=1, in-place conversion/population is not
>> > supported, so the initial contents necessarily must need to come
>> > from a separate src address, which is enforced by the current
>> > implementation. However, for vm_memory_attributes=0, it is possible for
>> > guest memory to be initialized directly from userspace by mmap()'ing the
>> > guest_memfd and writing to it while the corresponding GPA ranges are in
>> > a 'shared' state before converting them to the 'private' state expected
>> > by KVM_SEV_SNP_LAUNCH_UPDATE.
>> >
>> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>> > copy in data from a separate memory location. Continue to enforce
>> > non-NULL for the original vm_memory_attributes=1 case.
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> > [Added src_page check in error handling path when the firmware command fails]
>> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> I'm not very familiar with the SEV-SNP populate flows, but it looks
>> like Sashiko is on to something:
>> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=21
>>
>> - a potential read-only page overwrite, because src_page is acquired
>> via get_user_pages_fast() without the FOLL_WRITE flag, but is then
>> overwritten via memcpy
>
> Oof, yeah, that's bad. Adding FOLL_WRITE to kvm_gmem_populate() feels wrong, and
> could break uABI, but doing gup() in SNP code would reintroduce the AB-BA issue
> with filemap_invalidate_lock().
>
> Aha! Not if we use get_user_page_fast_only(). Ugh, but then we'd have to plumb
> the userspace address into the post-populated callback.
>
> Hrm. Given that no one has yelled about overwriting their CPUID page, and given
> that the CPUID page is likely dynamically created and thus is unlikely to be a
> read-only mapping (e.g. versus the initial image), maybe this?
>
Overwriting the CPUID page is by design, I think. IIUC if the SNP
firmware doesn't like something about the CPUID page, it can update
src_page and then return an error to userspace.
Userspace should then check if it agrees with the updated CPUID contents
and then retry if it agrees.
> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 37d4cfa5d980..c73c028d72c1 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -2456,6 +2456,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> sev_populate_args.type = params.type;
>
> count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
> + params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID,
I think this makes sense given that writing to src_page can only happen
when params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID (this is explicitly one
of the guards in sev_gmem_post_populate()):
/*
* If the firmware command failed handle the reclaim and cleanup of that
* PFN before reporting an error.
*
* Additionally, when invalid CPUID function entries are detected,
* firmware writes the expected values into the page and leaves it
* unencrypted so it can be used for debugging and error-reporting.
*
* Copy this page back into the source buffer so userspace can use this
* information to provide information on which CPUID leaves/fields
* failed CPUID validation.
*/
if (ret && !snp_page_reclaim(kvm, pfn) &&
sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
void *src_vaddr = kmap_local_page(src_page);
void *dst_vaddr = kmap_local_pfn(pfn);
memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);
kunmap_local(src_vaddr);
kunmap_local(dst_vaddr);
}
> sev_gmem_post_populate, &sev_populate_args);
> if (count < 0) {
> argp->error = sev_populate_args.fw_error;
> diff --git arch/x86/kvm/vmx/tdx.c arch/x86/kvm/vmx/tdx.c
> index f97bcf580e6d..33f35be4455b 100644
> --- arch/x86/kvm/vmx/tdx.c
> +++ arch/x86/kvm/vmx/tdx.c
> @@ -3188,7 +3188,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> };
> gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> u64_to_user_ptr(region.source_addr),
> - 1, tdx_gmem_post_populate, &arg);
> + 1, false, tdx_gmem_post_populate, &arg);
And TDX doesn't try to write src_page, so this is good too.
> if (gmem_ret < 0) {
> ret = gmem_ret;
> break;
> diff --git include/linux/kvm_host.h include/linux/kvm_host.h
> index 61a3430957f2..b83cda2870ba 100644
> --- include/linux/kvm_host.h
> +++ include/linux/kvm_host.h
> @@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> struct page *page, void *opaque);
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> + long npages, bool writable,
What do you think of need_writable_src instead of just writable for the
variable name?
> kvm_gmem_populate_cb post_populate, void *opaque);
> #endif
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index a35a55571a2d..6553d4e032ce 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> return ret;
> }
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> + long npages, bool writable,
> kvm_gmem_populate_cb post_populate, void *opaque)
> {
> struct kvm_memory_slot *slot;
> @@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>
> if (src) {
> unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
> + unsigned int flags = writable ? FOLL_WRITE : 0;
How about using FOLL_WRITE | FOLL_NOFAULT so if it weren't writable to
start with, don't CoW, just error out?
Like you said above the CPUID page provided as src_page would have been
written to before, so it should have been mapped as writable.
>
> - ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
> + ret = get_user_pages_fast(uaddr, 1, flags, &src_page);
If we stick with FOLL_WRITE, this also solves the case where a read-only
mapping or global zero page are provided as src_page, since
get_user_pages_fast() will do a copy-on-write if those were the inputs,
making it writable before the write happens (on failure) in
sev_gmem_post_populate().
> if (ret < 0)
> break;
> if (ret != 1) {
>
>> - an ordering violation with the kunmap_local() calls
>
> Yeesh, that's a new one for me. Thankfully this is 64-bit only, so it's not an
> issue.
>
>> These predate this patch series and are just being touched by the
>> 'src_page' addition, but if Sashiko's right, these should probably be
>> fixed sooner rather than later.
>
> Yeah, ditto with the offset wrapping case.
^ permalink raw reply
* Re: [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Steven Rostedt @ 2026-05-21 22:16 UTC (permalink / raw)
To: Yash Suthar
Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
skhan
In-Reply-To: <20260521192846.8306-1-yashsuthar983@gmail.com>
On Fri, 22 May 2026 00:58:46 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:
> Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
> lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
> uprobe_buffer_disable().
>
> BUG_ON() will crash the kernel. mutex_is_locked() only checks
> if any task holds lock,but not the caller task. lockdep_assert_held()
> also check current task for lock and no crash on true condition.
>
> Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
This looks good to me.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Masami, do you want to take this?
-- Steve
> ---
> kernel/trace/trace_uprobe.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 2cabf8a23ec5..aee0960d0cf7 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
> {
> int ret = 0;
>
> - BUG_ON(!mutex_is_locked(&event_mutex));
> + lockdep_assert_held(&event_mutex);
>
> if (uprobe_buffer_refcnt++ == 0) {
> ret = uprobe_buffer_init();
> @@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
> {
> int cpu;
>
> - BUG_ON(!mutex_is_locked(&event_mutex));
> + lockdep_assert_held(&event_mutex);
>
> if (--uprobe_buffer_refcnt == 0) {
> for_each_possible_cpu(cpu)
^ permalink raw reply
* [PATCH] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-21 22:35 UTC (permalink / raw)
To: LKML, Linux Trace Kernel, bpf
Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Heiko Carstens, Vasily Gorbik
From: Steven Rostedt <rostedt@goodmis.org>
Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.
Both system calls take a pointer to a new structure:
struct sframe_setup {
unsigned long sframe_start;
unsigned long sframe_size;
unsigned long text_start;
unsigned long text_size;
};
and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.
sframe_start - The virtual address of the sframe section
sframe_size - The length of the sframe section
text_start - the text section the sframe represents
test_size - the length of the section
If other stack tracing functionality is added, it will require a new
system call.
The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Based on top of Jens patches here:
https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
[ Note, I tested this with the same program from the RFC patch ]
Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
- Remove the ioctl() like system call for a unique system call for each
functionality. Right now there's two functionalities:
1. register sframe section
2. unregister sframe sections
- Added taking a lock around the mtree logic in __sframe_remove_section()
as Sashiko mentioned that there could be races from user space
registering and unregistering sframe sections at the same time.
- Removed [RFC] from subject as I believe this is more likely the way
this system call will be done.
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/arm64/tools/syscall_32.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 7 ++-
include/uapi/linux/sframe.h | 12 ++++
kernel/sys_ni.c | 3 +
kernel/unwind/sframe.c | 63 ++++++++++++++++++++-
scripts/syscall.tbl | 2 +
22 files changed, 118 insertions(+), 4 deletions(-)
create mode 100644 include/uapi/linux/sframe.h
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
579 common file_setattr sys_file_setattr
580 common listns sys_listns
581 common rseq_slice_yield sys_rseq_slice_yield
+582 common sframe_register sys_sframe_register
+583 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
469 n32 file_setattr sys_file_setattr
470 n32 listns sys_listns
471 n32 rseq_slice_yield sys_rseq_slice_yield
+472 n32 sframe_register sys_sframe_register
+473 n32 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
469 n64 file_setattr sys_file_setattr
470 n64 listns sys_listns
471 n64 rseq_slice_yield sys_rseq_slice_yield
+472 n64 sframe_register sys_sframe_register
+473 n64 sframe_unregister sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
469 o32 file_setattr sys_file_setattr
470 o32 listns sys_listns
471 o32 rseq_slice_yield sys_rseq_slice_yield
+472 o32 sframe_register sys_sframe_register
+473 o32 sframe_unregister sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 nospu rseq_slice_yield sys_rseq_slice_yield
+472 nospu sframe_register sys_sframe_register
+473 nospu sframe_unregister sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common stacktrace_setup sys_stacktrace_setup
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
469 i386 file_setattr sys_file_setattr
470 i386 listns sys_listns
471 i386 rseq_slice_yield sys_rseq_slice_yield
+472 i386 sframe_register sys_sframe_register
+473 i386 sframe_unregister sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register' sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..992ccc401c5e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
u32 size, u32 flags);
asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(void *data, unsigned int size);
+asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..137a2ebf91f4
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+ unsigned long sframe_start;
+ unsigned long sframe_size;
+ unsigned long text_start;
+ unsigned long text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
COND_SYSCALL(uretprobe);
COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..9956f1e3aba1 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <linux/syscalls.h>
#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
#include "sframe.h"
#include "sframe_debug.h"
@@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
- if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
- return -EINVAL;
+ scoped_guard(mmap_read_lock, mm) {
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
}
call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
mtree_destroy(&mm->sframe_mt);
}
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ return sframe_add_section(sframe.sframe_start,
+ sframe.sframe_start + sframe.sframe_size,
+ sframe.text_start,
+ sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
+{
+ struct sframe_setup sframe;
+
+ if (sizeof(sframe) != size)
+ return -EINVAL;
+
+ if (copy_from_user(&sframe, data, size))
+ return -EFAULT;
+
+ if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+ return -EINVAL;
+
+ return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common sframe_register sys_sframe_register
+473 common sframe_unregister sys_sframe_unregister
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Crystal Wood @ 2026-05-21 23:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
Wander Lairson Costa, sashiko-bot, sashiko-reviews
In-Reply-To: <20260520162810.3112a9d6@gandalf.local.home>
On Wed, 2026-05-20 at 16:28 -0400, Steven Rostedt wrote:
> [ Replying to Sashiko: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260511223035.1475676-1-crwood%40redhat.com ]
>
> > commit 859dc1eded9e36304f27e72643f28f30f327f078
> > Author: Crystal Wood <crwood@redhat.com>
> > tracing/osnoise: Array printk init and cleanup
> > This commit initializes the trace array buffer for osnoise instances on
> > creation. It also introduces the osnoise_print() helper function and switches
> > the tracer to use trace_array_printk() instead of trace_array_printk_buf().
> > > Use trace_array_printk() instead of trace_array_printk_buf(), as we're
> > > only writing to the main buffer (of a non-main instance) anyway -- and
>
> > Is this description accurate? Can the osnoise tracer also be enabled on the
> > global trace array, for example by writing to current_tracer?
> > If it is used on the global instance, trace_array_init_printk() explicitly
> > returns -EINVAL and refuses to allocate a buffer, and trace_array_printk()
> > will return 0 without printing.
>
> Hmm, I guess it would drop any of these prints if it's using the top level
> trace array. But I believe this is for use with RTLA which always creates a
> trace instance.
It doesn't *have* to be used with rtla, but you won't get the printks if
you use the main instance. The parenthetical was just meant to avoid
potential misunderstanding of the phrase "main buffer".
> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index 75678053b21c5..2be188768ab42 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -83,6 +83,22 @@ struct osnoise_instance {
> > >
> > > static struct list_head osnoise_instances;
> > >
> > > +static void osnoise_print(const char *fmt, ...)
> > > +{
> > > + struct osnoise_instance *inst;
> > > + struct trace_array *tr;
> > > + va_list ap;
> > > +
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > > + tr = inst->tr;
> > > + va_start(ap, fmt);
> > > + trace_array_vprintk(tr, _RET_IP_, fmt, ap);
>
> > Does this code create a use-after-free on the trace array if an instance is
> > removed concurrently?
If so, it was already an issue with osnoise_taint(),
osnoise_stop_tracing(), osnoise_stop_exception(), etc. Wouldn't be
surprising, as this file has a number of other synchronization issues as
well.
> > When a user deletes a trace instance via rmdir, the unregister function
> > removes the instance from the list using list_del_rcu(). However, the removal
> > routine does not appear to wait for an RCU grace period before freeing the
> > trace array itself.
> > Could a concurrent execution of this loop inside the rcu_read_lock() section
> > still access the unlinked instance, read the freed inst->tr, and pass it to
> > trace_array_vprintk()? This appears to be an existing issue, but it still
> > affects the loop here.
>
> Hmm, this is interesting. osnoise keeps track of its own instances via a
> osnoise_instances list. But it only use kfree_rcu() to free the list
> descriptor but doesn't take care of the tr being freed before hand!
>
> Something like this could work [not even compiled]
>
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 75678053b21c..bda1e0e0d2e1 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -476,8 +476,11 @@ static void print_osnoise_headers(struct seq_file *s)
> \
> rcu_read_lock(); \
> list_for_each_entry_rcu(inst, &osnoise_instances, list) { \
> + if (trace_array_get(inst->tr) < 0) \
> + continue; \
> buffer = inst->tr->array_buffer.buffer; \
> trace_array_printk_buf(buffer, _THIS_IP_, msg); \
> + trace_array_put(inst->tr); \
> } \
> rcu_read_unlock(); \
> osnoise_data.tainted = true; \
OK, I'll prepare a v3.
> > > @@ -1189,10 +1198,10 @@ static __always_inline void osnoise_stop_exception(char *msg, int cpu)
> > > rcu_read_lock();
> > > list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > > tr = inst->tr;
> > > - trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
> > > - "stop tracing hit on cpu %d due to exception: %s\n",
> > > - smp_processor_id(),
> > > - msg);
> > > + trace_array_printk(tr, _THIS_IP_,
> > > + "stop tracing hit on cpu %d due to exception: %s\n",
> > > + smp_processor_id(),
> > > + msg);
>
> > Does this switch to trace_array_printk() silently drop messages when tracing
> > on the global instance?
> > Looking at trace_array_printk(), it explicitly returns 0 if the
> > TRACE_ARRAY_FL_GLOBAL flag is set on the trace array:
> > kernel/trace/trace_printk.c:trace_array_printk() {
> > ...
> > if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> > return 0;
> > ...
> > }
> > This means stop and exception logs would be discarded if the tracer is
> > running on the global instance.
>
> Yep. But maybe that's not an issue?
I'd actually consider it a fix, if the policy is actually about not
allowing tracers to "spam" the main instance, rather than just avoiding
the percpu allocation. Especially for osnoise_stop_exception(), which
is called only one place, that already printed the same message with
osnoise_taint(). :-P
As I mentioned in the v1 patch, if trace_array_printk_buf() is going to
bypass the global instance check, should probably be internal to the
core trace code.
-Crystal
^ permalink raw reply
* [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-22 2:50 UTC (permalink / raw)
To: LKML, Linux trace kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
From: Steven Rostedt <rostedt@goodmis.org>
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.
But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118
And then use the raw numbers to dereference:
# echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.
# echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
# echo 1 > events/eprobes/xmit/enable
# cat trace
[..]
sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]
Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://patch.msgid.link/20260519130144.40e71a00@fedora
- Restructure parse_btf_arg() to only use typecast when TEVENT flag is set
(eprobes).
- Move found_type label to still test for (!type) in case ctx->last_struct
is NULL for some reason.
- Restructure the code to be more specific to when to use ctx->struct_btf
and use ctx->btf for places that the typecast will not be used.
- Move parse_trace_event() function outside of BTF #ifdef as it is used
outside of it as well.
- Move the code in the '(' switch statement into a helper function called
handle_typecast(). This will be a stub when BTF is not configured in.
- Have the ctx->last_struct and ctx->struct_btf only set within the
handle_typecast() function, and have that function reset it too.
- Add comments to the #else and #endif of the #ifdef
CONFIG_PROBE_EVENTS_BTF_ARGS to know what they are for.
Documentation/trace/eprobetrace.rst | 4 +
kernel/trace/trace_probe.c | 171 +++++++++++++++++++++++-----
kernel/trace/trace_probe.h | 7 +-
3 files changed, 152 insertions(+), 30 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and "bitfield" are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER. Note that when this is used, the FIELD name does not
+ need to be prefixed with a '$'.
Types
-----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..4a205cebda60 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,25 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
return -ENOENT;
}
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+ struct traceprobe_parse_context *ctx)
+{
+ int ret;
+
+ if (code->data)
+ return -EFAULT;
+ ret = parse_trace_event_arg(arg, code, ctx);
+ if (!ret)
+ return 0;
+ if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+ code->op = FETCH_OP_COMM;
+ return 0;
+ }
+ /* backward compatibility */
+ ctx->offset = 0;
+ return -EINVAL;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +395,17 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
&& BTF_INT_BITS(intdata) == 8;
}
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+ return ctx->flags & TPARG_FL_TYPECAST ?
+ ctx->struct_btf : ctx->btf;
+}
+
static int check_prepare_btf_string_fetch(char *typename,
struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
if (!btf || !ctx->last_type)
return 0;
@@ -554,22 +579,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
+ bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+ struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
s32 tid;
do {
- /* Outer loop for solving arrow operator ('->') */
- if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
- trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
- return -EINVAL;
- }
- /* Convert a struct pointer type to a struct type */
- type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
+ if (!is_struct) {
+ /* Outer loop for solving arrow operator ('->') */
+ if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ /* Convert a struct pointer type to a struct type */
+ type = btf_type_skip_modifiers(btf, type->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
}
+ /* Only the first type can skip being a pointer */
+ is_struct = false;
bitoffs = 0;
do {
@@ -580,7 +612,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return is_ptr;
anon_offs = 0;
- field = btf_find_struct_member(ctx->btf, type, fieldname,
+ field = btf_find_struct_member(btf, type, fieldname,
&anon_offs);
if (IS_ERR(field)) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +634,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
ctx->last_bitsize = 0;
}
- type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -627,6 +659,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return 0;
}
+
static int __store_entry_arg(struct trace_probe *tp, int argnum);
static int parse_btf_arg(char *varname,
@@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (WARN_ON_ONCE(!ctx->funcname))
+ if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +686,20 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (ctx->flags & TPARG_FL_TEVENT) {
+ int ret;
+
+ ret = parse_trace_event(varname, code, ctx);
+ if (ret < 0)
+ return ret;
+
+ if (ctx->flags & TPARG_FL_TYPECAST) {
+ type = ctx->last_struct;
+ goto found_type;
+ }
+ return 0;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void */
@@ -709,6 +756,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -727,7 +775,7 @@ static int parse_btf_arg(char *varname,
static const struct fetch_type *find_fetch_type_from_btf_type(
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
const char *typestr = NULL;
if (btf && ctx->last_type)
@@ -758,7 +806,70 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
return 0;
}
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+ int id;
+
+ if (!ctx->struct_btf) {
+ struct btf *btf;
+
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
+ } else {
+ id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
+ if (id < 0)
+ return id;
+ }
+
+ ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+ return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ char *tmp;
+ int ret;
+
+ /* Currently this only works for eprobes */
+ if (!(ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+ return -EINVAL;
+ }
+
+ tmp = strchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = query_btf_struct(arg + 1, ctx);
+ *tmp = ')';
+
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ ret = -EINVAL;
+ goto out_put;
+ }
+
+ ctx->flags |= TPARG_FL_TYPECAST;
+ tmp++;
+
+ ctx->offset += tmp - arg;
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->flags &= ~TPARG_FL_TYPECAST;
+ ctx->last_struct = NULL;
+out_put:
+ btf_put(ctx->struct_btf);
+ return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -794,7 +905,15 @@ static int check_prepare_btf_string_fetch(char *typename,
return 0;
}
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
@@ -953,18 +1072,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
int len;
if (ctx->flags & TPARG_FL_TEVENT) {
- if (code->data)
- return -EFAULT;
- ret = parse_trace_event_arg(arg, code, ctx);
- if (!ret)
- return 0;
- if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
- code->op = FETCH_OP_COMM;
- return 0;
- }
- /* backward compatibility */
- ctx->offset = 0;
- goto inval;
+ if (parse_trace_event(arg, code, ctx) < 0)
+ goto inval;
+ return 0;
}
if (str_has_prefix(arg, "retval")) {
@@ -1231,6 +1341,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
code->op = FETCH_OP_IMM;
}
break;
+ case '(':
+ ret = handle_typecast(arg, pcode, end, ctx);
+ break;
default:
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..952e3d7582b8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,6 +394,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
* TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
* TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
* TPARG_FL_KERNEL.
+ * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
*/
#define TPARG_FL_RETURN BIT(0)
#define TPARG_FL_KERNEL BIT(1)
@@ -402,6 +403,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
#define TPARG_FL_USER BIT(4)
#define TPARG_FL_FPROBE BIT(5)
#define TPARG_FL_TPOINT BIT(6)
+#define TPARG_FL_TYPECAST BIT(7)
#define TPARG_FL_LOC_MASK GENMASK(4, 0)
static inline bool tparg_is_function_entry(unsigned int flags)
@@ -422,7 +424,9 @@ struct traceprobe_parse_context {
const struct btf_param *params; /* Parameter of the function */
s32 nr_params; /* The number of the parameters */
struct btf *btf; /* The BTF to be used */
+ struct btf *struct_btf; /* The BTF to be used for structs */
const struct btf_type *last_type; /* Saved type */
+ const struct btf_type *last_struct; /* Saved structure */
u32 last_bitoffs; /* Saved bitoffs */
u32 last_bitsize; /* Saved bitsize */
struct trace_probe *tp;
@@ -563,7 +567,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
- C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
+ C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
#undef C
#define C(a, b) TP_ERR_##a
--
2.53.0
^ permalink raw reply related
* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-22 6:45 UTC (permalink / raw)
To: Wen Yang; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <7917085b-fef1-4017-b289-eefe84332ff5@linux.dev>
Hi Wen,
On Fri, 2026-05-22 at 01:40 +0800, Wen Yang wrote:
> Hi Gabriele,
>
> No specific reason for REL_SOFT — not intentional, reverting to
> REL_HARD.
>
> Reproduced the stall on the same config (PREEMPT_RT +
> PROVE_LOCKING/PROVE_RCU).
>
> Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc,
> unrelated to REL_SOFT/REL_HARD:
>
>
> # original cleanup — wrong order
>
> echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" # (A)
>
> kill "$hog_pid" # (B)
>
>
> (A)
> triggers synchronize_srcu() in the kernel. But tlob_target is stuck
> mid-uprobe_notify_resume holding an SRCU read lock, preempted by the
> FIFO-99 hog -> so the reader never finishes and (B) is never reached.
> rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
great you found the issue and solution. Wonder why lockdep wasn't more
informative, but probably the issue was so frequent to hog that too.
> Fix: kill the hog first:
>
>
> kill "$hog_pid"; wait "$hog_pid"
>
> echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"
>
>
> On the PREEMPT_RT it is more reliably triggered there because rcuc/0
> runs as a preemptible kthread rather than in softirq, making it easier
> for the hog to monopolise the CPU long enough to hit the stall.
>
> Thank you for the thorough review and valuable suggestions. We are
> working through all of them and running the full test suite.
> We expect to post v3 within the next two days.
Alright, sounds good.
Thanks,
Gabriele
^ permalink raw reply
* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: kernel test robot @ 2026-05-22 8:45 UTC (permalink / raw)
To: Rosen Penev
Cc: oe-lkp, lkp, linux-kernel, linux-trace-kernel, Steven Rostedt,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, oliver.sang
In-Reply-To: <20260520220030.16887-1-rosenp@gmail.com>
Hello,
kernel test robot noticed "BUG:KASAN:global-out-of-bounds_in_ftrace_find_rec_direct" on:
commit: 42ed22b1b4ae179c78e088477950eb1dd1dc9e90 ("[PATCH] ftrace: Use flexible array for hash buckets")
url: https://github.com/intel-lab-lkp/linux/commits/Rosen-Penev/ftrace-Use-flexible-array-for-hash-buckets/20260521-060312
base: https://git.kernel.org/cgit/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/all/20260520220030.16887-1-rosenp@gmail.com/
patch subject: [PATCH] ftrace: Use flexible array for hash buckets
in testcase: boot
config: x86_64-rhel-9.4-kunit
compiler: gcc-14
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202605221023.2bcc5f10-lkp@intel.com
[ 9.849798][ T1] BUG: KASAN: global-out-of-bounds in ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] Read of size 8 at addr ffffffff9675bd48 by task swapper/0/1
[ 9.849798][ T1]
[ 9.849798][ T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.1.0-rc3+ #1 PREEMPT(lazy)
[ 9.849798][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 9.849798][ T1] Call Trace:
[ 9.849798][ T1] <TASK>
[ 9.849798][ T1] dump_stack_lvl (dump_stack.c:94 dump_stack.c:120)
[ 9.849798][ T1] print_address_description+0x70/0x300
[ 9.849798][ T1] print_report (kasan/report.c:482)
[ 9.849798][ T1] ? __virt_addr_valid (linux/mmzone.h:2198 (discriminator 1) linux/mmzone.h:2280 (discriminator 1) x86/mm/physaddr.c:54 (discriminator 1))
[ 9.849798][ T1] ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] kasan_report (kasan/report.c:595)
[ 9.849798][ T1] ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] ? __pfx_trace_selftest_dynamic_test_func (??:?)
[ 9.849798][ T1] ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[ 9.849798][ T1] register_ftrace_direct (trace/ftrace.c:6057)
[ 9.849798][ T1] ? __pfx_ftrace_stub_direct_tramp (x86/kernel/ftrace_64.S:318)
[ 9.849798][ T1] trace_selftest_startup_function_graph (trace/trace_selftest.c:1139)
[ 9.849798][ T1] ? __pfx_trace_selftest_startup_function_graph (trace/trace_selftest.c:754)
[ 9.849798][ T1] run_tracer_selftest (trace/trace.c:1335)
[ 9.849798][ T1] register_tracer (trace/trace.c:1375 trace/trace.c:1492)
[ 9.849798][ T1] ? __pfx_init_graph_trace (trace/trace_functions_graph.c:1811)
[ 9.849798][ T1] do_one_initcall (main.c:1347)
[ 9.849798][ T1] ? __pfx_do_one_initcall (trace/events/initcall.h:10)
[ 9.849798][ T1] ? asm_sysvec_apic_timer_interrupt (x86/include/asm/idtentry.h:569)
[ 9.849798][ T1] ? ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 9.849798][ T1] do_initcalls (main.c:1409 (discriminator 1) main.c:1425 (discriminator 1))
[ 9.849798][ T1] kernel_init_freeable (main.c:1445 main.c:1658)
[ 9.849798][ T1] ? __pfx_kernel_init_freeable (main.c:1626)
[ 9.849798][ T1] ? __pfx_schedule_timeout (??:?)
[ 9.849798][ T1] ? __pfx__raw_spin_lock_irq (locking/spinlock.c:183)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] kernel_init (main.c:1548)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] ret_from_fork (x86/kernel/process.c:158)
[ 9.849798][ T1] ? __pfx_ret_from_fork (x86/include/asm/entry-common.h:54)
[ 9.849798][ T1] ? switch_fpu (linux/instrumented.h:82 asm-generic/bitops/instrumented-non-atomic.h:141 linux/thread_info.h:133 linux/sched.h:2066 x86/include/asm/fpu/sched.h:34)
[ 9.849798][ T1] ? __switch_to (x86/kernel/process_64.c:619)
[ 9.849798][ T1] ? __switch_to_asm (x86/entry/entry_64.S:206)
[ 9.849798][ T1] ? __pfx_kernel_init (main.c:717)
[ 9.849798][ T1] ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 9.849798][ T1] </TASK>
[ 9.849798][ T1]
[ 9.849798][ T1] The buggy address belongs to the variable:
[ 9.849798][ T1] empty_hash+0x28/0xa0
[ 9.849798][ T1]
[ 9.849798][ T1] The buggy address belongs to the physical page:
[ 9.849798][ T1] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x77d95b
[ 9.849798][ T1] flags: 0x17ffffc0002000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
[ 9.849798][ T1] raw: 0017ffffc0002000 ffffea001df656c8 ffffea001df656c8 0000000000000000
[ 9.849798][ T1] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 9.849798][ T1] page dumped because: kasan: bad access detected
[ 9.849798][ T1]
[ 9.849798][ T1] Memory state around the buggy address:
[ 9.849798][ T1] ffffffff9675bc00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ffffffff9675bc80: 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9 00 00 00 00
[ 9.849798][ T1] >ffffffff9675bd00: f9 f9 f9 f9 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9
[ 9.849798][ T1] ^
[ 9.849798][ T1] ffffffff9675bd80: 00 02 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ffffffff9675be00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9.849798][ T1] ==================================================================
[ 9.904835][ T1] Disabling lock debugging due to kernel taint
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260522/202605221023.2bcc5f10-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH] unwind: Add sframe_(un)register() system calls
From: Jens Remus @ 2026-05-22 9:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik
In-Reply-To: <20260521183532.7a145c8a@gandalf.local.home>
On 5/22/2026 12:35 AM, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.
Why two separate system calls? Can't that be one single stacktracectl?
Could they at least be non-sframe specific, e.g. stracktrace_register
and stracktrace_unregister, so that if one would implement e.g. unwind
user dwarf/eh_frame in the future one could pass ehframe_start and
ehframe_end in addition to sframe_start and sframe_end?
>
> Both system calls take a pointer to a new structure:
>
> struct sframe_setup {
> unsigned long sframe_start;
> unsigned long sframe_size;
> unsigned long text_start;
> unsigned long text_size;
> };
>
> and a size of the passed in structure. If the system call needs to be
> extended, then the structure could be changed and the size of that
> structure will tell the kernel that it is the new version. If the kernel
> does not recognize the structure size, it will return -EINVAL.
>
> sframe_start - The virtual address of the sframe section
> sframe_size - The length of the sframe section
> text_start - the text section the sframe represents
> test_size - the length of the section
>
> If other stack tracing functionality is added, it will require a new
> system call.
>
> The unregister only needs the sframe_start and requires all the rest of
> the fields to be 0. In the future, if more can be done, then user space
> can update the other values and check the return code to see if the kernel
> supports it.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>
> Based on top of Jens patches here:
>
> https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
>
> [ Note, I tested this with the same program from the RFC patch ]
>
> Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
>
> - Remove the ioctl() like system call for a unique system call for each
> functionality. Right now there's two functionalities:
> 1. register sframe section
> 2. unregister sframe sections
>
> - Added taking a lock around the mtree logic in __sframe_remove_section()
> as Sashiko mentioned that there could be races from user space
> registering and unregistering sframe sections at the same time.
Doesn't sframe_add_section() then also need likewise?
>
> - Removed [RFC] from subject as I believe this is more likely the way
> this system call will be done.
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> @@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
> asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
> u32 size, u32 flags);
> asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
> +asmlinkage long sys_sframe_register(void *data, unsigned int size);
> +asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SFRAME_H
> +#define _UAPI_LINUX_SFRAME_H
> +
> +struct sframe_setup {
> + unsigned long sframe_start;
> + unsigned long sframe_size;
> + unsigned long text_start;
> + unsigned long text_size;
> +};
> +
> +#endif /* _UAPI_LINUX_SFRAME_H */
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> @@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
> static int __sframe_remove_section(struct mm_struct *mm,
> struct sframe_section *sec)
> {
> - if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> - dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> - return -EINVAL;
> + scoped_guard(mmap_read_lock, mm) {
Why is a read lock sufficient? Doesn't that allow multiple readers?
How does that prevent a concurrent modification of the mm->sframe_mt?
> + if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> + dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> + return -EINVAL;
> + }
Is (or why not) likewise required in sframe_add_section() for the
mtree_insert_range()?
Wasn't the reported issue that while mt_for_each() in
sframe_remove_section() there could be concurrent mtree_erase() in
__sframe_remove_section() followed by mtree_insert_range() in
sframe_add_section(), so that the mt_for_each() could get confused?
> }
>
> call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
> @@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
>
> mtree_destroy(&mm->sframe_mt);
> }
> +
> +/**
> + * sys_sframe_register - register an address for user space stacktrace walking.
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + return sframe_add_section(sframe.sframe_start,
> + sframe.sframe_start + sframe.sframe_size,
> + sframe.text_start,
> + sframe.text_start + sframe.text_size);
> +}
> +
> +/**
> + * sys_sframe_unregister - unregister an sframe address
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * The data->sframe_start is the only value that is used. The rest must
> + * be zero.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
> +{
> + struct sframe_setup sframe;
> +
> + if (sizeof(sframe) != size)
> + return -EINVAL;
> +
> + if (copy_from_user(&sframe, data, size))
> + return -EFAULT;
> +
> + if (sframe.sframe_size || sframe.text_start || sframe.text_size)
> + return -EINVAL;
> +
> + return sframe_remove_section(sframe.sframe_start);
> +}
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply
* Re: [PATCH v3 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Greg Kroah-Hartman @ 2026-05-22 9:47 UTC (permalink / raw)
To: Praveen Talari
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jiri Slaby,
Konrad Dybcio, linux-kernel, linux-trace-kernel, linux-arm-msm,
linux-serial, Mukesh Kumar Savaliya, Aniket Randive,
chandana.chiluveru, jyothi.seerapu
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-serial-v3-2-b4addb151376@oss.qualcomm.com>
On Mon, May 18, 2026 at 11:26:56PM +0530, Praveen Talari wrote:
> Add tracing to the Qualcomm GENI serial driver to improve runtime
> observability.
>
> Trace hooks are added at key points including termios and clock
> configuration, manual control get/set, interrupt handling, and data
> TX/RX paths.
>
> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---
> v2->v3:
> - Updated commit text(removed example as it was available on cover
> letter).
> ---
> drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
> 1 file changed, 23 insertions(+), 4 deletions(-)
This patch did not apply to my tree :(
^ permalink raw reply
* Re: [PATCH v3 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Mark Brown @ 2026-05-19 13:13 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Praveen Talari
Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
mukesh.savaliya, aniket.randive, chandana.chiluveru,
jyothi.seerapu, Konrad Dybcio
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-spi-v3-0-7928f6810a79@oss.qualcomm.com>
On Mon, 18 May 2026 22:30:50 +0530, Praveen Talari wrote:
> Add trace events for Qualcomm GENI SPI drivers
>
> Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
> These trace events enable runtime debugging and performance analysis
> of SPI operations.
>
> The trace events capture SPI clock configuration, setup parameters,
> transfer details, interrupt status.
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-7.2
Thanks!
[1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
https://git.kernel.org/broonie/spi/c/bf62d130b1f2
[2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
https://git.kernel.org/broonie/spi/c/b5687af4af89
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
^ permalink raw reply
* [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Steven, all,
This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Problem
=======
With stacktrace enabled, each trace event stores a full kernel
stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8 MB trace buffers, this fills the
buffer in seconds, limiting the usefulness of boot-time tracing
and always-on performance monitoring.
Design
======
The implementation is a lock-free hash map modeled after
tracing_map.c, as suggested by Steven [2]:
- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- bounded probe length to keep worst-case lookup/insert cost bounded
- currently implemented for the global trace instance
The ring buffer stores only stack_id. Full stacks are exported via:
/sys/kernel/debug/tracing/stack_map
/sys/kernel/debug/tracing/stack_map_stat
/sys/kernel/debug/tracing/stack_map_bin
Reset semantics
===============
Reset is treated as a control-path operation and is only supported
when tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.
The reset path:
- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics
Why not reuse tracing_map.c
===========================
This series follows the same overall lock-free approach, but uses a
purpose-built structure. tracing_map.c is designed for histogram-style
aggregation with fixed-size keys and value fields, while this use case
needs variable-length stack storage plus reference counting.
Why not reuse BPF stackmap
==========================
BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.
Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.
Test results
============
Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.
Per-event payload (measured from tracing stats):
Event Full stack Stackmap Reduction
--------------------- ---------- -------- ---------
sched_switch 102 B/entry 48 B/entry -53%
kmem_cache_alloc 111 B/entry 44 B/entry -60%
In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:
- without stackmap: 43,950 retained entries
- with stackmap: 1,710,044 retained entries
During the same runs, the stackmap observed a few thousand unique
stacks and no drops.
Boot-time activation is also supported via:
trace_options=stackmap,stacktrace
Events that occur before stackmap initialization fall back to full
stack traces; later events are deduplicated. This transition does
not itself drop events, but early boot stacks recorded before
initialization are not deduplicated.
QEMU validation
===============
The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).
A post-init smoke test verified:
- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct
Changes since RFC v1
====================
- tightened reset semantics: reset now requires tracing to be stopped
and returns -EBUSY if tracing is active or another reset is in progress
- fixed publication/consumption ordering with smp_store_release() /
smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling
Known limitations
=================
- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
binary format settles.
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 145 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 21 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 66 ++
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 56 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 100 +++
tools/tracing/stackmap_dump.py | 150 ++++
12 files changed, 1237 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
^ permalink raw reply
* [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
To: linux-trace-kernel
Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
lkp
In-Reply-To: <20260522104017.1668638-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)
Counter naming:
- 'successes' counts events that were successfully assigned a
stack_id (covers both first-time inserts and dedup hits).
- 'drops' counts events that fell back to recording the full stack
(pool exhausted, probe limit reached, or reset in progress).
- 'success_rate' is successes / (successes + drops).
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported to keep the proof
obligations small.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false. The resetting
flag itself blocks subsequent get_id() callers; userspace
re-enabling tracing after our check still cannot let new
insertions through.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path, which runs preempt-disabled.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- Concurrent reset returns -EBUSY; reset while tracing is active
returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers (lookup, seq_show,
bin_open). seq_start/seq_next only check val for NULL and use
READ_ONCE().
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them. The
bounded probe length keeps per-event cost O(1).
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~130 MB and a stack_map_bin snapshot may briefly
allocate another ~130 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 21 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 56 +++
4 files changed, 721 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..2a63fd2c9a96 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,27 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..f1b6175099cc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..b23a60e9286c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,643 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ * resetting flag itself prevents new insertions even if userspace
+ * re-enables tracing immediately after the check.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val), but they are not coherent with
+ * a concurrent reset; since reset requires tracing to be stopped,
+ * mid-iteration reset can produce truncated or partial output but
+ * never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ atomic64_t successes; /* events served (hits + new inserts) */
+ atomic64_t drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+ BUG_ON(!is_power_of_2(smap->map_size));
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=16 this is 64K * sizeof(struct stackmap_elt). The
+ * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~33 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ atomic64_set(&smap->successes, 0);
+ atomic64_set(&smap->drops, 0);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 3. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 4. memset entries, elts, and counters.
+ * 5. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+ atomic_set(&smap->resetting, 0);
+ return -EBUSY;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ atomic64_set(&smap->successes, 0);
+ atomic64_set(&smap->drops, 0);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ if (!smap || !nr_entries || atomic_read(&smap->resetting))
+ return -EINVAL;
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ test_key = entry->key;
+
+ if (test_key == key_hash) {
+ /*
+ * smp_load_acquire pairs with smp_store_release in
+ * the publisher below; ensures we see fully-formed
+ * elt fields (nr, ips, ref_count) before dereference.
+ */
+ val = smp_load_acquire(&entry->val);
+ if (val && val->nr == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ atomic_inc(&val->ref_count);
+ atomic64_inc(&smap->successes);
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /* Free slot: try to claim it */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ atomic64_inc(&smap->drops);
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ atomic64_inc(&smap->successes);
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ atomic64_inc(&smap->drops);
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos; i < smap->map_size; i++) {
+ if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v) { }
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+ struct stackmap_seq_private *priv = m->private;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++)
+ seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u32 entries;
+ u64 successes, drops;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ successes = atomic64_read(&smap->successes);
+ drops = atomic64_read(&smap->drops);
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ successes * 100 / (successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=16 this caps at ~33 MB. The file is mode 0440
+ * (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!entry->key)
+ continue;
+ elt = smp_load_acquire(&entry->val);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++)
+ ips_out[k] = (u64)elt->ips[k];
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..da51ed919e2c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -ENOSYS; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox