linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces
@ 2025-08-28 18:03 Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 1/6] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
                   ` (5 more replies)
  0 siblings, 6 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell


This is the tracefs (ftrace) implementation of the deferred stack tracing.

This creates two new events that are recorded for every trace event when the
new "userstacktrace_delay" option is enabled. The first event is recorded at
the time of the request (right after the trace event) and saves the cookie
that represents the current user space stack trace. The cookie stays the same
while the task is in the kernel, as the user space stack doesn't change during
this time. The second event records the user space stacktrace along with the
cookie.
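
As a rough illustration of how a post-processing tool might pair the two
events, here is a minimal user space sketch. The struct pending_request type
and the match_requests() helper are hypothetical names made up for this
example and are not part of the series. The sketch assumes the tool has
already collected the cookie events into an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kept per user_unwind_cookie event. */
struct pending_request {
	uint64_t cookie;	/* cookie saved at request time */
	uint64_t timestamp;	/* time of the originating trace event */
};

/*
 * Count how many earlier requests are resolved by one stack event.
 * Every request carrying the same cookie refers to the same user
 * space stack, so one stack event can satisfy several requests.
 */
static size_t match_requests(const struct pending_request *reqs, size_t nr,
			     uint64_t stack_cookie)
{
	size_t matched = 0;

	for (size_t i = 0; i < nr; i++) {
		if (reqs[i].cookie == stack_cookie)
			matched++;
	}
	return matched;
}
```

A real tool would likely index the requests by cookie instead of scanning
linearly; the loop is kept simple here on purpose.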

Since the callback is called in faultable context, it uses this opportunity
to look at the addresses in the stacktrace and convert them to where
they would be in the executable file (if found). It also records
the inode and device major/minor numbers into the trace, so that post
processing can find the exact location where the stacks are.

To simplify the finding of the files, a new "inode_cache" event is created
that gets triggered whenever a new inode/device is added to a new rhashtable.
It will then look up the path that represents that inode/device via the vma
descriptor. To keep the recording down to a minimum, this event is only
triggered when a new inode/device is added to the rhashtable. The rhashtable
is reset when certain changes occur in the tracefs system so that new readers
of this event can get the latest changes.
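
The "only report an inode/device the first time it is seen" idea can be
sketched in user space as below. This is not the rhashtable code from the
series. It is a small fixed-size open-addressing set, and every name in it
(seen_set, seen_add(), seen_reset(), SEEN_SLOTS) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SEEN_SLOTS 64	/* hypothetical table size, must be a power of two */

struct seen_key {
	uint64_t inode;
	uint32_t dev;
	bool	 used;
};

struct seen_set {
	struct seen_key slots[SEEN_SLOTS];
};

/* Forget everything, so that later additions are reported again. */
static void seen_reset(struct seen_set *set)
{
	memset(set, 0, sizeof(*set));
}

/*
 * Return true only the first time an (inode, dev) pair is added.
 * That is the moment the caller would emit its "inode_cache" record.
 */
static bool seen_add(struct seen_set *set, uint64_t inode, uint32_t dev)
{
	uint64_t hash = (inode * 0x9e3779b97f4a7c15ULL) ^ dev;

	for (unsigned int i = 0; i < SEEN_SLOTS; i++) {
		struct seen_key *key = &set->slots[(hash + i) & (SEEN_SLOTS - 1)];

		if (!key->used) {
			key->inode = inode;
			key->dev = dev;
			key->used = true;
			return true;
		}
		if (key->inode == inode && key->dev == dev)
			return false;
	}
	/* Table full: this sketch simply treats the pair as already seen. */
	return false;
}
```

Calling seen_reset() plays the role of the reset described above: after it,
the next lookup of a known inode/device is reported as new again.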

Changes since v5: https://lore.kernel.org/linux-trace-kernel/20250424192456.851953422@goodmis.org/

- Removed unwind infrastructure patches as they have already been merged.

- Also add check for PF_USER_WORKER to test for kernel thread

- Have the userstacktrace_delay option not depend on the userstacktrace
  option.

- Do not expose the userstacktrace_delay option if it's not supported.

- Set inode to -1L if vma is not found for that address to let user space
  know that, and differentiate from a vdso section.

- Added "inode_cache" to display inode/device paths when added to a stack trace

Steven Rostedt (6):
      tracing: Do not bother getting user space stacktraces for kernel threads
      tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events
      tracing: Implement deferred user space stacktracing
      tracing: Have deferred user space stacktrace show file offsets
      tracing: Show inode and device major:minor in deferred user space stacktrace
      tracing: Add an event to map the inodes to their file names

----
 kernel/trace/Makefile            |   3 +
 kernel/trace/inode_cache.c       | 144 ++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c             | 146 ++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h             |  32 ++++++++-
 kernel/trace/trace_entries.h     |  38 ++++++++--
 kernel/trace/trace_export.c      |  25 ++++++-
 kernel/trace/trace_inode_cache.h |  42 +++++++++++
 kernel/trace/trace_output.c      |  99 ++++++++++++++++++++++++++
 8 files changed, 520 insertions(+), 9 deletions(-)
 create mode 100644 kernel/trace/inode_cache.c
 create mode 100644 kernel/trace/trace_inode_cache.h


* [PATCH v6 1/6] tracing: Do not bother getting user space stacktraces for kernel threads
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 2/6] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

If a user space stacktrace is requested when running a kernel thread, just
return. There is no point trying to get the user space stacktrace when
there is no user space.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192613.014380756@goodmis.org

- Also add check for PF_USER_WORKER to test for kernel thread

 kernel/trace/trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1b7db732c0b1..2cca29c9863d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3138,6 +3138,10 @@ ftrace_trace_userstack(struct trace_array *tr,
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE))
 		return;
 
+	/* No point doing user space stacktraces on kernel threads */
+	if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+		return;
+
 	/*
 	 * NMIs can not handle page faults, even with fix ups.
 	 * The save user stack can (and often does) fault.
-- 
2.50.1




* [PATCH v6 2/6] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 1/6] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 3/6] tracing: Implement deferred user space stacktracing Steven Rostedt
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

The ftrace events (like function, trace_print, etc) are created somewhat
manually and not via the TRACE_EVENT() or tracepoint magic macros. They have
their own macros.

The dynamic fields were created with __dynamic_array(), but the output
differs from that of the __dynamic_array() used by TRACE_EVENT().

The TRACE_EVENT() __dynamic_array() creates the field like:

	field:__data_loc u8[] v_data;   offset:120;     size:4; signed:0;

Whereas the ftrace event is created as:

	field:char buf[];       offset:12;      size:0; signed:0;

The difference is that the ftrace field is defined as the rest of the size
of the event saved in the ring buffer. TRACE_EVENT() doesn't have such a
dynamic field, and its version saves a word that holds the offset into the
event where the field is stored, as well as the size.
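
A minimal sketch of that word's layout, as used by TRACE_EVENT() dynamic
arrays: the low 16 bits hold the offset from the start of the event and the
high 16 bits hold the length in bytes. The helper names (data_loc_pack(),
data_loc_offset(), data_loc_len()) are hypothetical and exist only for this
illustration.

```c
#include <stdint.h>

/* Build a __data_loc style word from an offset and a byte length. */
static uint32_t data_loc_pack(uint16_t offset, uint16_t len)
{
	return (uint32_t)offset | ((uint32_t)len << 16);
}

/* Offset of the array data, relative to the start of the event. */
static uint16_t data_loc_offset(uint32_t loc)
{
	return loc & 0xffff;
}

/* Size of the array data in bytes. */
static uint16_t data_loc_len(uint32_t loc)
{
	return loc >> 16;
}
```

For example, an array of ten 8-byte entries stored 16 bytes into the event
would be described by data_loc_pack(16, 80).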

For consistency, rename the ftrace event macro to __dynamic_field(). This
way the ftrace event can also include a __dynamic_array() later that works
the same as the TRACE_EVENT() dynamic array.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.h         |  4 ++--
 kernel/trace/trace_entries.h | 10 +++++-----
 kernel/trace/trace_export.c  | 12 ++++++------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5f4bed5842f9..0fd2559ff119 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -92,8 +92,8 @@ enum trace_type {
 #undef __array_desc
 #define __array_desc(type, container, item, size)
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)	type	item[];
+#undef __dynamic_field
+#define __dynamic_field(type, item)	type	item[];
 
 #undef __rel_dynamic_array
 #define __rel_dynamic_array(type, item)	type	item[];
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index de294ae2c5c5..5cf80f6c704a 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -63,7 +63,7 @@ FTRACE_ENTRY_REG(function, ftrace_entry,
 	F_STRUCT(
 		__field_fn(	unsigned long,		ip		)
 		__field_fn(	unsigned long,		parent_ip	)
-		__dynamic_array( unsigned long,		args		)
+		__dynamic_field( unsigned long,		args		)
 	),
 
 	F_printk(" %ps <-- %ps",
@@ -81,7 +81,7 @@ FTRACE_ENTRY(funcgraph_entry, ftrace_graph_ent_entry,
 		__field_struct(	struct ftrace_graph_ent,	graph_ent	)
 		__field_packed(	unsigned long,	graph_ent,	func		)
 		__field_packed(	unsigned int,	graph_ent,	depth		)
-		__dynamic_array(unsigned long,	args				)
+		__dynamic_field(unsigned long,	args				)
 	),
 
 	F_printk("--> %ps (%u)", (void *)__entry->func, __entry->depth)
@@ -259,7 +259,7 @@ FTRACE_ENTRY(bprint, bprint_entry,
 	F_STRUCT(
 		__field(	unsigned long,	ip	)
 		__field(	const char *,	fmt	)
-		__dynamic_array(	u32,	buf	)
+		__dynamic_field(	u32,	buf	)
 	),
 
 	F_printk("%ps: %s",
@@ -272,7 +272,7 @@ FTRACE_ENTRY_REG(print, print_entry,
 
 	F_STRUCT(
 		__field(	unsigned long,	ip	)
-		__dynamic_array(	char,	buf	)
+		__dynamic_field(	char,	buf	)
 	),
 
 	F_printk("%ps: %s",
@@ -287,7 +287,7 @@ FTRACE_ENTRY(raw_data, raw_data_entry,
 
 	F_STRUCT(
 		__field(	unsigned int,	id	)
-		__dynamic_array(	char,	buf	)
+		__dynamic_field(	char,	buf	)
 	),
 
 	F_printk("id:%04x %08x",
diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
index 1698fc22afa0..d9d41e3ba379 100644
--- a/kernel/trace/trace_export.c
+++ b/kernel/trace/trace_export.c
@@ -57,8 +57,8 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef __array_desc
 #define __array_desc(type, container, item, size)	type item[size];
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)			type item[];
+#undef __dynamic_field
+#define __dynamic_field(type, item)			type item[];
 
 #undef F_STRUCT
 #define F_STRUCT(args...)				args
@@ -123,8 +123,8 @@ static void __always_unused ____ftrace_check_##name(void)		\
 #undef __array_desc
 #define __array_desc(_type, _container, _item, _len) __array(_type, _item, _len)
 
-#undef __dynamic_array
-#define __dynamic_array(_type, _item) {					\
+#undef __dynamic_field
+#define __dynamic_field(_type, _item) {					\
 	.type = #_type "[]", .name = #_item,				\
 	.size = 0, .align = __alignof__(_type),				\
 	is_signed_type(_type), .filter_type = FILTER_OTHER },
@@ -161,8 +161,8 @@ static struct trace_event_fields ftrace_event_fields_##name[] = {	\
 #undef __array_desc
 #define __array_desc(type, container, item, len)
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)
+#undef __dynamic_field
+#define __dynamic_field(type, item)
 
 #undef F_printk
 #define F_printk(fmt, args...) __stringify(fmt) ", "  __stringify(args)
-- 
2.50.1




* [PATCH v6 3/6] tracing: Implement deferred user space stacktracing
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 1/6] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 2/6] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 4/6] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

Use the unwind_deferred_*() interface to be able to trace deferred user
space stacks. This creates two new ftrace events:

  user_unwind_cookie
  user_unwind_stack

The user_unwind_cookie will record into the ring buffer the cookie given
from unwind_deferred_request(), and the user_unwind_stack will record into
the ring buffer the user space stack as well as the cookie associated with
it.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192613.356969984@goodmis.org

- Have the userstacktrace_delay option not depend on the userstacktrace
  option.

- Do not expose the userstacktrace_delay option if it's not supported.

 kernel/trace/trace.c         | 91 ++++++++++++++++++++++++++++++++++--
 kernel/trace/trace.h         | 20 ++++++++
 kernel/trace/trace_entries.h | 24 ++++++++++
 kernel/trace/trace_export.c  | 23 +++++++++
 kernel/trace/trace_output.c  | 72 ++++++++++++++++++++++++++++
 5 files changed, 227 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2cca29c9863d..e5b7db19aa53 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3128,6 +3128,66 @@ EXPORT_SYMBOL_GPL(trace_dump_stack);
 #ifdef CONFIG_USER_STACKTRACE_SUPPORT
 static DEFINE_PER_CPU(int, user_stack_count);
 
+static void trace_user_unwind_callback(struct unwind_work *unwind,
+				       struct unwind_stacktrace *trace,
+				       u64 ctx_cookie)
+{
+	struct trace_array *tr = container_of(unwind, struct trace_array, unwinder);
+	struct trace_buffer *buffer = tr->array_buffer.buffer;
+	struct userunwind_stack_entry *entry;
+	struct ring_buffer_event *event;
+	unsigned int trace_ctx;
+	unsigned long *caller;
+	unsigned int offset;
+	int len;
+	int i;
+
+	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
+		return;
+
+	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
+
+	trace_ctx = tracing_gen_ctx();
+	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_STACK,
+					    len, trace_ctx);
+	if (!event)
+		return;
+
+	entry	= ring_buffer_event_data(event);
+
+	entry->cookie = ctx_cookie;
+
+	offset = sizeof(*entry);
+	len = sizeof(unsigned long) * trace->nr;
+
+	entry->__data_loc_stack = offset | (len << 16);
+	caller = (void *)entry + offset;
+
+	for (i = 0; i < trace->nr; i++) {
+		caller[i] = trace->entries[i];
+	}
+
+	__buffer_unlock_commit(buffer, event);
+}
+
+static void
+ftrace_trace_userstack_delay(struct trace_array *tr,
+			     struct trace_buffer *buffer, unsigned int trace_ctx)
+{
+	struct userunwind_cookie_entry *entry;
+	struct ring_buffer_event *event;
+
+	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_COOKIE,
+					    sizeof(*entry), trace_ctx);
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+
+	unwind_deferred_request(&tr->unwinder, &entry->cookie);
+
+	__buffer_unlock_commit(buffer, event);
+}
+
 static void
 ftrace_trace_userstack(struct trace_array *tr,
 		       struct trace_buffer *buffer, unsigned int trace_ctx)
@@ -3135,13 +3195,18 @@ ftrace_trace_userstack(struct trace_array *tr,
 	struct ring_buffer_event *event;
 	struct userstack_entry *entry;
 
-	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE))
-		return;
-
 	/* No point doing user space stacktraces on kernel threads */
 	if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
 		return;
 
+	if (tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY) {
+		ftrace_trace_userstack_delay(tr, buffer, trace_ctx);
+		return;
+	}
+
+	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE))
+		return;
+
 	/*
 	 * NMIs can not handle page faults, even with fix ups.
 	 * The save user stack can (and often does) fault.
@@ -5215,6 +5280,17 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
 	return 0;
 }
 
+static int update_unwind_deferred(struct trace_array *tr, int enabled)
+{
+	if (enabled) {
+		return unwind_deferred_init(&tr->unwinder,
+					    trace_user_unwind_callback);
+	} else {
+		unwind_deferred_cancel(&tr->unwinder);
+		return 0;
+	}
+}
+
 int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 {
 	if ((mask == TRACE_ITER_RECORD_TGID) ||
@@ -5251,6 +5327,12 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 		}
 	}
 
+	if (mask == TRACE_ITER_USERSTACKTRACE_DELAY) {
+		int ret = update_unwind_deferred(tr, enabled);
+		if (ret < 0)
+			return ret;
+	}
+
 	if (mask == TRACE_ITER_COPY_MARKER)
 		update_marker_trace(tr, enabled);
 
@@ -10002,6 +10084,9 @@ static int __remove_instance(struct trace_array *tr)
 	if (tr->ref > 1 || (tr->current_trace && tr->trace_ref))
 		return -EBUSY;
 
+	if ((tr->flags & TRACE_ITER_USERSTACKTRACE_DELAY))
+		unwind_deferred_cancel(&tr->unwinder);
+
 	list_del(&tr->list);
 
 	/* Disable all the flags that were enabled coming in */
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 0fd2559ff119..940107ba618a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/clocksource.h>
 #include <linux/ring_buffer.h>
+#include <linux/unwind_deferred.h>
 #include <linux/mmiotrace.h>
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
@@ -49,7 +50,10 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_GRAPH_RETADDR_ENT,
 	TRACE_USER_STACK,
+	/* trace-cmd manually adds blktrace after USER_STACK */
 	TRACE_BLK,
+	TRACE_USER_UNWIND_STACK,
+	TRACE_USER_UNWIND_COOKIE,
 	TRACE_BPUTS,
 	TRACE_HWLAT,
 	TRACE_OSNOISE,
@@ -92,6 +96,9 @@ enum trace_type {
 #undef __array_desc
 #define __array_desc(type, container, item, size)
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)	u32	__data_loc_##item;
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)	type	item[];
 
@@ -435,6 +442,7 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+	struct unwind_work	unwinder;
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -526,6 +534,9 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct ctx_switch_entry, 0);	\
 		IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK);	\
 		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
+		IF_ASSIGN(var, ent, struct userunwind_stack_entry, TRACE_USER_UNWIND_STACK);\
+		IF_ASSIGN(var, ent, struct userunwind_cookie_entry, TRACE_USER_UNWIND_COOKIE);\
+		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
 		IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT);	\
 		IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT);	\
 		IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS);	\
@@ -1359,6 +1370,14 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_UNWIND_USER
+# define USERSTACK_DELAY					\
+	C(USERSTACKTRACE_DELAY,	"userstacktrace_delay"),
+#else
+# define USERSTACK_DELAY
+# define TRACE_ITER_USERSTACKTRACE_DELAY		0
+#endif
+
 /*
  * trace_iterator_flags is an enumeration that defines bit
  * positions into trace_flags that controls the output.
@@ -1379,6 +1398,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(PRINTK,		"trace_printk"),	\
 		C(ANNOTATE,		"annotate"),		\
 		C(USERSTACKTRACE,	"userstacktrace"),	\
+		USERSTACK_DELAY					\
 		C(SYM_USEROBJ,		"sym-userobj"),		\
 		C(PRINTK_MSGONLY,	"printk-msg-only"),	\
 		C(CONTEXT_INFO,		"context-info"),   /* Print pid/cpu/time */ \
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 5cf80f6c704a..40dc53ead0a8 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -249,6 +249,30 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+FTRACE_ENTRY(user_unwind_stack, userunwind_stack_entry,
+
+	TRACE_USER_UNWIND_STACK,
+
+	F_STRUCT(
+		__field(		u64,		cookie	)
+		__dynamic_array(	unsigned long,	stack	)
+	),
+
+	F_printk("cookie=%lld\n%s", __entry->cookie,
+		 __print_dynamic_array(stack, sizeof(unsigned long)))
+);
+
+FTRACE_ENTRY(user_unwind_cookie, userunwind_cookie_entry,
+
+	TRACE_USER_UNWIND_COOKIE,
+
+	F_STRUCT(
+		__field(		u64,		cookie	)
+	),
+
+	F_printk("cookie=%lld", __entry->cookie)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
index d9d41e3ba379..831999f84e2c 100644
--- a/kernel/trace/trace_export.c
+++ b/kernel/trace/trace_export.c
@@ -57,6 +57,9 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef __array_desc
 #define __array_desc(type, container, item, size)	type item[size];
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)			u32 __data_loc_##item;
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)			type item[];
 
@@ -66,6 +69,16 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef F_printk
 #define F_printk(fmt, args...) fmt, args
 
+/* Only used for ftrace event format output */
+static inline char * __print_dynamic_array(int array, size_t size)
+{
+	return NULL;
+}
+
+#undef __print_dynamic_array
+#define __print_dynamic_array(array, el_size)				\
+	__print_dynamic_array(__entry->__data_loc_##array, el_size)
+
 #undef FTRACE_ENTRY
 #define FTRACE_ENTRY(name, struct_name, id, tstruct, print)		\
 struct ____ftrace_##name {						\
@@ -74,6 +87,7 @@ struct ____ftrace_##name {						\
 static void __always_unused ____ftrace_check_##name(void)		\
 {									\
 	struct ____ftrace_##name *__entry = NULL;			\
+	struct trace_seq __maybe_unused *p = NULL;			\
 									\
 	/* force compile-time check on F_printk() */			\
 	printk(print);							\
@@ -123,6 +137,12 @@ static void __always_unused ____ftrace_check_##name(void)		\
 #undef __array_desc
 #define __array_desc(_type, _container, _item, _len) __array(_type, _item, _len)
 
+#undef __dynamic_array
+#define __dynamic_array(_type, _item) {					\
+	.type = "__data_loc " #_type "[]", .name = #_item,		\
+	.size = 4, .align = __alignof__(4),				\
+	is_signed_type(_type), .filter_type = FILTER_OTHER },
+
 #undef __dynamic_field
 #define __dynamic_field(_type, _item) {					\
 	.type = #_type "[]", .name = #_item,				\
@@ -161,6 +181,9 @@ static struct trace_event_fields ftrace_event_fields_##name[] = {	\
 #undef __array_desc
 #define __array_desc(type, container, item, len)
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 97db0b0ccf3e..9489537533f7 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1404,6 +1404,58 @@ static struct trace_event trace_stack_event = {
 };
 
 /* TRACE_USER_STACK */
+static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *iter,
+						int flags, struct trace_event *event)
+{
+	struct userunwind_stack_entry *field;
+	struct trace_seq *s = &iter->seq;
+	unsigned long *caller;
+	unsigned int offset;
+	unsigned int len;
+	unsigned int caller_cnt;
+	unsigned int i;
+
+	trace_assign_type(field, iter->ent);
+
+	trace_seq_puts(s, "<user stack unwind>\n");
+
+	trace_seq_printf(s, "cookie=%llx\n", field->cookie);
+
+	/* The stack field is a dynamic pointer */
+	offset = field->__data_loc_stack;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	caller_cnt = len / sizeof(*caller);
+
+	caller = (void *)iter->ent + offset;
+
+	for (i = 0; i < caller_cnt; i++) {
+		unsigned long ip = caller[i];
+
+		if (!ip || trace_seq_has_overflowed(s))
+			break;
+
+		trace_seq_puts(s, " => ");
+		seq_print_user_ip(s, NULL, ip, flags);
+		trace_seq_putc(s, '\n');
+	}
+
+	return trace_handle_return(s);
+}
+
+static enum print_line_t trace_user_unwind_cookie_print(struct trace_iterator *iter,
+						 int flags, struct trace_event *event)
+{
+	struct userunwind_cookie_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+
+	trace_seq_printf(s, "cookie=%llx\n", field->cookie);
+
+	return trace_handle_return(s);
+}
+
 static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
 						int flags, struct trace_event *event)
 {
@@ -1447,6 +1499,24 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
 	return trace_handle_return(s);
 }
 
+static struct trace_event_functions trace_userunwind_stack_funcs = {
+	.trace		= trace_user_unwind_stack_print,
+};
+
+static struct trace_event trace_userunwind_stack_event = {
+	.type		= TRACE_USER_UNWIND_STACK,
+	.funcs		= &trace_userunwind_stack_funcs,
+};
+
+static struct trace_event_functions trace_userunwind_cookie_funcs = {
+	.trace		= trace_user_unwind_cookie_print,
+};
+
+static struct trace_event trace_userunwind_cookie_event = {
+	.type		= TRACE_USER_UNWIND_COOKIE,
+	.funcs		= &trace_userunwind_cookie_funcs,
+};
+
 static struct trace_event_functions trace_user_stack_funcs = {
 	.trace		= trace_user_stack_print,
 };
@@ -1846,6 +1916,8 @@ static struct trace_event *events[] __initdata = {
 	&trace_ctx_event,
 	&trace_wake_event,
 	&trace_stack_event,
+	&trace_userunwind_cookie_event,
+	&trace_userunwind_stack_event,
 	&trace_user_stack_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
-- 
2.50.1




* [PATCH v6 4/6] tracing: Have deferred user space stacktrace show file offsets
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-08-28 18:03 ` [PATCH v6 3/6] tracing: Implement deferred user space stacktracing Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
  2025-08-28 18:03 ` [PATCH v6 6/6] tracing: Add an event to map the inodes to their file names Steven Rostedt
  5 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

Instead of showing the raw addresses of the user space stack trace, which
depend on wherever the kernel mapped the code, show the offsets of where
they would be in the file.

Instead of:

       trace-cmd-1066    [007] .....    67.770256: <user stack unwind>
cookie=7000000000009
 =>  <00007fdbd0d421ca>
 =>  <00007fdbd0f3be27>
 =>  <00005635ece557e7>
 =>  <00005635ece559d3>
 =>  <00005635ece56523>
 =>  <00005635ece6479d>
 =>  <00005635ece64b01>
 =>  <00005635ece64bc0>
 =>  <00005635ece53b7e>
 =>  <00007fdbd0c6bca8>

These are the addresses of the functions in the virtual address space of
the process. Have it record:

       trace-cmd-1090    [003] .....   180.779876: <user stack unwind>
cookie=3000000000009
 =>  <00000000001001ca>
 =>  <000000000000ae27>
 =>  <00000000000107e7>
 =>  <00000000000109d3>
 =>  <0000000000011523>
 =>  <000000000001f79d>
 =>  <000000000001fb01>
 =>  <000000000001fbc0>
 =>  <000000000000eb7e>
 =>  <0000000000029ca8>

These are the offsets into the files that the code was mapped from. To find
them, the mmap_read_lock is taken and the vma is looked up for each
address. What is recorded is then simply:

  (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
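
As a worked example of that formula, here is a small user space sketch. The
struct sketch_vma type, the file_offset() helper and the 4 KiB page size are
assumptions made for illustration; in the kernel the values come from
struct vm_area_struct and PAGE_SHIFT.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the vma fields the formula uses. */
struct sketch_vma {
	uint64_t vm_start;	/* first address of the mapping */
	uint64_t vm_end;	/* first address past the mapping */
	uint64_t vm_pgoff;	/* offset of the mapping in the file, in pages */
};

/* Translate a virtual address inside the vma into a file offset. */
static uint64_t file_offset(const struct sketch_vma *vma, uint64_t addr)
{
	return (addr - vma->vm_start) + (vma->vm_pgoff << SKETCH_PAGE_SHIFT);
}
```

With an invented mapping that starts at 0x5635ece4a000 and has vm_pgoff = 5,
the address 0x5635ece557e7 gives 0xb7e7 + 0x5000 = 0x107e7.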

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e5b7db19aa53..3e9ef644dd64 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3136,18 +3136,27 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	struct trace_buffer *buffer = tr->array_buffer.buffer;
 	struct userunwind_stack_entry *entry;
 	struct ring_buffer_event *event;
+	struct mm_struct *mm = current->mm;
 	unsigned int trace_ctx;
+	struct vm_area_struct *vma = NULL;
 	unsigned long *caller;
 	unsigned int offset;
 	int len;
 	int i;
 
+	/* This should never happen */
+	if (!mm)
+		return;
+
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
 		return;
 
 	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
 
 	trace_ctx = tracing_gen_ctx();
+
+	guard(mmap_read_lock)(mm);
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_STACK,
 					    len, trace_ctx);
 	if (!event)
@@ -3164,7 +3173,16 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	caller = (void *)entry + offset;
 
 	for (i = 0; i < trace->nr; i++) {
-		caller[i] = trace->entries[i];
+		unsigned long addr = trace->entries[i];
+
+		if (!vma || addr < vma->vm_start || addr >= vma->vm_end)
+			vma = vma_lookup(mm, addr);
+
+		if (!vma) {
+			caller[i] = addr;
+			continue;
+		}
+		caller[i] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
 	}
 
 	__buffer_unlock_commit(buffer, event);
-- 
2.50.1




* [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-08-28 18:03 ` [PATCH v6 4/6] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  2025-08-28 18:39   ` Linus Torvalds
  2025-08-28 18:03 ` [PATCH v6 6/6] tracing: Add an event to map the inodes to their file names Steven Rostedt
  5 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

Since the deferred user space stacktrace event already does a lookup of the
vma for each address in the trace to get the file offset for those
addresses, it can also report the file itself.

Add two more arrays to the user space stacktrace event. One for the inode
number, and the other to store the device major:minor number. Now the
output looks like this:

       trace-cmd-1108    [007] .....   240.253487: <user stack unwind>
cookie=7000000000009
 =>  <00000000001001ca> : 1340007 : 254:3
 =>  <000000000000ae27> : 1308548 : 254:3
 =>  <00000000000107e7> : 1440347 : 254:3
 =>  <00000000000109d3> : 1440347 : 254:3
 =>  <0000000000011523> : 1440347 : 254:3
 =>  <000000000001f79d> : 1440347 : 254:3
 =>  <000000000001fb01> : 1440347 : 254:3
 =>  <000000000001fbc0> : 1440347 : 254:3
 =>  <000000000000eb7e> : 1440347 : 254:3
 =>  <0000000000029ca8> : 1340007 : 254:3

User space tooling can use this information to get the actual functions
from the files.
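
As a rough illustration of how a user space tool might consume the text
output above, here is a minimal sketch that decodes one frame line into its
file offset, inode number and device major:minor. The struct and function
names (frame, parse_frame) are made up for this example and are not part of
the patch; a real tool would more likely read the binary event through
libtraceevent than parse text.

```c
#include <assert.h>
#include <stdio.h>

/* One decoded frame of the textual <user stack unwind> output. */
struct frame {
	unsigned long long offset;	/* file offset of the address */
	long long inode;		/* -1 means no vma was found */
	unsigned int major;		/* device major number */
	unsigned int minor;		/* device minor number */
};

/*
 * Hypothetical helper: returns 1 if @line looks like
 * " =>  <offset> : inode : major:minor" and fills @f, 0 otherwise.
 */
static int parse_frame(const char *line, struct frame *f)
{
	assert(line && f);
	return sscanf(line, " => <%llx> : %lld : %u:%u",
		      &f->offset, &f->inode, &f->major, &f->minor) == 4;
}
```

The inode is parsed as a signed value because the print handler uses %ld,
so the "no vma found" marker shows up as -1 in the text.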

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192613.869730948@goodmis.org

- Set inode to -1L if vma is not found for that address to let user space
  know that, and differentiate from a vdso section.

 kernel/trace/trace.c         | 26 +++++++++++++++++++++++++-
 kernel/trace/trace_entries.h |  8 ++++++--
 kernel/trace/trace_output.c  | 27 +++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 3e9ef644dd64..c6e1471e4615 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3140,6 +3140,8 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	unsigned int trace_ctx;
 	struct vm_area_struct *vma = NULL;
 	unsigned long *caller;
+	unsigned long *inodes;
+	unsigned int *devs;
 	unsigned int offset;
 	int len;
 	int i;
@@ -3151,7 +3153,8 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
 		return;
 
-	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
+	len = trace->nr * (sizeof(unsigned long) * 2 + sizeof(unsigned int))
+			   + sizeof(*entry);
 
 	trace_ctx = tracing_gen_ctx();
 
@@ -3172,6 +3175,15 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	entry->__data_loc_stack = offset | (len << 16);
 	caller = (void *)entry + offset;
 
+	offset += len;
+	entry->__data_loc_inodes = offset | (len << 16);
+	inodes = (void *)entry + offset;
+
+	offset += len;
+	len = sizeof(unsigned int) * trace->nr;
+	entry->__data_loc_dev = offset | (len << 16);
+	devs = (void *)entry + offset;
+
 	for (i = 0; i < trace->nr; i++) {
 		unsigned long addr = trace->entries[i];
 
@@ -3180,9 +3192,21 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 
 		if (!vma) {
 			caller[i] = addr;
+			/* Use -1 to denote no vma found */
+			inodes[i] = -1L;
+			devs[i] = 0;
 			continue;
 		}
+
 		caller[i] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
+
+		if (vma->vm_file && vma->vm_file->f_inode) {
+			inodes[i] = vma->vm_file->f_inode->i_ino;
+			devs[i] = vma->vm_file->f_inode->i_sb->s_dev;
+		} else {
+			inodes[i] = 0;
+			devs[i] = 0;
+		}
 	}
 
 	__buffer_unlock_commit(buffer, event);
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 40dc53ead0a8..5f7b72359901 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -256,10 +256,14 @@ FTRACE_ENTRY(user_unwind_stack, userunwind_stack_entry,
 	F_STRUCT(
 		__field(		u64,		cookie	)
 		__dynamic_array(	unsigned long,	stack	)
+		__dynamic_array(	unsigned long,	inodes	)
+		__dynamic_array(	unsigned int,	dev	)
 	),
 
-	F_printk("cookie=%lld\n%s", __entry->cookie,
-		 __print_dynamic_array(stack, sizeof(unsigned long)))
+	F_printk("cookie=%lld\n%s%s%s", __entry->cookie,
+		 __print_dynamic_array(stack, sizeof(unsigned long)),
+		 __print_dynamic_array(inodes, sizeof(unsigned long)),
+		 __print_dynamic_array(dev, sizeof(unsigned int)))
 );
 
 FTRACE_ENTRY(user_unwind_cookie, userunwind_cookie_entry,
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 9489537533f7..437e5f23b73d 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1410,9 +1410,13 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 	struct userunwind_stack_entry *field;
 	struct trace_seq *s = &iter->seq;
 	unsigned long *caller;
+	unsigned long *inodes;
+	unsigned int *devs;
 	unsigned int offset;
 	unsigned int len;
 	unsigned int caller_cnt;
+	unsigned int inode_cnt;
+	unsigned int dev_cnt;
 	unsigned int i;
 
 	trace_assign_type(field, iter->ent);
@@ -1429,6 +1433,21 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 
 	caller = (void *)iter->ent + offset;
 
+	/* The inodes and devices are also dynamic pointers */
+	offset = field->__data_loc_inodes;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	inode_cnt = len / sizeof(*inodes);
+
+	inodes = (void *)iter->ent + offset;
+
+	offset = field->__data_loc_dev;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	dev_cnt = len / sizeof(*devs);
+
+	devs = (void *)iter->ent + offset;
+
 	for (i = 0; i < caller_cnt; i++) {
 		unsigned long ip = caller[i];
 
@@ -1437,6 +1456,14 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 
 		trace_seq_puts(s, " => ");
 		seq_print_user_ip(s, NULL, ip, flags);
+
+		if (i < inode_cnt) {
+			trace_seq_printf(s, " : %ld", inodes[i]);
+			if (i < dev_cnt) {
+				trace_seq_printf(s, " : %d:%d",
+						 MAJOR(devs[i]), MINOR(devs[i]));
+			}
+		}
 		trace_seq_putc(s, '\n');
 	}
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 6/6] tracing: Add an event to map the inodes to their file names
  2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-08-28 18:03 ` [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
@ 2025-08-28 18:03 ` Steven Rostedt
  5 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

The userstacktrace_delay stack trace shows for each frame of the listed
stack, the address in the file of the code, the inode number of the file,
and the device number of the file. This can be used by a user space tool
to find exactly where the stack walk is in the source code. But the issue
with this is that it also requires the tool to find the application on
disk from its device number and inode. This can take a bit of time.

The output of the userstacktrace_delay looks like this:

       trace-cmd-1053    [007] .....  1290.400226: <user stack unwind>
cookie=300000008
 =>  <000000000008f687> : 1340007 : 254:3
 =>  <0000000000014560> : 1488818 : 254:3
 =>  <000000000001f94a> : 1488818 : 254:3
 =>  <000000000001fc9e> : 1488818 : 254:3
 =>  <000000000001fcfa> : 1488818 : 254:3
 =>  <000000000000ebae> : 1488818 : 254:3
 =>  <0000000000029ca8> : 1340007 : 254:3

To help out, create an "inode_cache" that maps the device/inode to the path
of the file. Use a rhashtable to store the device/inode as a key, and
every time a new inode is added, it triggers a trace event that prints the
device, inode and the path. A tool can use this trace event to find the
paths without having to look for them on the device:

 trace-cmd start -B map -e inode_cache
[..]
 trace-cmd show -B map
[..]
       trace-cmd-1053    [007] ...1.  1290.400956: inode_cache: inode=1340007 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libc.so.6
       trace-cmd-1053    [007] ...1.  1290.401175: inode_cache: inode=1488818 dev=[254:3] path=/usr/local/bin/trace-cmd
       trace-cmd-1053    [007] ...1.  1290.401249: inode_cache: inode=1308544 dev=[254:3] path=/usr/local/lib64/libtracefs.so.1.8.2
       trace-cmd-1053    [007] ...1.  1290.401288: inode_cache: inode=1319848 dev=[254:3] path=/usr/local/lib64/libtraceevent.so.1.8.4
       trace-cmd-1053    [007] ...1.  1290.401338: inode_cache: inode=1311769 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7
            bash-1044    [006] ...1.  1290.402620: inode_cache: inode=1308405 dev=[254:3] path=/usr/bin/bash
            bash-1044    [006] ...1.  1293.945511: inode_cache: inode=1309170 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5
       trace-cmd-1054    [001] ...1.  1293.956178: inode_cache: inode=1339989 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
            less-1055    [000] ...1.  1293.962161: inode_cache: inode=1309556 dev=[254:3] path=/usr/bin/less
       trace-cmd-1054    [001] ...1.  1293.963118: inode_cache: inode=1309303 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1
  NetworkManager-592     [000] ...1.  1296.802760: inode_cache: inode=1310774 dev=[254:3] path=/usr/sbin/NetworkManager
   systemd-udevd-323     [002] ...1.  1327.342209: inode_cache: inode=1308579 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
    sshd-session-1041    [001] ...1.  1352.996159: inode_cache: inode=1570224 dev=[254:3] path=/usr/lib/openssh/sshd-session

The event is only triggered when a new inode/device combination is added to
the rhashtable. To help make sure a new trace can read this event, the cache
is cleared every time tracing starts or stops, and on some other changes to
the tracing system, so that it will show the paths again.
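
A tool could collect these events into a small table and resolve each
(inode, device) pair from the stack trace to a path. The sketch below is only
an illustration under assumptions: the names (file_map, add_inode_cache_line,
lookup_path) and the fixed-size array are invented for the example, and it
parses the text form of the event shown above rather than the binary record.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_FILES	64	/* arbitrary limit for this sketch */
#define MAX_PATH_LEN	256

/* One inode/device to path mapping taken from an inode_cache event. */
struct file_map {
	unsigned long long inode;
	unsigned int major, minor;
	char path[MAX_PATH_LEN];
};

static struct file_map files[MAX_FILES];
static int nr_files;

/* Record one "inode_cache:" text line; returns 1 if stored, 0 otherwise. */
static int add_inode_cache_line(const char *line)
{
	struct file_map *m;
	const char *p;

	assert(line);
	p = strstr(line, "inode_cache: ");
	if (!p || nr_files >= MAX_FILES)
		return 0;

	m = &files[nr_files];
	/* %255s stops at white space, so paths with spaces get truncated */
	if (sscanf(p, "inode_cache: inode=%llu dev=[%u:%u] path=%255s",
		   &m->inode, &m->major, &m->minor, m->path) != 4)
		return 0;

	nr_files++;
	return 1;
}

/* Linear search is enough for a sketch; returns NULL if not found. */
static const char *lookup_path(unsigned long long inode,
			       unsigned int major, unsigned int minor)
{
	for (int i = 0; i < nr_files; i++) {
		if (files[i].inode == inode && files[i].major == major &&
		    files[i].minor == minor)
			return files[i].path;
	}
	return NULL;
}
```

Because the kernel clears its cache when tracing starts or stops, a tool
following this approach would probably want to reset its own table at the
same points.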

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/Makefile            |   3 +
 kernel/trace/inode_cache.c       | 144 +++++++++++++++++++++++++++++++
 kernel/trace/trace.c             |  15 +++-
 kernel/trace/trace.h             |  10 +++
 kernel/trace/trace_inode_cache.h |  42 +++++++++
 5 files changed, 212 insertions(+), 2 deletions(-)
 create mode 100644 kernel/trace/inode_cache.c
 create mode 100644 kernel/trace/trace_inode_cache.h

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index dcb4e02afc5f..c13f8ec48dc2 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
 obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o
+obj-$(CONFIG_UNWIND_USER) += inode_cache.o
 ifeq ($(CONFIG_BLOCK),y)
 obj-$(CONFIG_EVENT_TRACING) += blktrace.o
 endif
@@ -110,4 +111,6 @@ obj-$(CONFIG_FPROBE_EVENTS) += trace_fprobe.o
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 obj-$(CONFIG_RV) += rv/
 
+CFLAGS_inode_cache.o := -I$(src)
+
 libftrace-y := ftrace.o
diff --git a/kernel/trace/inode_cache.c b/kernel/trace/inode_cache.c
new file mode 100644
index 000000000000..bf177f7a5dad
--- /dev/null
+++ b/kernel/trace/inode_cache.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2025 Google, author: Steven Rostedt <rostedt@goodmis.org>
+ */
+#include <linux/rhashtable.h>
+#include "trace.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace_inode_cache.h"
+
+struct inode_cache_key {
+	unsigned long		inode_nr;
+	unsigned long		dev_nr;
+};
+
+struct inode_cache {
+	struct rhash_head	rh;
+	struct inode_cache_key	key;
+};
+
+struct inode_cache_hash {
+	struct rhashtable	rhash;
+	struct rcu_head		rcu;
+};
+
+static const struct rhashtable_params inode_cache_params = {
+	.nelem_hint		= 32,
+	.key_len		= sizeof(struct inode_cache_key),
+	.key_offset		= offsetof(struct inode_cache, key),
+	.head_offset		= offsetof(struct inode_cache, rh),
+};
+
+static DEFINE_MUTEX(inode_cache_mutex);
+static struct inode_cache_hash	*imhash;
+
+static void free_inode_cache(void *ptr, void *arg)
+{
+	kfree(ptr);
+}
+
+static const char *get_vma_name(struct vm_area_struct *vma, char *buf, int size)
+{
+	struct anon_vma_name *anon_name = anon_vma_name(vma);
+	const struct path *path;
+
+	if (anon_name)
+		return anon_name->name;
+
+	path = file_user_path(vma->vm_file);
+
+	return d_path(path, buf, size);
+}
+
+#define PATH_BUF_SZ 128
+
+static void print_inode_vma(struct vm_area_struct *vma,
+			 unsigned long inode, unsigned int dev)
+{
+	static char buf[PATH_BUF_SZ];
+	const char *name;
+
+	lockdep_assert_held(&inode_cache_mutex);
+
+	name = get_vma_name(vma, buf, PATH_BUF_SZ);
+
+	trace_inode_cache(inode, dev, name);
+}
+
+/**
+ * trace_inode_cache_add - Add an inode/dev to the cache and trigger path trace
+ * @vma: The vma that maps to the inode/dev
+ * @inode: The inode number of the vma->vm_file
+ * @dev: The device number of the vma->vm_file
+ *
+ * This is used to trigger the inode_cache trace event when a new inode/dev
+ * is added. This only gets called when that trace event is active.
+ * Whenever an inode/dev is added to the userstacktrace, this function
+ * gets called with the associated @vma and if it wasn't added before, it
+ * triggers the trace event that will write the @inode, @dev and lookup
+ * the file it is associated with. This can be used by user space tools to
+ * map the inode/dev in the userspace stack traces to their corresponding
+ * files.
+ *
+ * This gets reset when certain events happen in the tracefs system, such as,
+ * enabling or disabling tracing, or enabling or disabling the deferred user
+ * space stack tracing. This is done to not miss events.
+ */
+void trace_inode_cache_add(struct vm_area_struct *vma,
+			 unsigned long inode, unsigned int dev)
+{
+	struct inode_cache_hash *rht = READ_ONCE(imhash);
+	struct inode_cache_key key;
+	struct inode_cache *item;
+
+	if (!vma->vm_file)
+		return;
+
+	key.inode_nr = inode;
+	key.dev_nr = dev;
+
+	/* First check if the inode, dev exist already */
+	if (rht && rhashtable_lookup_fast(&rht->rhash, &key, inode_cache_params) != NULL)
+		return;
+
+	guard(mutex)(&inode_cache_mutex);
+
+	rht = imhash;
+
+	/* Make sure it wasn't added between the lookup and taking the lock */
+	if (rht && rhashtable_lookup_fast(&rht->rhash, &key, inode_cache_params) != NULL)
+		return;
+
+	if (!rht) {
+		rht = kmalloc(sizeof(*rht), GFP_KERNEL);
+		if (!rht)
+			goto print;
+		if (rhashtable_init(&rht->rhash, &inode_cache_params) < 0) {
+			kfree(rht);
+			goto print;
+		}
+		imhash = rht;
+	}
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item)
+		goto print;
+
+	item->key = key;
+
+	rhashtable_insert_fast(&rht->rhash, &item->rh, inode_cache_params);
+
+ print:
+	print_inode_vma(vma, inode, dev);
+}
+
+void trace_inode_cache_reset(void)
+{
+	guard(mutex)(&inode_cache_mutex);
+	if (!imhash)
+		return;
+	rhashtable_free_and_destroy(&imhash->rhash, free_inode_cache, NULL);
+	kfree_rcu(imhash, rcu);
+	imhash = NULL;
+}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c6e1471e4615..983b885fee88 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_inode_cache.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -3203,6 +3204,9 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 		if (vma->vm_file && vma->vm_file->f_inode) {
 			inodes[i] = vma->vm_file->f_inode->i_ino;
 			devs[i] = vma->vm_file->f_inode->i_sb->s_dev;
+
+			if (trace_inode_cache_enabled())
+				trace_inode_cache_add(vma, inodes[i], devs[i]);
 		} else {
 			inodes[i] = 0;
 			devs[i] = 0;
@@ -4979,9 +4983,10 @@ static int tracing_open(struct inode *inode, struct file *file)
 			trace_buf = &tr->max_buffer;
 #endif
 
-		if (cpu == RING_BUFFER_ALL_CPUS)
+		if (cpu == RING_BUFFER_ALL_CPUS) {
 			tracing_reset_online_cpus(trace_buf);
-		else
+			trace_inode_cache_reset();
+		} else
 			tracing_reset_cpu(trace_buf, cpu);
 	}
 
@@ -5324,6 +5329,7 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
 
 static int update_unwind_deferred(struct trace_array *tr, int enabled)
 {
+	trace_inode_cache_reset();
 	if (enabled) {
 		return unwind_deferred_init(&tr->unwinder,
 					    trace_user_unwind_callback);
@@ -6041,6 +6047,7 @@ tracing_set_trace_read(struct file *filp, char __user *ubuf,
 int tracer_init(struct tracer *t, struct trace_array *tr)
 {
 	tracing_reset_online_cpus(&tr->array_buffer);
+	trace_inode_cache_reset();
 	return t->init(tr);
 }
 
@@ -7518,6 +7525,7 @@ int tracing_set_clock(struct trace_array *tr, const char *clockstr)
 	 * Reset the buffer so that it doesn't have incomparable timestamps.
 	 */
 	tracing_reset_online_cpus(&tr->array_buffer);
+	trace_inode_cache_reset();
 
 #ifdef CONFIG_TRACER_MAX_TRACE
 	if (tr->max_buffer.buffer)
@@ -9478,6 +9486,9 @@ rb_simple_write(struct file *filp, const char __user *ubuf,
 	if (ret)
 		return ret;
 
+	/* Clear the inode cache whenever tracing starts or stops */
+	trace_inode_cache_reset();
+
 	if (buffer) {
 		guard(mutex)(&trace_types_lock);
 		if (!!val == tracer_tracing_is_on(tr)) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 940107ba618a..d04563f088bf 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -450,6 +450,16 @@ struct trace_array {
 	bool ring_buffer_expanded;
 };
 
+#ifdef CONFIG_UNWIND_USER
+void trace_inode_cache_add(struct vm_area_struct *vma,
+			 unsigned long inode, unsigned int dev);
+void trace_inode_cache_reset(void);
+#else
+static inline void trace_inode_cache_add(struct vm_area_struct *vma,
+				       unsigned long inode, unsigned int dev) {}
+static inline void trace_inode_cache_reset(void) {}
+#endif
+
 enum {
 	TRACE_ARRAY_FL_GLOBAL		= BIT(0),
 	TRACE_ARRAY_FL_BOOT		= BIT(1),
diff --git a/kernel/trace/trace_inode_cache.h b/kernel/trace/trace_inode_cache.h
new file mode 100644
index 000000000000..3a71d0104fbb
--- /dev/null
+++ b/kernel/trace/trace_inode_cache.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifdef CONFIG_UNWIND_USER
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM inode_cache
+
+#if !defined(_TRACE_inode_cache_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_inode_cache_H
+
+TRACE_EVENT(inode_cache,
+
+	TP_PROTO(unsigned long inode, unsigned int dev, const char *path),
+
+	TP_ARGS(inode, dev, path),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	inode		)
+		__field(	unsigned int,	dev		)
+		__string(	path,		path		)
+	),
+	TP_fast_assign(
+		__entry->inode = inode;
+		__entry->dev = dev;
+		__assign_str(path);
+	),
+	TP_printk("inode=%lu dev=[%u:%u] path=%s",
+		  __entry->inode, MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __get_str(path))
+);
+
+
+#endif /* if !defined(_TRACE_inode_cache_H) || defined(TRACE_HEADER_MULTI_READ) */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_inode_cache
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+#else /* CONFIG_UNWIND_USER */
+static inline bool trace_inode_cache_enabled(void) { return false; }
+#endif /* !CONFIG_UNWIND_USER */
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 18:03 ` [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
@ 2025-08-28 18:39   ` Linus Torvalds
  2025-08-28 18:58     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 18:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 11:05, Steven Rostedt <rostedt@kernel.org> wrote:
>
> The deferred user space stacktrace event already does a lookup of the vma
> for each address in the trace to get the file offset for those addresses,
> it can also report the file itself.

That sounds like a good idea..

But the implementation absolutely sucks:

> Add two more arrays to the user space stacktrace event. One for the inode
> number, and the other to store the device major:minor number. Now the
> output looks like this:

WTF? Why are you back in the 1960's? What's next? The index into the
paper card deck?

Stop using inode numbers and device numbers already. It's the 21st
century. No, cars still don't fly, but dammit, inode numbers were a
great idea back in the days, but they are not acceptable any more.

They *particularly* aren't acceptable when you apparently think that
they are 'unsigned long'.  Yes, that's the internal representation we
use for inode indexing, but for example on nfs the inode is actually
bigger. It's exposed to user space as a u64 through

        stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));

so the inode that user space sees in 'struct stat' (a) doesn't
actually match inode->i_ino, and (b) isn't even the full file ID that
NFS actually uses.

Let's not let that 60's thinking be any part of a new interface.

Give the damn thing an actual filename or something *useful*, not a
number that user space can't even necessarily match up to anything.

              Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 18:39   ` Linus Torvalds
@ 2025-08-28 18:58     ` Arnaldo Carvalho de Melo
  2025-08-28 19:02       ` Mathieu Desnoyers
  2025-08-28 19:18       ` Linus Torvalds
  0 siblings, 2 replies; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-28 18:58 UTC (permalink / raw)
  To: Linus Torvalds, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell



On August 28, 2025 3:39:35 PM GMT-03:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Thu, 28 Aug 2025 at 11:05, Steven Rostedt <rostedt@kernel.org> wrote:
>>
>> The deferred user space stacktrace event already does a lookup of the vma
>> for each address in the trace to get the file offset for those addresses,
>> it can also report the file itself.
>
>That sounds like a good idea..
>
>But the implementation absolutely sucks:
>
>> Add two more arrays to the user space stacktrace event. One for the inode
>> number, and the other to store the device major:minor number. Now the
>> output looks like this:
>
>WTF? Why are you back in the 1960's? What's next? The index into the
>paper card deck?
>
>Stop using inode numbers and device numbers already. It's the 21st
>century. No, cars still don't fly, but dammit, inode numbers were a
>great idea back in the days, but they are not acceptable any more.
>
>They *particularly* aren't acceptable when you apparently think that
>they are 'unsigned long'.  Yes, that's the internal representation we
>use for inode indexing, but for example on nfs the inode is actually
>bigger. It's exposed to user space as a u64 through
>
>        stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
>
>so the inode that user space sees in 'struct stat' (a) doesn't
>actually match inode->i_ino, and (b) isn't even the full file ID that
>NFS actually uses.
>
>Let's not let that 60's thinking be any part of a new interface.
>
>Give the damn thing an actual filename or something *useful*, not a
>number that user space can't even necessarily match up to anything.
>

A build ID?

PERF_RECORD_MMAP went thru this, filename ->  inode -> Content based hash

- Arnaldo 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 18:58     ` Arnaldo Carvalho de Melo
@ 2025-08-28 19:02       ` Mathieu Desnoyers
  2025-08-28 19:18       ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Mathieu Desnoyers @ 2025-08-28 19:02 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Linus Torvalds, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On 2025-08-28 14:58, Arnaldo Carvalho de Melo wrote:
> 
> 
> On August 28, 2025 3:39:35 PM GMT-03:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Thu, 28 Aug 2025 at 11:05, Steven Rostedt <rostedt@kernel.org> wrote:
>>>
>>> The deferred user space stacktrace event already does a lookup of the vma
>>> for each address in the trace to get the file offset for those addresses,
>>> it can also report the file itself.
>>
>> That sounds like a good idea..
>>
>> But the implementation absolutely sucks:
>>
>>> Add two more arrays to the user space stacktrace event. One for the inode
>>> number, and the other to store the device major:minor number. Now the
>>> output looks like this:
>>
>> WTF? Why are you back in the 1960's? What's next? The index into the
>> paper card deck?
>>
>> Stop using inode numbers and device numbers already. It's the 21st
>> century. No, cars still don't fly, but dammit, inode numbers were a
>> great idea back in the days, but they are not acceptable any more.
>>
>> They *particularly* aren't acceptable when you apparently think that
>> they are 'unsigned long'.  Yes, that's the internal representation we
>> use for inode indexing, but for example on nfs the inode is actually
>> bigger. It's exposed to user space as a u64 through
>>
>>         stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
>>
>> so the inode that user space sees in 'struct stat' (a) doesn't
>> actually match inode->i_ino, and (b) isn't even the full file ID that
>> NFS actually uses.
>>
>> Let's not let that 60's thinking be any part of a new interface.
>>
>> Give the damn thing an actual filename or something *useful*, not a
>> number that user space can't even necessarily match up to anything.
>>
> 
> A build ID?
> 
> PERF_RECORD_MMAP went thru this, filename ->  inode -> Content based hash

FWIW, we record:

- executable or shared library path name,
- build id (if available),
- debug link (if available),

in LTTng-UST when we dump the loaded executable and libraries from
userspace.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 18:58     ` Arnaldo Carvalho de Melo
  2025-08-28 19:02       ` Mathieu Desnoyers
@ 2025-08-28 19:18       ` Linus Torvalds
  2025-08-28 20:04         ` Arnaldo Carvalho de Melo
  2025-08-28 20:17         ` Steven Rostedt
  1 sibling, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 19:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 11:58, Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
> >
> >Give the damn thing an actual filename or something *useful*, not a
> >number that user space can't even necessarily match up to anything.
>
> A build ID?

I think that's a better thing than the disgusting inode number, yes.

That said, I think they are problematic too, in that I don't think
they are universally available, so if you want to trace some
executable without build ids - and there are good reasons to do that -
you might hate being limited that way.

So I think you'd be much better off with just actual pathnames.

Are there no trace events for "mmap this path"? Create a good u64 hash
from the contents of a 'struct path' (which is just two pointers: the
dentry and the mnt) when mmap'ing the file, and then you can just
associate the stack trace entry with that hash.

That should be simple and straightforward, and hashing two pointers
should be simple and straightforward.

And then matching that hash against the mmap event where the actual
path was saved off gives you an actual *pathname*. Which is *so* much
better than those horrific inode numbers.

And yes, yes, obviously filenames can go away and aren't some kind of
long-term stable thing. But inode numbers can be re-used too, so
that's no different.

With the "create a hash of 'struct path' contents" you basically have
an ID that can be associated with whatever the file name was at the
time it was mmap'ed into the thing you are tracing, which is I think
what you really want anyway.

Now, what would be even simpler is to not create a hash at all, but
simply just create the whole pathname when the stack trace entry is
created. But it would probably waste too much space, since you'd
probably want to have at least 32 bytes (as opposed to just 64 bits)
for a (truncated) pathname.

And it would be more expensive than just hashing the dentry/mnt
pointers, although '%pD' isn't actually *that* expensive. But probably
expensive enough to not really be acceptable. I'm just throwing it out
as a stupid idea that at least generates much more usable output than
the inode numbers do.
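
To make the "hash two pointers into a u64" idea concrete, here is a small
user space sketch. Everything in it is an assumption made for illustration:
struct path_ids stands in for the kernel's struct path, and the mixing step
is the splitmix64 finalizer, chosen only because it is short and well known.
It is not what any posted patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct path: just the mnt and dentry pointers. */
struct path_ids {
	const void *mnt;
	const void *dentry;
};

/* splitmix64 finalizer: a cheap 64-bit bit mixer. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Fold both pointers into one 64-bit identifier. */
static uint64_t path_hash(const struct path_ids *p)
{
	uint64_t h;

	assert(p);
	h = mix64((uint64_t)(uintptr_t)p->mnt);
	return mix64(h ^ (uint64_t)(uintptr_t)p->dentry);
}
```

One caveat with this sketch: the mixer is invertible, so the result would
reveal the pointer values to anyone reading the trace. In the kernel a keyed
hash (siphash, for example) would presumably be a better fit for that reason.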

          Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 19:18       ` Linus Torvalds
@ 2025-08-28 20:04         ` Arnaldo Carvalho de Melo
  2025-08-28 20:37           ` Linus Torvalds
  2025-08-28 20:17         ` Steven Rostedt
  1 sibling, 1 reply; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-28 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, Aug 28, 2025 at 12:18:39PM -0700, Linus Torvalds wrote:
> On Thu, 28 Aug 2025 at 11:58, Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote:

> > >Give the damn thing an actual filename or something *useful*, not a
> > >number that user space can't even necessarily match up to anything.

> > A build ID?
 
> I think that's a better thing than the disgusting inode number, yes.

> That said, I think they are problematic too, in that I don't think
> they are universally available, so if you want to trace some
> executable without build ids - and there are good reasons to do that -
> you might hate being limited that way.

Right, but these days gdb (and other traditional tools) support build IDs
and download the matching debug info (perf should do that too, after a
one-time sticky question; it already does so unconditionally in some cases,
which should be fixed as well), and most distros have them:

⬢ [acme@toolbx perf-tools-next]$ file /bin/bash
/bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a, for GNU/Linux 3.2.0, stripped
⬢ [acme@toolbx perf-tools-next]$

We have debuginfod servers that serve ELF images with debug info keyed by
that build ID, and finally build IDs come together with pathnames, so if
one is null, fall back to the other.

Default in fedora:

⬢ [acme@toolbx perf-tools-next]$ echo $DEBUGINFOD_
$DEBUGINFOD_IMA_CERT_PATH  $DEBUGINFOD_URLS           
⬢ [acme@toolbx perf-tools-next]$ echo $DEBUGINFOD_
$DEBUGINFOD_IMA_CERT_PATH  $DEBUGINFOD_URLS           
⬢ [acme@toolbx perf-tools-next]$ echo $DEBUGINFOD_IMA_CERT_PATH 
/etc/keys/ima:
⬢ [acme@toolbx perf-tools-next]$ echo $DEBUGINFOD_URLS 
https://debuginfod.fedoraproject.org/
⬢ [acme@toolbx perf-tools-next]$

I wasn't aware of that IMA stuff.

So even without the mandate, and even though a build-id sometimes can't
be obtained, most of the time they are there and deterministically allow
tooling to fetch the matching ELF image. I guess that is as far as we
can pragmatically get.

- Arnaldo
 
> So I think you'd be much better off with just actual pathnames.
> 
> Are there no trace events for "mmap this path"? Create a good u64 hash
> from the contents of a 'struct path' (which is just two pointers: the
> dentry and the mnt) when mmap'ing the file, and then you can just
> associate the stack trace entry with that hash.
> 
> That should be simple and straightforward, and hashing two pointers
> should be simple and straightforward.
> 
> And then matching that hash against the mmap event where the actual
> path was saved off gives you an actual *pathname*. Which is *so* much
> better than those horrific inode numbers.
> 
> And yes, yes, obviously filenames can go away and aren't some kind of
> long-term stable thing. But inode numbers can be re-used too, so
> that's no different.
> 
> With the "create a hash of 'struct path' contents" you basically have
> an ID that can be associated with whatever the file name was at the
> time it was mmap'ed into the thing you are tracing, which is I think
> what you really want anyway.
> 
> Now, what would be even simpler is to not create a hash at all, but
> simply just create the whole pathname when the stack trace entry is
> created. But it would probably waste too much space, since you'd
> probably want to have at least 32 bytes (as opposed to just 64 bits)
> for a (truncated) pathname.
> 
> And it would be more expensive than just hashing the dentry/mnt
> pointers, although '%pD' isn't actually *that* expensive. But probably
> expensive enough to not really be acceptable. I'm just throwing it out
> as a stupid idea that at least generates much more usable output than
> the inode numbers do.
> 
>           Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 19:18       ` Linus Torvalds
  2025-08-28 20:04         ` Arnaldo Carvalho de Melo
@ 2025-08-28 20:17         ` Steven Rostedt
  2025-08-28 20:27           ` Arnaldo Carvalho de Melo
  2025-08-28 20:38           ` Linus Torvalds
  1 sibling, 2 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 12:18:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 28 Aug 2025 at 11:58, Arnaldo Carvalho de Melo
> <arnaldo.melo@gmail.com> wrote:
> > >
> > >Give the damn thing an actual filename or something *useful*, not a
> > >number that user space can't even necessarily match up to anything.  
> >
> > A build ID?  
> 
> I think that's a better thing than the disgusting inode number, yes.

I don't care what it is. I picked inode/device just because it was the
only thing I saw available. I'm not sure build ID is appropriate either.

> 
> That said, I think they are problematic too, in that I don't think
> they are universally available, so if you want to trace some
> executable without build ids - and there are good reasons to do that -
> you might hate being limited that way.
> 
> So I think you'd be much better off with just actual pathnames.

As you mentioned below, the reason I avoided path names is that they
take up too much of the ring buffer, and would be duplicated all over
the place. I've run this for a while, and it only picked up a couple of
hundred paths while the trace had several thousand stack traces.

> 
> Are there no trace events for "mmap this path"? Create a good u64 hash
> from the contents of a 'struct path' (which is just two pointers: the
> dentry and the mnt) when mmap'ing the file, and then you can just
> associate the stack trace entry with that hash.

I would love to have a hash to use. The next patch does the mapping of
the inode numbers to their path name. It can easily be switched over to
do the same with a hash number.

> 
> That should be simple and straightforward, and hashing two pointers
> should be simple and straightforward.

Would a hash of these pointers have any collisions? That would be bad.

Hmm, I just tried using the pointer to vma->vm_file->f_inode, and that
gives me a unique number. Then I just need to map that back to the path name:

       trace-cmd-1016    [002] ...1.    34.675646: inode_cache: inode=ffff8881007ed428 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libc.so.6
       trace-cmd-1016    [002] ...1.    34.675893: inode_cache: inode=ffff88811970e648 dev=[254:3] path=/usr/local/lib64/libtracefs.so.1.8.2
       trace-cmd-1016    [002] ...1.    34.675933: inode_cache: inode=ffff88811970b8f8 dev=[254:3] path=/usr/local/lib64/libtraceevent.so.1.8.4
       trace-cmd-1016    [002] ...1.    34.675981: inode_cache: inode=ffff888110b78ba8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7
            bash-1007    [003] ...1.    34.677316: inode_cache: inode=ffff888103f05d38 dev=[254:3] path=/usr/bin/bash
            bash-1007    [003] ...1.    35.432951: inode_cache: inode=ffff888116be94b8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5
            bash-1018    [005] ...1.    36.104543: inode_cache: inode=ffff8881007e9dc8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
            bash-1018    [005] ...1.    36.110407: inode_cache: inode=ffff888110b78298 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1
            bash-1018    [005] ...1.    36.110536: inode_cache: inode=ffff888103d09dc8 dev=[254:3] path=/usr/local/bin/trace-cmd

I just swapped out the inode with the above (unsigned long)vma->vm_file->f_inode,
and it appears to be unique.

Thus, I could use that as the "hash" value and then the above could be turned into:

       trace-cmd-1016    [002] ...1.    34.675646: inode_cache: hash=ffff8881007ed428 path=/usr/lib/x86_64-linux-gnu/libc.so.6
       trace-cmd-1016    [002] ...1.    34.675893: inode_cache: hash=ffff88811970e648 path=/usr/local/lib64/libtracefs.so.1.8.2
       trace-cmd-1016    [002] ...1.    34.675933: inode_cache: hash=ffff88811970b8f8 path=/usr/local/lib64/libtraceevent.so.1.8.4
       trace-cmd-1016    [002] ...1.    34.675981: inode_cache: hash=ffff888110b78ba8 path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7
            bash-1007    [003] ...1.    34.677316: inode_cache: hash=ffff888103f05d38 path=/usr/bin/bash
            bash-1007    [003] ...1.    35.432951: inode_cache: hash=ffff888116be94b8 path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5
            bash-1018    [005] ...1.    36.104543: inode_cache: hash=ffff8881007e9dc8 path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
            bash-1018    [005] ...1.    36.110407: inode_cache: hash=ffff888110b78298 path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1
            bash-1018    [005] ...1.    36.110536: inode_cache: hash=ffff888103d09dc8 path=/usr/local/bin/trace-cmd

This would mean the readers of the userstacktrace_delay need to also
have this event enabled to do the mappings. But that shouldn't be an
issue.
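
For illustration only, a reader-side sketch of that mapping step. It
assumes the event text looks exactly like the "hash=... path=..." lines
above; the function name parse_inode_cache_line() and the PATH_BUF size
are made up here and are not part of the patch series or of trace-cmd.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PATH_BUF 256	/* illustrative limit, not taken from the patches */

/*
 * Parse one text line of the proposed "inode_cache: hash=... path=..."
 * event. Returns 0 on success, -1 if the line does not match.
 */
static int parse_inode_cache_line(const char *line, uint64_t *hash,
				  char path[PATH_BUF])
{
	const char *p = strstr(line, "inode_cache: hash=");

	if (!p)
		return -1;
	/* %255s stops at whitespace; paths with spaces would need more care */
	if (sscanf(p, "inode_cache: hash=%" SCNx64 " path=%255s", hash, path) != 2)
		return -1;
	return 0;
}
```

A reader would feed every inode_cache line through this and keep the
(hash, path) pairs around to resolve the hashes seen in the stack traces.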

-- Steve


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:17         ` Steven Rostedt
@ 2025-08-28 20:27           ` Arnaldo Carvalho de Melo
  2025-08-28 20:42             ` Linus Torvalds
  2025-08-28 20:51             ` Steven Rostedt
  2025-08-28 20:38           ` Linus Torvalds
  1 sibling, 2 replies; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-28 20:27 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell



On August 28, 2025 5:17:18 PM GMT-03:00, Steven Rostedt <rostedt@kernel.org> wrote:
>On Thu, 28 Aug 2025 12:18:39 -0700
>Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Thu, 28 Aug 2025 at 11:58, Arnaldo Carvalho de Melo
>> <arnaldo.melo@gmail.com> wrote:
>> > >
>> > >Give the damn thing an actual filename or something *useful*, not a
>> > >number that user space can't even necessarily match up to anything.  
>> >
>> > A build ID?  
>> 
>> I think that's a better thing than the disgusting inode number, yes.
>
>I don't care what it is. I picked inode/device just because it was the
>only thing I saw available. I'm not sure build ID is appropriate either.
>
>> 
>> That said, I think they are problematic too, in that I don't think
>> they are universally available, so if you want to trace some
>> executable without build ids - and there are good reasons to do that -
>> you might hate being limited that way.
>> 
>> So I think you'd be much better off with just actual pathnames.
>
>As you mentioned below, the reason I avoided path names is that they
>take up too much of the ring buffer, and would be duplicated all over
>the place. I've run this for a while, and it only picked up a couple of
>hundred paths while the trace had several thousand stack traces.
>
>> 
>> Are there no trace events for "mmap this path"? Create a good u64 hash
>> from the contents of a 'struct path' (which is just two pointers: the
>> dentry and the mnt) when mmap'ing the file, and then you can just
>> associate the stack trace entry with that hash.
>
>I would love to have a hash to use. The next patch does the mapping of
>the inode numbers to their path name. It can easily be switched over to
>do the same with a hash number.

The path name is a nice to have detail, but a content based hash is what we want, no?

Tracing/profiling has to be about contents of files later used for analysis, and filenames provide no guarantee about that.

- Arnaldo

>
>> 
>> That should be simple and straightforward, and hashing two pointers
>> should be simple and straightforward.
>
>Would a hash of these pointers have any collisions? That would be bad.
>
>Hmm, I just tried using the pointer to vma->vm_file->f_inode, and that
>gives me a unique number. Then I just need to map that back to the path name:
>
>       trace-cmd-1016    [002] ...1.    34.675646: inode_cache: inode=ffff8881007ed428 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libc.so.6
>       trace-cmd-1016    [002] ...1.    34.675893: inode_cache: inode=ffff88811970e648 dev=[254:3] path=/usr/local/lib64/libtracefs.so.1.8.2
>       trace-cmd-1016    [002] ...1.    34.675933: inode_cache: inode=ffff88811970b8f8 dev=[254:3] path=/usr/local/lib64/libtraceevent.so.1.8.4
>       trace-cmd-1016    [002] ...1.    34.675981: inode_cache: inode=ffff888110b78ba8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7
>            bash-1007    [003] ...1.    34.677316: inode_cache: inode=ffff888103f05d38 dev=[254:3] path=/usr/bin/bash
>            bash-1007    [003] ...1.    35.432951: inode_cache: inode=ffff888116be94b8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5
>            bash-1018    [005] ...1.    36.104543: inode_cache: inode=ffff8881007e9dc8 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
>            bash-1018    [005] ...1.    36.110407: inode_cache: inode=ffff888110b78298 dev=[254:3] path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1
>            bash-1018    [005] ...1.    36.110536: inode_cache: inode=ffff888103d09dc8 dev=[254:3] path=/usr/local/bin/trace-cmd
>
>I just swapped out the inode with the above (unsigned long)vma->vm_file->f_inode,
>and it appears to be unique.
>
>Thus, I could use that as the "hash" value and then the above could be turned into:
>
>       trace-cmd-1016    [002] ...1.    34.675646: inode_cache: hash=ffff8881007ed428 path=/usr/lib/x86_64-linux-gnu/libc.so.6
>       trace-cmd-1016    [002] ...1.    34.675893: inode_cache: hash=ffff88811970e648 path=/usr/local/lib64/libtracefs.so.1.8.2
>       trace-cmd-1016    [002] ...1.    34.675933: inode_cache: hash=ffff88811970b8f8 path=/usr/local/lib64/libtraceevent.so.1.8.4
>       trace-cmd-1016    [002] ...1.    34.675981: inode_cache: hash=ffff888110b78ba8 path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7
>            bash-1007    [003] ...1.    34.677316: inode_cache: hash=ffff888103f05d38 path=/usr/bin/bash
>            bash-1007    [003] ...1.    35.432951: inode_cache: hash=ffff888116be94b8 path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5
>            bash-1018    [005] ...1.    36.104543: inode_cache: hash=ffff8881007e9dc8 path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
>            bash-1018    [005] ...1.    36.110407: inode_cache: hash=ffff888110b78298 path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1
>            bash-1018    [005] ...1.    36.110536: inode_cache: hash=ffff888103d09dc8 path=/usr/local/bin/trace-cmd
>
>This would mean the readers of the userstacktrace_delay need to also
>have this event enabled to do the mappings. But that shouldn't be an
>issue.
>
>-- Steve
>

- Arnaldo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:04         ` Arnaldo Carvalho de Melo
@ 2025-08-28 20:37           ` Linus Torvalds
  0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 20:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 13:04, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> > That said, I think they are problematic too, in that I don't think
> > they are universally available, so if you want to trace some
> > executable without build ids - and there are good reasons to do that -
> > you might hate being limited that way.
>
> Right, but these days gdb (and other traditional tools) supports it and
> downloads it. perf should do it with a one-time sticky question too (it
> already does it in some cases, unconditionally; that should be fixed as
> well). Most distros have it:

So I'm literally thinking one very valid case is debugging some
third-party binary with tracing.

End result: the whole "most distros have it" is just not relevant to
that situation.

          Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:17         ` Steven Rostedt
  2025-08-28 20:27           ` Arnaldo Carvalho de Melo
@ 2025-08-28 20:38           ` Linus Torvalds
  2025-08-28 20:48             ` Steven Rostedt
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 20:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 13:17, Steven Rostedt <rostedt@kernel.org> wrote:
>
> >
> > That should be simple and straightforward, and hashing two pointers
> > should be simple and straightforward.
>
> Would a hash of these pointers have any collisions? That would be bad.

What? Collisions in 64 bits when you have a handful of cases around?
Not an issue unless you picked your hash to be something ridiculous.
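
For a rough sense of scale, the usual birthday approximation for n
distinct files hashed uniformly into 64 bits is n(n-1)/2 divided by
2^64. A small sketch of that arithmetic (the function name is made up
for illustration, and a well-mixed hash is assumed):

```c
#include <stdint.h>

/*
 * Approximate probability that any two of n uniformly hashed items
 * collide in a 64-bit hash space (birthday bound, valid for small n).
 */
static double collision_probability_64(uint64_t n)
{
	const double space = 18446744073709551616.0;	/* 2^64 */
	double pairs;

	if (n < 2)
		return 0.0;
	pairs = (double)n * (double)(n - 1) / 2.0;
	return pairs / space;
}
```

For the couple of hundred paths mentioned earlier in the thread, this
comes out at roughly 1e-15.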

               Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:27           ` Arnaldo Carvalho de Melo
@ 2025-08-28 20:42             ` Linus Torvalds
  2025-08-28 20:51             ` Steven Rostedt
  1 sibling, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 20:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 13:27, Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> The path name is a nice to have detail, but a content based hash is what we want, no?

No.

We want something easy and quick to compute because this is looked up
at stacktrace time. That's the primary issue.

You can do the mapping to some build-id - if it even exists - later.

               Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:38           ` Linus Torvalds
@ 2025-08-28 20:48             ` Steven Rostedt
  2025-08-28 21:06               ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 13:38:33 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 28 Aug 2025 at 13:17, Steven Rostedt <rostedt@kernel.org> wrote:
> >  
> > >
> > > That should be simple and straightforward, and hashing two pointers
> > > should be simple and straightforward.  
> >
> > Would a hash of these pointers have any collisions? That would be bad.  
> 
> What? Collisions in 64 bits when you have a handful of cases around?
> Not an issue unless you picked your hash to be something ridiculous.
> 

Since I only need a unique identifier, and it appears that the
vma->vm_file->f_inode pointer is unique, would just using that be OK?

I could run it through the same hash algorithm that "%p" goes through so
that it's not a real memory address.

Getting to the path does require some more logic. Not to mention, this
may later need to handle JIT code (and we'll need a way to map to that
too).

-- Steve

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:27           ` Arnaldo Carvalho de Melo
  2025-08-28 20:42             ` Linus Torvalds
@ 2025-08-28 20:51             ` Steven Rostedt
  2025-08-28 21:00               ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 20:51 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 17:27:37 -0300
Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote:

> >I would love to have a hash to use. The next patch does the mapping
> >of the inode numbers to their path name. It can  
> 
> The path name is a nice to have detail, but a content based hash is
> what we want, no?
> 
> Tracing/profiling has to be about contents of files later used for
> analysis, and filenames provide no guarantee about that.

I could add the build id to the inode_cache as well (which I'll rename
to file_cache).

Thus, the user stack trace will just have the offset and a hash value
that will match the output of the file_cache event which will have
the path name and a build id (if one exists).

Would that work?

-- Steve

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:51             ` Steven Rostedt
@ 2025-08-28 21:00               ` Arnaldo Carvalho de Melo
  2025-08-28 21:27                 ` Steven Rostedt
  2025-08-29 16:27                 ` Sam James
  0 siblings, 2 replies; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-28 21:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell



On August 28, 2025 5:51:39 PM GMT-03:00, Steven Rostedt <rostedt@kernel.org> wrote:
>On Thu, 28 Aug 2025 17:27:37 -0300
>Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote:
>
>> >I would love to have a hash to use. The next patch does the mapping
>> >of the inode numbers to their path name. It can  
>> 
>> The path name is a nice to have detail, but a content based hash is
>> what we want, no?
>> 
>> Tracing/profiling has to be about contents of files later used for
>> analysis, and filenames provide no guarantee about that.
>
>I could add the build id to the inode_cache as well (which I'll rename
>to file_cache).
>
>Thus, the user stack trace will just have the offset and a hash value
>that will match the output of the file_cache event which will have
>the path name and a build id (if one exists).
>
>Would that work?

Probably.

This "if it is available" question is valid, but since 2016 it's more of a "did developers disable it explicitly?"

If my "googling" isn't wrong, GNU LD has defaulted to generating a build ID in ELF images since 2011 and clang's companion since 2016.

So making it even more available than what the BPF guys did long ago (and perf piggybacked on at some point) may be amenable, by having it cached, on request?, in some 20-byte alignment hole in task_struct that would only be used when profiling/tracing.

- Arnaldo 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 20:48             ` Steven Rostedt
@ 2025-08-28 21:06               ` Linus Torvalds
  2025-08-28 21:17                 ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 21:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 13:48, Steven Rostedt <rostedt@kernel.org> wrote:
>
> I could run it through the same hash algorithm that "%p" goes through so
> that it's not a real memory address.

For '%p', people can't easily trigger lots of different cases, and you
can't force kernel printouts from user space.

For something like tracing, user space *does* control the output, and
you shouldn't give people visibility into the hashing that '%p' does.

So you can certainly use siphash for hashing, but make sure to not use
the same secret key that the printing does.

As to the ID to hash, I actually think a 'struct file *' might be the
best thing to use - that's directly in the vma, no need to follow any
other pointers for it.
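
As a user-space sketch of that idea: SipHash-2-4 specialized to a single
64-bit word, keyed with its own secret. The names sip_key, siphash_u64()
and file_id() are illustrative only. In the kernel one would presumably
use the existing siphash helpers with a dedicated key instead of
open-coding this, and the key would have to come from the kernel's
random source, which is not shown here.

```c
#include <stdint.h>

struct sip_key {
	uint64_t k0, k1;	/* must differ from the key used for '%p' */
};

static inline uint64_t rotl64(uint64_t x, int b)
{
	return (x << b) | (x >> (64 - b));
}

#define SIPROUND(a, b, c, d) do {				\
	a += b; b = rotl64(b, 13); b ^= a; a = rotl64(a, 32);	\
	c += d; d = rotl64(d, 16); d ^= c;			\
	a += d; d = rotl64(d, 21); d ^= a;			\
	c += b; b = rotl64(b, 17); b ^= c; c = rotl64(c, 32);	\
} while (0)

/* SipHash-2-4 of one 64-bit word: 2 compression rounds, 4 final rounds */
static uint64_t siphash_u64(uint64_t word, const struct sip_key *key)
{
	uint64_t v0 = 0x736f6d6570736575ULL ^ key->k0;
	uint64_t v1 = 0x646f72616e646f6dULL ^ key->k1;
	uint64_t v2 = 0x6c7967656e657261ULL ^ key->k0;
	uint64_t v3 = 0x7465646279746573ULL ^ key->k1;
	const uint64_t len_block = 8ULL << 56;	/* input length in top byte */

	v3 ^= word;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= word;

	v3 ^= len_block;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= len_block;

	v2 ^= 0xff;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	return v0 ^ v1 ^ v2 ^ v3;
}

/* Turn a pointer (e.g. the file pointer held by the vma) into an opaque ID */
static uint64_t file_id(const void *file, const struct sip_key *key)
{
	return siphash_u64((uint64_t)(uintptr_t)file, key);
}
```

With a separate key, the IDs exposed through tracing say nothing about
how '%p' values are hashed.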

               Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 21:06               ` Linus Torvalds
@ 2025-08-28 21:17                 ` Steven Rostedt
  2025-08-28 22:10                   ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 14:06:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So you can certainly use siphash for hashing, but make sure to not use
> the same secret key that the printing does.

Right, I just meant to use the same algorithm. The key would be different.

> 
> As to the ID to hash, I actually think a 'struct file *' might be the
> best thing to use - that's directly in the vma, no need to follow any
> other pointers for it.

But that's unique per task, right? What I liked about the f_inode
pointer, is that it appears to be shared between tasks.

I only want to add a new hash and print the path for a new file. If
several tasks are using the same file (which they are with the
libraries), then having the hash be the same between tasks would be
more efficient.
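
To make the "only emit for a new file" step concrete, here is a
user-space sketch of a first-seen check. The series itself is described
as using an rhashtable; the fixed-size open-addressing table, the names
seen_set and seen_set_add(), and the table size are all stand-ins for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEEN_SLOTS 1024	/* power of two; illustrative size only */

struct seen_set {
	uint64_t slot[SEEN_SLOTS];	/* 0 means empty, so ID 0 is not supported */
	size_t used;
};

/*
 * Returns true the first time @id is seen (the caller would then emit
 * the mapping event). Returns false for repeats, for ID 0, and when the
 * table is full.
 */
static bool seen_set_add(struct seen_set *s, uint64_t id)
{
	/* multiplicative hash, keep the top 10 bits as the start slot */
	size_t i = (size_t)((id * 0x9e3779b97f4a7c15ULL) >> 54);

	/* always keep one slot empty so the probe loop terminates */
	if (id == 0 || s->used >= SEEN_SLOTS - 1)
		return false;

	for (;;) {
		if (s->slot[i] == id)
			return false;
		if (s->slot[i] == 0) {
			s->slot[i] = id;
			s->used++;
			return true;
		}
		i = (i + 1) & (SEEN_SLOTS - 1);	/* linear probing */
	}
}
```

Whether the key fed into such a table is shared between tasks or is
per-open is exactly the question being discussed here; the sketch does
not take a side.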

-- Steve

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 21:00               ` Arnaldo Carvalho de Melo
@ 2025-08-28 21:27                 ` Steven Rostedt
  2025-08-29 16:27                 ` Sam James
  1 sibling, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 21:27 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 18:00:22 -0300
Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote:

> >Thus, the user stack trace will just have the offset and a hash value
> >that will match the output of the file_cache event which will have
> >the path name and a build id (if one exists).
> >
> >Would that work?  
> 
> Probably.
> 
> This "if it is available" question is valid, but since 2016 it's
> more of a "did developers disable it explicitly?"

The "if one exists" comment is that it's not a requirement. If none
exists, it would just add a zero.
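
A sketch of what a post-processing tool could do with that layout. The
struct and function names are hypothetical, and the 20-byte build ID
field assumes a SHA-1 style ID; none of this is the actual event format.

```c
#include <stddef.h>
#include <stdint.h>

#define BUILD_ID_LEN 20	/* assumed SHA-1 sized build ID */

struct file_cache_entry {
	uint64_t hash;				/* ID shared with the stack trace */
	const char *path;
	uint8_t build_id[BUILD_ID_LEN];		/* all zero when the file has none */
};

struct user_frame {
	uint64_t hash;		/* identifies the mapped file */
	uint64_t offset;	/* offset of the address within that file */
};

/* Find the file_cache entry that a stack frame refers to, or NULL */
static const struct file_cache_entry *
resolve_frame(const struct file_cache_entry *tbl, size_t n,
	      const struct user_frame *frame)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].hash == frame->hash)
			return &tbl[i];
	}
	return NULL;
}

/* A build ID of all zeroes means "none recorded" */
static int has_build_id(const struct file_cache_entry *e)
{
	for (size_t i = 0; i < BUILD_ID_LEN; i++) {
		if (e->build_id[i])
			return 1;
	}
	return 0;
}
```

The tool would fall back to the path name whenever has_build_id()
reports that nothing was recorded.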

> 
> If my "googling" isn't wrong, GNU LD defaults to generating a build
> ID in ELF images since 2011 and clang's companion since 2016.
> 
> So making it even more available than what the BPF guys did long ago
> and perf piggybacked on at some point, by having it cached, on
> request?, in some 20 bytes alignment hole in task_struct that would
> be only used when profiling/tracing may be amenable.

Would perf be interested in this hash file lookup?

I know perf is reliant on user space more than ftrace is, and has a lot
of work happening in user space while getting stack traces. With
ftrace, there's no real user space requirement, thus a lot of the work
needs to be done in the kernel.

If we go with a hash to file, it's somewhat useless by itself without a
way to map the hash to file/buildid.

I originally started making this hash->file a file in tracefs. But then
I needed to figure out how to manage the allocations. Do I add a "size"
for that file and start dropping mappings when it reaches that limit?
Then I may need to add a LRU algorithm to do so. I found simply having
an event that wrote out the mappings was so much easier to implement.

But the file_cache code could be used by perf, where perf does the same
and just monitors the file_cache event. I could make the API more
global than just the kernel/trace directory.

-- Steve

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 21:17                 ` Steven Rostedt
@ 2025-08-28 22:10                   ` Linus Torvalds
  2025-08-28 22:44                     ` Steven Rostedt
  2025-08-29 15:06                     ` Steven Rostedt
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-28 22:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 28 Aug 2025 at 14:17, Steven Rostedt <rostedt@kernel.org> wrote:
>
> But that's unique per task, right? What I liked about the f_inode
> pointer, is that it appears to be shared between tasks.

I actually think the local meaning of the file pointer is an advantage.

It not only means that you see the difference in mappings of the same
file created with different open calls, it also means that when
different processes mmap the same executable, they don't see the same
hash.

And because the file pointer doesn't have any long-term meaning, it
also means that you also can't make the mistake of thinking the hash
has a long lifetime. With an inode pointer hash, you could easily have
software bugs that end up not realizing that it's a temporary hash,
and that the same inode *will* get two different hashes if the inode
has been flushed from memory and then loaded anew due to memory
pressure.

> I only want to add a new hash and print the path for a new file. If
> several tasks are using the same file (which they are with the
> libraries), then having the hash be the same between tasks would be
> more efficient.

Why? See above why I think it's a mistake to think those hashes have
lifetimes. They don't. Two different inodes can have the same hash due
to lifetime issues, and the same inode can get two different hashes at
different times for the same reason.

So you *need* to tie these things to the only lifetime that matters:
the open/close pair (and the mmap - and the stack traces - will be
part of that lifetime).

I literally think that you are not thinking about this right if you
think you can re-use the hash.

             Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 22:10                   ` Linus Torvalds
@ 2025-08-28 22:44                     ` Steven Rostedt
  2025-08-29 15:06                     ` Steven Rostedt
  1 sibling, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-28 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell


[ My last email of the night, as it's our anniversary, and I'm off to dinner now ;-) ]

On Thu, 28 Aug 2025 15:10:52 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 28 Aug 2025 at 14:17, Steven Rostedt <rostedt@kernel.org> wrote:
> >
> > But that's unique per task, right? What I liked about the f_inode
> > pointer, is that it appears to be shared between tasks.  
> 
> I actually think the local meaning of the file pointer is an advantage.
> 
> It not only means that you see the difference in mappings of the same
> file created with different open calls, it also means that when
> different processes mmap the same executable, they don't see the same
> hash.
> 
> And because the file pointer doesn't have any long-term meaning, it
> also means that you also can't make the mistake of thinking the hash
> has a long lifetime. With an inode pointer hash, you could easily have
> software bugs that end up not realizing that it's a temporary hash,
> and that the same inode *will* get two different hashes if the inode
> has been flushed from memory and then loaded anew due to memory
> pressure.

This is a reasonable argument. But it is still nice to have the same value
for all tasks. This is for a "file_cache" that does get flushed regularly
(when various changes happen to the tracefs system).

Its only purpose is to map the user space stack trace hash value to a path
name (and build-id).

But yeah, I do not want another file to get flagged with the same hash.

> 
> > I only want to add a new hash and print the path for a new file. If
> > several tasks are using the same file (which they are with the
> > libraries), then having the hash be the same between tasks would be
> > more efficient.  
> 
> Why? See above why I think it's a mistake to think those hashes have
> lifetimes. They don't. Two different inodes can have the same hash due
> to lifetime issues, and the same inode can get two different hashes at
> different times for the same reason.
> 
> So you *need* to tie these things to the only lifetime that matters:
> the open/close pair (and the mmap - and the stack traces - will be
> part of that lifetime).
> 
> I literally think that you are not thinking about this right if you
> think you can re-use the hash.

I'm just worried about this causing slowdowns, especially if I also track
the buildid. I did a quick update to the code to first use the f_inode and
get the build_id, and it gives:

       trace-cmd-1012    [003] ...1.    35.247318: inode_cache: hash=0xcb214087 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1012    [003] ...1.    35.247333: inode_cache: hash=0x2565194a path=/usr/local/bin/trace-cmd build_id={0x3f399e26,0xf9eb2d4d,0x475fa369,0xf5bb7eeb,0x6244ae85}
       trace-cmd-1012    [003] ...1.    35.247419: inode_cache: hash=0x22dca920 path=/usr/local/lib64/libtracefs.so.1.8.2 build_id={0x6b040bdb,0x961f23d6,0xc1e1027e,0x7067c348,0xd069fa67}
       trace-cmd-1012    [003] ...1.    35.247455: inode_cache: hash=0xe87b6ea5 path=/usr/local/lib64/libtraceevent.so.1.8.4 build_id={0x8946b4eb,0xe3bf4ec5,0x11fd7d86,0xcd3105e2,0xe44a8d4d}
       trace-cmd-1012    [003] ...1.    35.247488: inode_cache: hash=0xafc34117 path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7 build_id={0x379dc873,0x32bbdbc4,0x91eeb6cf,0xba549730,0xe2b96c55}
            bash-1003    [001] ...1.    35.248508: inode_cache: hash=0xcf9bd2d6 path=/usr/bin/bash build_id={0xd94aa36d,0x8e1f19c7,0xa4a69446,0x7338f602,0x20d66357}
  NetworkManager-581     [004] ...1.    35.703993: inode_cache: hash=0xea1c3e22 path=/usr/sbin/NetworkManager build_id={0x278c6dbb,0x4a1cdde6,0xa1a30a2c,0xbc417464,0x9dfaa28e}
            bash-1003    [001] ...1.    35.904817: inode_cache: hash=0x133252fa path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5 build_id={0xff2193a5,0xb2ece2f1,0x1bcbd242,0xca302a0b,0xc155fd26}
            bash-1013    [004] ...1.    37.716435: inode_cache: hash=0x53ae379b path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 build_id={0x4ed9e462,0xb302cd84,0x3ccf0104,0xbd80ac72,0x91c7fd44}
            bash-1013    [004] ...1.    37.722923: inode_cache: hash=0xa55a259e path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1 build_id={0xc2d9e5b6,0xb211e958,0xdef878e4,0xe4022df,0x9552253}

Now I changed it to be the file pointer, and it does give a bit more (see the duplicates):

    sshd-session-1004    [007] ...1.    98.940058: inode_cache: hash=0x41a6191a path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1016    [006] ...1.    98.940089: inode_cache: hash=0xcc38a542 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1016    [006] ...1.    98.940109: inode_cache: hash=0xa89cdd4b path=/usr/local/bin/trace-cmd build_id={0x3f399e26,0xf9eb2d4d,0x475fa369,0xf5bb7eeb,0x6244ae85}
       trace-cmd-1016    [006] ...1.    98.940410: inode_cache: hash=0xb3c570ca path=/usr/local/lib64/libtracefs.so.1.8.2 build_id={0x6b040bdb,0x961f23d6,0xc1e1027e,0x7067c348,0xd069fa67}
       trace-cmd-1016    [006] ...1.    98.940460: inode_cache: hash=0x4da4af85 path=/usr/local/lib64/libtraceevent.so.1.8.4 build_id={0x8946b4eb,0xe3bf4ec5,0x11fd7d86,0xcd3105e2,0xe44a8d4d}
       trace-cmd-1016    [006] ...1.    98.940513: inode_cache: hash=0xce16bd9d path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7 build_id={0x379dc873,0x32bbdbc4,0x91eeb6cf,0xba549730,0xe2b96c55}
            bash-1007    [004] ...1.    98.941772: inode_cache: hash=0x772df671 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
            bash-1007    [004] ...1.    98.941911: inode_cache: hash=0xdb764962 path=/usr/bin/bash build_id={0xd94aa36d,0x8e1f19c7,0xa4a69446,0x7338f602,0x20d66357}
            bash-1007    [004] ...1.   100.080299: inode_cache: hash=0xef3bf212 path=/usr/lib/x86_64-linux-gnu/libtinfo.so.6.5 build_id={0xff2193a5,0xb2ece2f1,0x1bcbd242,0xca302a0b,0xc155fd26}
           gmain-602     [003] ...1.   100.477235: inode_cache: hash=0xc9205658 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1017    [005] ...1.   101.412116: inode_cache: hash=0x5a77751e path=/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 build_id={0x4ed9e462,0xb302cd84,0x3ccf0104,0xbd80ac72,0x91c7fd44}
       trace-cmd-1017    [005] ...1.   101.417004: inode_cache: hash=0xf2e95689 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1017    [005] ...1.   101.418528: inode_cache: hash=0x5f35d3ca path=/usr/lib/x86_64-linux-gnu/libzstd.so.1.5.7 build_id={0x379dc873,0x32bbdbc4,0x91eeb6cf,0xba549730,0xe2b96c55}
       trace-cmd-1017    [005] ...1.   101.418572: inode_cache: hash=0x57feda78 path=/usr/lib/x86_64-linux-gnu/libz.so.1.3.1 build_id={0xc2d9e5b6,0xb211e958,0xdef878e4,0xe4022df,0x9552253}
       trace-cmd-1017    [005] ...1.   101.418620: inode_cache: hash=0x22ad5d84 path=/usr/local/lib64/libtraceevent.so.1.8.4 build_id={0x8946b4eb,0xe3bf4ec5,0x11fd7d86,0xcd3105e2,0xe44a8d4d}
       trace-cmd-1017    [005] ...1.   101.418666: inode_cache: hash=0x11c240a6 path=/usr/local/lib64/libtracefs.so.1.8.2 build_id={0x6b040bdb,0x961f23d6,0xc1e1027e,0x7067c348,0xd069fa67}
       trace-cmd-1017    [005] ...1.   101.418714: inode_cache: hash=0xf4e46cf path=/usr/local/bin/trace-cmd build_id={0x3f399e26,0xf9eb2d4d,0x475fa369,0xf5bb7eeb,0x6244ae85}
  wpa_supplicant-583     [000] ...1.   102.521195: inode_cache: hash=0xd20a587b path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1018    [005] ...1.   102.847910: inode_cache: hash=0xee16ee8e path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
    sshd-session-1004    [000] ...1.   102.853561: inode_cache: hash=0x3404c7ea path=/usr/lib/openssh/sshd-session build_id={0x3b119855,0x5b15323e,0xe1ec337a,0xbd49f66e,0x78bddd0f}
   systemd-udevd-323     [007] ...1.   125.800839: inode_cache: hash=0x760273d5 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
 systemd-journal-294     [000] ...1.   125.800932: inode_cache: hash=0x77f34056 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
   systemd-udevd-323     [007] ...1.   125.801135: inode_cache: hash=0xe70bd063 path=/usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so build_id={0x81d9bace,0x59f9953f,0x439928d7,0xe849d513,0xf2103286}
         systemd-1       [006] ...1.   125.801781: inode_cache: hash=0x42292844 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
         systemd-1       [006] ...1.   125.802811: inode_cache: hash=0x2cac8b3b path=/usr/lib/x86_64-linux-gnu/systemd/libsystemd-core-257.so build_id={0x580a80c5,0x931714d2,0xec54d3be,0xd5400bc0,0x6f2530ba}
         systemd-1       [006] ...1.   125.803740: inode_cache: hash=0xb17acaa6 path=/usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so build_id={0x81d9bace,0x59f9953f,0x439928d7,0xe849d513,0xf2103286}
            cron-541     [006] ...1.   138.192640: inode_cache: hash=0x9285db61 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
  NetworkManager-581     [005] ...1.   144.716224: inode_cache: hash=0xf3c5bbc1 path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
  NetworkManager-581     [005] ...1.   144.716392: inode_cache: hash=0x381883bb path=/usr/sbin/NetworkManager build_id={0x278c6dbb,0x4a1cdde6,0xa1a30a2c,0xbc417464,0x9dfaa28e}
  NetworkManager-581     [005] ...1.   146.385151: inode_cache: hash=0x43451e15 path=/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.8400.0 build_id={0x9a7d3e29,0x5d8ed8f,0xe399da0,0xb5d373da,0x3ca1049b}
         chronyd-663     [001] ...1.   157.080405: inode_cache: hash=0xa0db647a path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
         chronyd-663     [001] ...1.   158.152790: inode_cache: hash=0x1c471c4c path=/usr/sbin/chronyd build_id={0xf9588e62,0x3a8e6223,0x619fcb4f,0x12562bb,0x2ea104fb}

But maybe it's not enough to be an issue. But this will become more
predominant when sframes is built throughout. I only have a few
applications having sframes enabled so not every task is getting a full
stack trace, and hence, not all the files being touched are being displayed.

Just to clarify my concern. I want the stack traces to be quick and small.
I believe a 32 bit hash may be enough. And then have a side event that gets
updated when new files appear that can display much more information. This
side event may be slow which is why I don't want it to occur often. But I
do want it to occur for all new files.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 22:10                   ` Linus Torvalds
  2025-08-28 22:44                     ` Steven Rostedt
@ 2025-08-29 15:06                     ` Steven Rostedt
  2025-08-29 15:47                       ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Thu, 28 Aug 2025 15:10:52 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And because the file pointer doesn't have any long-term meaning, it
> also means that you also can't make the mistake of thinking the hash
> has a long lifetime. With an inode pointer hash, you could easily have
> software bugs that end up not realizing that it's a temporary hash,
> and that the same inode *will* get two different hashes if the inode
> has been flushed from memory and then loaded anew due to memory
> pressure.

The hash value can actually last longer than the file pointer. Thus, if we
use the file pointer for the hash, then we could risk it getting freed and
then allocated again for a different file. Then we get the same hash value
for two different paths.

What I'm looking at doing is using both the file pointer as well as its
path to make the hash:

struct jhash_key {
	void		*file;
	struct path	path;
};

u32 trace_file_cache_add(struct vm_area_struct *vma)
{
	[..]
	static u32 initval;
	u32 hash;

	if (!vma->vm_file)
		return 0;

	if (!initval)
		get_random_bytes(&initval, sizeof(initval));

	jkey.file = vma->vm_file;
	jkey.path = vma->vm_file->f_path;

	hash = jhash(&jkey, sizeof(jkey), initval);

	if (!trace_file_cache_enabled())
		return hash;

	[ add the hash to the rhashtable and print if unique ]

Hopefully by using both the file pointer and its path to create the hash,
it will stay unique for some time.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 15:06                     ` Steven Rostedt
@ 2025-08-29 15:47                       ` Linus Torvalds
  2025-08-29 16:07                         ` Linus Torvalds
  2025-08-29 16:19                         ` Steven Rostedt
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 15:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 08:06, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The hash value can actually last longer than the file pointer. Thus, if we
> use the file pointer for the hash, then we could risk it getting freed and
> then allocated again for a different file. Then we get the same hash value
> for two different paths.

No, no no.

That doesn't mean that the hash "lasts longer" than the file pointer.
Quite the reverse.

It is literally about the fact that YOU HAVE TO TAKE LIFETIMES INTO ACCOUNT.

So you are being confused, and that shows in your "solution".

And the thing is, the "you have to take lifetimes into account" is
true *regardless* of what you use as your index. It was true even with
inode numbers and major/minor numbers, in that file deletion and
creation would basically end up reusing the same "hash".

And this is my *point*: the advantage of the 'struct file *' is that
it is a local thing that gets reused potentially quite quickly, and
*forces* you to get the lifetime right.

So don't mess that up.

Except you do, and suggest this instead:

> What I'm looking at doing is using both the file pointer as well as its
> path to make the hash:

NO NO NO.

Now you are only saying "ok, I have a bogus lifetime, so I'll make a
hash where that isn't obvious any more because reuse is harder to
trigger".

IOW: YOU ARE MAKING THE BUG WORSE BY HIDING IT.

You're not fixing anything at all. You are literally making it obvious
that your design is bogus and you're not thinking things through.

So stop it. Really.

Instead, realize that *ANY* hash you use has a limited lifetime, and
the *ONLY* validity of that random number - whether it's a hash of the
file pointer, an inode number, or anything else - is DURING THE
MAPPING THAT IT USES.

As long as the mapping exists, you know the thing is stable, because
the mapping has a reference to the file (which has a reference to the
path, which has a reference to the inode - so it all stays consistent
and stable).

But *immediately* when the mapping goes away, it's now no longer valid
to think it has some meaning any more. Really. It might be a temporary
file that was already unlinked, and the 'munmap()' is the last thing
that releases the inode and it gets deleted from disk, and a new inode
is created with the exact same inode number, and maybe even the exact
same 'struct inode *' pointer.

And as long as you don't understand this, you will always get this
wrong, and you'll create bogus "workarounds" that just hide the REAL
bug. Bogus workarounds like making a random different hash that is
just less likely to show your mistake.

In other words, to get this right, you *have* to associate the hash
with the mmap creation that precedes it in the trace. You MUST NOT
reuse it, not unless you also have some kind of reference count model
that keeps track of how many mmap's that hash is associated with.

Put another way: the only valid way to reuse it is if you manually
track the lifetime of it. Anything else is WRONG.

Now, tracking the actual lifetime of the hash is probably doable, but
it's complex and error-prone. You can't do it by using the reference
count in the 'struct file' itself, because that would keep the
lifetime of the file artificially elevated, so you'd have to do some
entirely separate thing that tracks things. Don't do it.

Anyway, the way to fix this is to not care about lifetimes at all:
just treat the hash as the random number it is, and just accept the
fact that the number gets actively reused and has no meaning.

Instead, just make sure that when you *use* the hash in user space,
you always associate the hash with the previous trace event for the
mmap that used that hash.

You need to look up the event anyway to figure out what the hash means.

And this is where the whole "short lifetime" is so important. It's
what *forces* you to get this right, instead of doing the wrong thing
and thinking that hashes have lifetimes that they simply do not have.

The number in the stack trace - regardless of what that number is - is
*ONLY* valid if you associate it with the last mmap that created that
number.

You don't even have to care about the unmap event, because that unmap
- while it will potentially kill the lifetime of the hash if it was
the last use of that file - also means that now there won't be any new
stack traces with that hash any more. So you can ignore the lifetime
in that respect: all that matters is that yes, it can get re-used, but
you'll see a new mmap event with that hash if it is.

(And then you might still have the complexity with per-cpu trace
buffers etc where the ordering between an mmap event on one CPU might
not be entirely obvious wrt the stack trace event on another CPU with a
different thread that shares the same VM, but that's no different from
any of the other percpu trace buffer ordering issues).
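
A rough user-space sketch of that rule, assuming the trace has already
been parsed and merged into timestamp order (so the per-cpu ordering
caveat above is taken care of elsewhere). Every event that announces a
hash simply overwrites whatever that hash meant before, and a stack trace
entry is resolved against whatever the hash means at that point in the
stream. The names 'struct resolver', 'resolver_announce()' and
'resolver_lookup()' are made up for illustration and are not part of any
existing tool:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BINDINGS 64

/* Hypothetical parsed form of one "this hash means this path" event. */
struct binding {
	uint32_t hash;
	const char *path;
};

/* Current meaning of every hash seen so far in the event stream. */
struct resolver {
	struct binding slots[MAX_BINDINGS];
	size_t count;
};

/*
 * Called for each announcement event, in timestamp order. A later event
 * for the same hash replaces the earlier one; the old meaning is simply
 * forgotten, no lifetime tracking needed.
 */
static void resolver_announce(struct resolver *r, uint32_t hash,
			      const char *path)
{
	for (size_t i = 0; i < r->count; i++) {
		if (r->slots[i].hash == hash) {
			r->slots[i].path = path;
			return;
		}
	}
	assert(r->count < MAX_BINDINGS);	/* sketch: fixed capacity */
	r->slots[r->count].hash = hash;
	r->slots[r->count].path = path;
	r->count++;
}

/*
 * Called for each stack trace entry, in timestamp order. Returns the
 * path from the most recent announcement of this hash, or NULL if the
 * hash has not been announced yet.
 */
static const char *resolver_lookup(const struct resolver *r, uint32_t hash)
{
	for (size_t i = 0; i < r->count; i++) {
		if (r->slots[i].hash == hash)
			return r->slots[i].path;
	}
	return NULL;
}
```

The linear scan is only there to keep the sketch short; a real tool
would presumably use a proper hash table keyed on the 32-bit value.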

                 Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 15:47                       ` Linus Torvalds
@ 2025-08-29 16:07                         ` Linus Torvalds
  2025-08-29 16:33                           ` Steven Rostedt
  2025-08-29 16:19                         ` Steven Rostedt
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 16:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 08:47, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, the way to fix this is to not care about lifetimes at all:
> just treat the hash as the random number it is, and just accept the
> fact that the number gets actively reused and has no meaning.

Side note: the actual re-use of various pointers and/or inode numbers
is going to be very very random.

Classic old filesystems that live by inode numbers will use
'iget5_locked()' and will basically have the same 'struct inode *'
pointer too when they re-use an inode number.

And they likely also have a very simplistic inode allocation model and
an unlink followed by a file creation probably *will* re-use that same
inode number. So you can probably see 'struct inode *' get reused
quite quickly and reliably for entirely unrelated files just based on
file deletion/creation patterns.

The dentry pointer will typically stick around rather aggressively,
and will likely remain the same when you delete a file and create
another one with the same name, and the mnt pointer will stick around
too, so the contents of 'struct path' will be the exact same for two
completely different files across a delete/create event.

So hashing the path is very likely to stay the same as long as the
actual path stays the same, but would be fairly insensitive to the
underlying data changing. People might not care, particularly with
executables and libraries that simply don't get switched around much.

And, 'struct file *' will get reused randomly just based on memory
allocation issues, but I wouldn't be surprised if a close/open
sequence would get the same 'struct file *' pointer.

So these will all have various different 'value stays the same, but
the underlying data changed' patterns. I really think that you should
just treat the hash as a very random number, not assign it *any*
meaning at trace collection time, and the more random the better.

And then do all the "figure it out" work in user space when *looking*
at the traces. It might be a bit more work, and involve a bit more
data, but I _think_ it should be very straightforward to just do a
"what was the last mmap that had this hash"

               Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 15:47                       ` Linus Torvalds
  2025-08-29 16:07                         ` Linus Torvalds
@ 2025-08-29 16:19                         ` Steven Rostedt
  2025-08-29 16:28                           ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 16:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 08:47:44 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> You don't even have to care about the unmap event, because that unmap
> - while it will potentially kill the lifetime of the hash if it was
> the last use of that file - also means that now there won't be any new
> stack traces with that hash any more. So you can ignore the lifetime
> in that respect: all that matters is that yes, it can get re-used, but
> you'll see a new mmap event with that hash if it is.

Basically what I need is that every time I add a file/hash mapping to the
hashtable, I really need a callback to know when that file goes away. And
then I can remove it from the hash table, so that the next time that hash is
added, it will trigger another "print the file associated with this hash".

It's OK to have the same hash for multiple files as long as it is traced.

All events have timestamps associated with them, so it is trivial to map
which hash mapping belongs to which stack trace.

The reason for the file_cache is to keep from having to do the lookup of
the file at every stack trace. But if I can have a callback for when that
vma gets changed, (as I'm assuming the file will last as long as the vma is
unmodified), then the callback could remove the hash value and this would
not be a problem.

My question now is, is there a callback that can be registered by the
file_cache to know when the vma or the file changes?

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-28 21:00               ` Arnaldo Carvalho de Melo
  2025-08-28 21:27                 ` Steven Rostedt
@ 2025-08-29 16:27                 ` Sam James
  1 sibling, 0 replies; 59+ messages in thread
From: Sam James @ 2025-08-29 16:27 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, Linus Torvalds, linux-kernel, linux-trace-kernel,
	bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Kees Cook, Carlos O'Donell

Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> writes:

> On August 28, 2025 5:51:39 PM GMT-03:00, Steven Rostedt <rostedt@kernel.org> wrote:
>>On Thu, 28 Aug 2025 17:27:37 -0300
>>Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com> wrote:
>>
>>> >I would love to have a hash to use. The next patch does the mapping
>>> >of the inode numbers to their path name. It can  
>>> 
>>> The path name is a nice to have detail, but a content based hash is
>>> what we want, no?
>>> 
>>> Tracing/profiling has to be about contents of files later used for
>>> analysis, and filenames provide no guarantee about that.
>>
>>I could add the build id to the inode_cache as well (which I'll rename
>>to file_cache).
>>
>>Thus, the user stack trace will just have the offset and a hash value
>>that will match the output of the file_cache event which will have
>>the path name and a build id (if one exists).
>>
>>Would that work?
>
> Probably.
>
> This "if it is available" question is valid, but since 2016 it is more of a "did developers disable it explicitly?"
>
> If my "googling" isn't wrong, GNU LD defaults to generating a build ID in ELF images since 2011 and clang's companion since 2016.

GNU ld doesn't ever default to generating build IDs, and I don't *think*
LLVM does either (either in Clang, or LLD).

GCC, on the other hand, has a configure arg to control this, but it's
default-off. Clang generally prefers to have defaults like this done via
user/sysadmin specified configuration files rather than adding
build-time configure flags.

Now, is it a reasonable ask to say "we require build IDs for this
feature"? Yeah, it probably is, but it's not default-on right now, and
indeed we in Gentoo aren't using them yet (but I'm working on enabling
them).

>
> So making it even more available than what the BPF guys did long ago
> and perf piggybacked on at some point, by having it cached, on
> request?, in some 20 bytes alignment hole in task_struct that would be
> only used when profiling/tracing may be amenable.

thanks,
sam


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:19                         ` Steven Rostedt
@ 2025-08-29 16:28                           ` Linus Torvalds
  2025-08-29 16:49                             ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 16:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 09:18, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Basically what I need is that every time I add a file/hash mapping to the
> hashtable, I really need a callback to know when that file goes away. And
> then I can remove it from the hash table, so that the next time that hash is
> added, it will trigger another "print the file associated with this hash".

That works, but why would you care?

Why don't you just register the hash value and NOT CARE.

Leave it all to later when the trace gets analyzed.

Leave it be. The normal situation is presumably going to be that
millions of stack traces will be generated, and nobody will even look
at them.

> My question now is, is there a callback that can be registered by the
> file_cache to know when the vma or the file changes?

No. And what's the point? I just told you that unmap doesn't matter.
All that matters is mmap.

Don't try to "reuse" hashes. Just treat them as opaque numbers.

             Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:07                         ` Linus Torvalds
@ 2025-08-29 16:33                           ` Steven Rostedt
  2025-08-29 16:42                             ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 16:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 09:07:58 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The dentry pointer will typically stick around rather aggressively,
> and will likely remain the same when you delete a file and create
> another one with the same name, and the mnt pointer will stick around
> too, so the contents of 'struct path' will be the exact same for two
> completely different files across a delete/create event.

I'm not sure how often a trace would span the event of running code,
deleting the code, recreating it, and running it again. But that means the
stack traces of the original code will be useless regardless. But at a
minimum, the recreating of the code should trigger another print, and this
would give it a different build-id (which I'm now recording as well as the
path).

> 
> So hashing the path is very likely to stay the same as long as the
> actual path stays the same, but would be fairly insensitive to the
> underlying data changing. People might not care, particularly with
> executables and libraries that simply don't get switched around much.
> 
> And, 'struct file *' will get reused randomly just based on memory
> allocation issues, but I wouldn't be surprised if a close/open
> sequence would get the same 'struct file *' pointer.
> 
> So these will all have various different 'value stays the same, but
> the underlying data changed' patterns. I really think that you should
> just treat the hash as a very random number, not assign it *any*
> meaning at trace collection time, and the more random the better.
> 
> And then do all the "figure it out" work in user space when *looking*
> at the traces. It might be a bit more work, and involve a bit more
> data, but I _think_ it should be very straightforward to just do a
> "what was the last mmap that had this hash"

I just realized that I'm using the rhashtable as a "does this hash exist".
I could get the content of the item that matches the hash and compare it to
what was used to create the hash in the first place. If there's a reference
counter or some other identifier I could use to know that the passed in vma
is the same as what is in the hash table, I can use this to know if the
hash needs to be updated with the new information or not.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:33                           ` Steven Rostedt
@ 2025-08-29 16:42                             ` Linus Torvalds
  2025-08-29 16:50                               ` Linus Torvalds
  2025-08-29 16:57                               ` Steven Rostedt
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 09:33, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I just realized that I'm using the rhashtable as a "does this hash exist".

The question is still *why*?

Just use the hash. Don't do anything to it. Don't mess with it.

> I could get the content of the item that matches the hash and compare it to
> what was used to create the hash in the first place. If there's a reference
> counter or some other identifier I could use to know that the passed in vma
> is the same as what is in the hash table, I can use this to know if the
> hash needs to be updated with the new information or not.

No such information exists.

Sure, we have reference counts for everything: to a very close
approximation, any memory allocation with external visibility has to
be reference counted for correctness.

But those reference counts are never going to tell you whether they
are the *same* object that they were last time you looked at it, or
just a new allocation that happens to have the same pointer.

Don't even *TRY*.

You still haven't explained why you would care. Your patch that used
inode numbers didn't care. It just used the numbers.

SO JUST USE THE NUMBERS, for chrissake! Don't make them mean anything.
Don't try to think they mean something.

The *reason* I think hashing 'struct file *' is better than the
alternative is exactly that it *cannot* mean anything. It will get
re-used quite actively, even when nobody actually changes any of the
files. So you are forced to deal with this correctly, even though you
seem to be fighting dealing with it correctly tooth and nail.

And at no point have you explained why you can't just treat it as
meaningless numbers. The patch that started this all did exactly that.
It just used the *wrong* numbers, and I pointed out why they were
wrong, and why you shouldn't use those numbers.

          Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:28                           ` Linus Torvalds
@ 2025-08-29 16:49                             ` Steven Rostedt
  2025-08-29 16:59                               ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 09:28:41 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> Don't try to "reuse" hashes. Just treat them as opaque numbers.

What do I use to make the hash?

One thing this is trying to do is not have to look up the path name for
every line of a stack trace.

I could just have every instance do the full look up, make a hash from the
path name and build id, and pass the hash to the caller. My worry is the
time it takes to generate that.

Perhaps I could have a hash that maps the pid with the vma->vm_start, and
if that's unique, get the path and build-id and create the hash for that
and send it back to the user. Save the hash for that mapping in the
rhashtable with the pid/vm_start as the key.

Then the code that adds the vma will check whether the pid/vma->vm_start
key exists. If it does, it returns the hash associated with it; if it does
not, it adds the key and triggers the event saying a new address has been
created.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:42                             ` Linus Torvalds
@ 2025-08-29 16:50                               ` Linus Torvalds
  2025-08-29 17:02                                 ` Steven Rostedt
  2025-08-29 16:57                               ` Steven Rostedt
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 16:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 09:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Just use the hash. Don't do anything to it. Don't mess with it.

In fact, at actual stack trace time, don't even do the hashing. Just
save the raw pointer value (but as a *value*, not as a pointer: we
absolutely do *not* want people to think that the random value can be
used as a 'struct file' *: it needs to be a plain unsigned long, not
some kernel pointer).

Then the hashing can happen when you expose those entries to user
space (in the "print" stage). At that point you can do that

       hash = siphash_1u64(value, secret);

thing.

That will likely help I$ and D$ too, since you won't be accessing the
secret hashing data randomly, but do it only at trace output time
(presumably in a fairly tight loop at that point).

           Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:42                             ` Linus Torvalds
  2025-08-29 16:50                               ` Linus Torvalds
@ 2025-08-29 16:57                               ` Steven Rostedt
  2025-08-29 17:02                                 ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 16:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 09:42:03 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 09:33, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I just realized that I'm using the rhashtable as a "does this hash exist".
> 
> The question is still *why*?

The reason is to keep from triggering the event that records the pathname
for every look up.

> 
> SO JUST USE THE NUMBERS, for chrissake! Don't make them mean anything.
> Don't try to think they mean something.
> 
> The *reason* I think hashing 'struct file *' is better than the
> alternative is exactly that it *cannot* mean anything. It will get
> re-used quite actively, even when nobody actually changes any of the
> files. So you are forced to deal with this correctly, even though you
> seem to be fighting dealing with it correctly tooth and nail.
> 
> And at no point have you explained why you can't just treat it as
> meaningless numbers. The patch that started this all did exactly that.
> It just used the *wrong* numbers, and I pointed out why they were
> wrong, and why you shouldn't use those numbers.

I agree. The hash I showed last time was just using the pointers. The hash
itself is meaningless and is useless by itself. The only thing the hash is
doing is to be an identifier in the stack trace so that the path name and
buildid don't need to be generated and saved every time.

In my other email, I'm thinking of using the pid / vma->vm_start as a key
to know if the pathname needs to be printed again or not. Although, perhaps
if a task does a dlopen(), loads some text and executes it, then does a
dlclose() and another dlopen() that loads text, this could break the
assumption that the vm_start is unique per file.

Just to clarify, the goal of this exercise is to avoid the work of creating
and generating the pathnames and buildids for every lookup / stacktrace.

Now maybe hashing the pathname isn't as expensive as I think it may be. And
just doing that could be "good enough".


-- Steve



* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:49                             ` Steven Rostedt
@ 2025-08-29 16:59                               ` Linus Torvalds
  2025-08-29 17:17                                 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 16:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 09:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> What do I use to make the hash?

Literally just '(unsigned long)(vma->vm_file)'.

Nothing else.

> One thing this is trying to do is not have to look up the path name for
> every line of a stack trace.

That's the *opposite* of what I've been suggesting. I've literally
been talking about just saving off the hash of the file pointer.

(And I just suggested that what you actually save off isn't even the
hash - just the value - and that you can hash it later at a less
critical point in time)

Don't do *any* work at all at trace collection time. All you need is
to literally access three fields in the 'vma':

 - 'vm_start' and 'vm_pgoff' are needed to calculate the offset in the
file using the user space address

 - save off the value of 'vm_file' for later hashing

and I really think you're done.
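
A minimal userspace sketch of that collection step, under stated
assumptions: 'struct vma_sketch', 'struct frame_sketch', 'record_frame()'
and SKETCH_PAGE_SHIFT are hypothetical names for illustration and not
kernel API, and 4 KiB pages are assumed. It only shows the arithmetic on
the three fields listed above.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the three vma fields named above. */
struct vma_sketch {
	uint64_t vm_start;	/* first user address of the mapping */
	uint64_t vm_pgoff;	/* offset of the mapping into the file, in pages */
	const void *vm_file;	/* backing file, used only as an opaque value */
};

/* What would be saved per stack frame in this sketch. */
struct frame_sketch {
	uint64_t file_offset;		/* byte offset into the mapped file */
	unsigned long file_value;	/* plain number, never dereferenced */
};

static struct frame_sketch record_frame(const struct vma_sketch *vma,
					uint64_t addr)
{
	struct frame_sketch f;

	/* Offset inside the mapping plus where the mapping starts in the file. */
	f.file_offset = (addr - vma->vm_start) +
			(vma->vm_pgoff << SKETCH_PAGE_SHIFT);
	/* Keep the pointer only as a value, not as something to follow. */
	f.file_value = (unsigned long)(uintptr_t)vma->vm_file;
	return f;
}
```

No path lookup, reference counting, or table lookup happens here; the
value still has to be hashed before it is exposed to user space.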

Then, for the actual trace, you need two things:

 - you need the mmap trace event that has the 'file' value, and you
create a mmap event with that value hashed, and at that point you also
output the pathname and/or things like the build ID

 - for the stack trace events, you output the offset in the file, and
you hash and output the file value

now, in user space, you have all you need. All you do is match the
hashes. They are random numbers, and user space cannot know what they
are. They are just cookies as a mapping ID.

And look, now you have the pathname and the build ID - or whatever you
saved off in that mmap event. And at stack trace time, you needed to
do *nothing*.

And mmap is rare enough - and heavy enough - that doing that pathname
and build ID at *that* point is a non-issue.
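
The user space matching could look roughly like the sketch below. It is an
illustration only: 'struct mmap_event_sketch' and 'last_mmap_for()' are
hypothetical names, the event layout is assumed, and the events are assumed
to be already sorted by timestamp. The hash is used purely as a cookie, and
"the last mmap that had this hash" wins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded mmap event as a post-processing tool might hold it. */
struct mmap_event_sketch {
	uint64_t ts;		/* timestamp of the mmap event */
	uint64_t hash;		/* opaque cookie from the trace */
	const char *path;	/* pathname recorded at mmap time */
};

/*
 * Return the most recent mmap event with the given hash that is not newer
 * than 'ts', or NULL if the trace holds none. 'ev' must be sorted by 'ts'.
 */
static const struct mmap_event_sketch *
last_mmap_for(const struct mmap_event_sketch *ev, size_t n,
	      uint64_t hash, uint64_t ts)
{
	const struct mmap_event_sketch *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ev[i].ts > ts)
			break;		/* later events cannot apply */
		if (ev[i].hash == hash)
			best = &ev[i];	/* a newer match replaces an older one */
	}
	return best;
}
```

Because a reused hash simply shows up as a newer mmap event, the tool never
has to reason about whether two equal hashes refer to the same file.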

See what I'm trying to say?

               Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:50                               ` Linus Torvalds
@ 2025-08-29 17:02                                 ` Steven Rostedt
  2025-08-29 17:13                                   ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 17:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 09:50:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 09:42, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Just use the hash. Don't do anything to it. Don't mess with it.  
> 
> In fact, at actual stack trace time, don't even do the hashing. Just
> save the raw pointer value (but as a *value*, not as a pointer: we
> absolutely do *not* want people to think that the random value can be
> used as a 'struct file' *: it needs to be a plain unsigned long, not
> some kernel pointer).
> 
> Then the hashing can happen when you expose those entries to user
> space (in the "print" stage). At that point you can do that
> 
>        hash = siphash_1u64(value, secret);
> 
> thing.
> 
> That will likely help I$ and D$ too, since you won't be accessing the
> secret hashing data randomly, but do it only at trace output time
> (presumably in a fairly tight loop at that point).

Note, the ring buffer can be mapped to user space. So anything written into
the buffer is already exposed. The "at trace output time" is done by user
space, not the kernel (except when using "trace" and "trace_pipe" files).

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:57                               ` Steven Rostedt
@ 2025-08-29 17:02                                 ` Linus Torvalds
  2025-08-29 17:52                                   ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 17:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 09:57, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The reason is to keep from triggering the event that records the pathname
> for every look up.

BUT THAT WAS NEVER THE POINT.

There is only a single 64-bit number. No lookup. No pointer following.
No nothing.

The whole point of hashing was to get an *opaque* thing very quickly.
Not a pathname. No reference counting. No verifying whether you have
seen it before.

Literally just something that you can match up in the trace file much
much later.

(And, honestly, the likely thing is that you never match it up at all
- you can delay the "match it up" until a human actually looks at a
trace, which is presumably going to be a "one in a million" case).

              Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:02                                 ` Steven Rostedt
@ 2025-08-29 17:13                                   ` Linus Torvalds
  2025-08-29 17:57                                     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 17:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 10:02, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Note, the ring buffer can be mapped to user space. So anything written into
> the buffer is already exposed.

Oh, good point. Yeah, that means that you have to do the hashing
immediately. Too bad. Because while 'vma->vm_file' is basically free
(since you have to access the vma for other reasons anyway), a good
hash isn't.

siphash is good and fast for being what it is, but it's not completely
free. It's something like 50 shift/xor pairs, and it obviously needs
to also access that secret hash value that is likely behind a cache
miss..

Still, I suspect it's the best we've got.

(If hashing is noticeable, it *might* be worth it to use
'siphash_1u32()' and only hash 32 bits of the pointers. That makes the
hashing slightly cheaper, and since the low bits of the pointer will
be zero anyway due to alignment, and the high bits don't have a lot of
information in them either, it doesn't actually remove much
information. You might get collisions if the two pointers are exactly
32 GB apart or whatever, but that sounds really really unlucky)
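
One reading of the "32 GB apart" figure, sketched in plain C: if the three
low alignment bits are shifted out and the next 32 bits are kept, two
pointers fold to the same input when they differ by 2^32 * 8 bytes = 32 GiB.
The 8-byte alignment and the name 'fold_pointer_32()' are assumptions for
illustration; the mail does not say which 32 bits would be used, and the
result would still be fed to the real hash.

```c
#include <stdint.h>

#define SKETCH_ALIGN_SHIFT 3	/* assumption: pointers are 8-byte aligned */

/*
 * Reduce a 64-bit pointer value to 32 bits before hashing: drop the low
 * bits that are always zero, then truncate the high bits.
 */
static uint32_t fold_pointer_32(uint64_t value)
{
	return (uint32_t)(value >> SKETCH_ALIGN_SHIFT);
}
```

With this folding, neighbouring objects stay distinct, and only values a
multiple of 32 GiB apart become indistinguishable.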

                Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 16:59                               ` Linus Torvalds
@ 2025-08-29 17:17                                 ` Arnaldo Carvalho de Melo
  2025-08-29 17:33                                   ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-29 17:17 UTC (permalink / raw)
  To: Linus Torvalds, Steven Rostedt
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell



On August 29, 2025 1:59:21 PM GMT-03:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, 29 Aug 2025 at 09:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> What do I use to make the hash?
>
>Literally just '(unsigned long)(vma->vm_file)'.
>
>Nothing else.
>
>> One thing this is trying to do is not have to look up the path name for
>> every line of a stack trace.
>
>That's the *opposite* of what I've been suggesting. I've literally
>been talking about just saving off the hash of the file pointer.
>
>(And I just suggested that what you actually save off isn't even the
>hash - just the value - and that you can hash it later at a less
>critical point in time)
>
>Don't do *any* work at all at trace collection time. All you need is
>to literally access three fields in the 'vma':
>
> - 'vm_start' and 'vm_pgoff' are needed to calculate the offset in the
>file using the user space address
>
> - save off the value of 'vm_file' for later hashing
>
>and I really think you're done.
>
>Then, for the actual trace, you need two things:
>
> - you need the mmap trace event that has the 'file' value, and you
>create a mmap event with that value hashed, and at that point you also
>output the pathname and/or things like the build ID
>
> - for the stack trace events, you output the offset in the file, and
>you hash and output the file value
>
>now, in user space, you have all you need. All you do is match the
>hashes. They are random numbers, and user space cannot know what they
>are. They are just cookies as a mapping ID.
>
>And look, now you have the pathname and the build ID - or whatever you
>saved off in that mmap event. And at stack trace time, you needed to
>do *nothing*.
>
>And mmap is rare enough - and heavy enough - that doing that pathname
>and build ID at *that* point is a non-issue.


Or using a preexisting one in the DSO used for the executable mmap.

As long as we don't lose those mmap events due to memory pressure/lost events and we have timestamps to order it all before lookups, yeah should work.

- Arnaldo 

>
>See what I'm trying to say?
>
>               Linus

- Arnaldo


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:17                                 ` Arnaldo Carvalho de Melo
@ 2025-08-29 17:33                                   ` Linus Torvalds
  2025-08-29 18:11                                     ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 17:33 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
	bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Fri, 29 Aug 2025 at 10:18, Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> As long as we don't lose those mmap events due to memory pressure/lost
> events and we have timestamps to order it all before lookups, yeah
> should work.

The main reason to lose mmap events that I can see is that you start
tracing in the middle of running something (for example, tracing
systemd or some other "started at boot" thing).

Then you'd not have any record of an actual mmap at all because it
happened before you started tracing, even if there is no memory
pressure or other thing going on.

That is not necessarily a show-stopper: you could have some fairly
simple count for "how many times have I seen this hash", and add a
"mmap reminder" event (which would just be the exact same thing as the
regular mmap event).

You do it for the first time you see it, and every N times afterwards
(maybe by simply using a counter array that is indexed by the low bits
of the hash, and incrementing it for every hash you see, and if it was
zero modulo N you do that "mmap reminder" thing).

Yes, at that point you'd need to do that whole "generate path and
build ID", but if 'N' is a large enough number, it's pretty rare.
Maybe using a 16-bit counter would be sufficient (ie N would naturally
be 65536 when it becomes zero again).
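
A rough single-threaded sketch of that counter array, assuming a
power-of-two table size; REMINDER_SLOTS, 'reminder_count' and
'need_mmap_reminder()' are hypothetical names, and real kernel code would
also have to consider concurrency.

```c
#include <stdbool.h>
#include <stdint.h>

#define REMINDER_SLOTS 1024	/* assumption: power of two, size is arbitrary */

/* One 16-bit counter per slot; unsigned wraparound gives N = 65536. */
static uint16_t reminder_count[REMINDER_SLOTS];

/*
 * Call once for every hash seen in a stack trace. Returns true when the
 * "mmap reminder" should be emitted: the first time a slot is used, and
 * again each time its counter has wrapped back to zero.
 */
static bool need_mmap_reminder(uint64_t hash)
{
	uint16_t *slot = &reminder_count[hash & (REMINDER_SLOTS - 1)];
	bool was_zero = (*slot == 0);

	(*slot)++;	/* 65535 + 1 wraps to 0 in a uint16_t */
	return was_zero;
}
```

Hashes that share low bits share a counter in this sketch, so a reminder
may come earlier or later than N occurrences of one particular hash; that
only shifts when the rare slow path runs.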

That might be a good thing regardless just to have some guaranteed
limit of how far back in the trace you need to go to find the mmap
information for some hash. If you have long traces, maybe you don't
want to walk back billions of events.

But I wouldn't suggest doing that as a *first* implementation. I'm
just saying that it could be added if people find that it's a problem.

            Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:02                                 ` Linus Torvalds
@ 2025-08-29 17:52                                   ` Steven Rostedt
  0 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Arnaldo Carvalho de Melo, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 10:02:40 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (And, honestly, the likely thing is that you never match it up at all
> - you can delay the "match it up" until a human actually looks at a
> trace, which is presumably going to be a "one in a million" case).

Note the use case for this is for tooling that will be using these traces
for either flame graphs or for seeing where trouble areas are. That is, if
someone is enabling these stack traces, they most definitely will be looked
at. Maybe not directly by a human, but the tooling will and it will need the
mapping information.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:13                                   ` Linus Torvalds
@ 2025-08-29 17:57                                     ` Arnaldo Carvalho de Melo
  2025-08-29 20:51                                       ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-29 17:57 UTC (permalink / raw)
  To: Linus Torvalds, Steven Rostedt
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell



On August 29, 2025 2:13:33 PM GMT-03:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Fri, 29 Aug 2025 at 10:02, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> Note, the ring buffer can be mapped to user space. So anything written into
>> the buffer is already exposed.
>
>Oh, good point. Yeah, that means that you have to do the hashing
>immediately. Too bad. Because while 'vma->vm_file' is basically free
>(since you have to access the vma for other reasons anyway), a good
>hash isn't.


Can't we continue with that idea by using some VMA sequential number that doesn't expose anything critical to user space but allows us to match stack entries to mmap records?

Deferring the heavy lifting until it is needed is great.

- Arnaldo 

>
>siphash is good and fast for being what it is, but it's not completely
>free. It's something like 50 shift/xor pairs, and it obviously needs
>to also access that secret hash value that is likely behind a cache
>miss..
>
>Still, I suspect it's the best we've got.
>
>(If hashing is noticeable, it *might* be worth it to use
>'siphash_1u32()' and only hash 32 bits of the pointers. That makes the
>hashing slightly cheaper, and since the low bits of the pointer will
>be zero anyway due to alignment, and the high bits don't have a lot of
>information in them either, it doesn't actually remove much
>information. You might get collisions if the two pointers are exactly
>32 GB apart or whatever, but that sounds really really unlucky)
>
>                Linus

- Arnaldo


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:33                                   ` Linus Torvalds
@ 2025-08-29 18:11                                     ` Steven Rostedt
  2025-08-29 20:54                                       ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 10:33:38 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 10:18, Arnaldo Carvalho de Melo
> <arnaldo.melo@gmail.com> wrote:
> >
> > As long as we don't lose those mmap events due to memory pressure/lost
> > events and we have timestamps to order it all before lookups, yeah
> > should work.  
> 
> The main reason to lose mmap events that I can see is that you start
> tracing in the middle of running something (for example, tracing
> systemd or some other "started at boot" thing).

Note, for on-demand tracing, the applications are already running before
the tracing starts. That is actually the common case. Yes, people do often
"enable tracing, run my code, stop tracing", but in most of the use cases I
deal with, it's "we are noticing something in the field, start tracing,
issue gets hit, stop tracing", where the applications we are monitoring are
already running when the tracing started. Just tracing the mmap when it
happens will not be useful for us.

Not to mention, in the future, this will also have to work with JIT. I was
thinking of using 64 bit hashes in the stack trace, where the top bits are
reserved for context (is this a file, or something dynamically created).

> 
> Then you'd not have any record of an actual mmap at all because it
> happened before you started tracing, even if there is no memory
> pressure or other thing going on.
> 
> That is not necessarily a show-stopper: you could have some fairly
> simple count for "how many times have I seen this hash", and add a
> "mmap reminder" event (which would just be the exact same thing as the
> regular mmap event).

I thought about clearing the file cache periodically, if for no other
reason than dropped events where the mapping is lost.

This is why I'm looking at clearing on "unmap". Yes, we don't care about
unmap, but as soon as an unmap happens, if that value gets used again then
we know it's a new mapping. That is, drop the hashes from the file
cache when they are no longer around.

The idea is this (pseudo code):

 user_stack_trace() {
   foreach vma in each stack frame {
       key = hash(vma->vm_file);
       if (!lookup(key)) {
           trace_file_map(key, generate_path(vma), generate_buildid(vma));
           add_into_hash(key);
       }
   }
 }

On unmapping:

 key = hash(vma->vm_file);
 remove_from_hash(key);

Now if a new mmap happens where the vma->vm_file is reused, the lookup(key)
will return false again and the file_map event will get triggered again.

We don't need to look at the mmap() calls, as those new mappings may never
end up in a user stack trace, and writing them out will just waste space in
the ring buffer.
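
A minimal user-space sketch of that file cache behavior, just to show the
"emit once, forget on unmap, emit again" cycle. Everything here is a
stand-in: the fixed-size array replaces the kernel hash table, the counter
replaces trace_file_map(), and the function names are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 64			/* illustrative size */

static uint32_t cache[CACHE_SLOTS];	/* 0 means "empty slot"; keys must be nonzero */
static unsigned int events_emitted;	/* stands in for trace_file_map() */

static bool cache_lookup(uint32_t key)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (cache[i] == key)
			return true;
	return false;
}

static void cache_add(uint32_t key)
{
	for (int i = 0; i < CACHE_SLOTS; i++) {
		if (cache[i] == 0) {
			cache[i] = key;
			return;
		}
	}
	/* Cache full: the key is not stored, so its event is simply emitted again. */
}

/* Would be called from the unmap path. */
static void cache_remove(uint32_t key)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (cache[i] == key)
			cache[i] = 0;
}

/* Would be called for each frame's file key while recording a user stack trace. */
static void note_file(uint32_t key)
{
	if (!cache_lookup(key)) {
		events_emitted++;	/* trace_file_map(key, path, buildid) */
		cache_add(key);
	}
}
```

Seeing the same key repeatedly emits one event; after cache_remove() the
next sighting of that key emits the event again.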

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 17:57                                     ` Arnaldo Carvalho de Melo
@ 2025-08-29 20:51                                       ` Linus Torvalds
  0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 20:51 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
	bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Fri, 29 Aug 2025 at 10:58, Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> Can't we continue with that idea by using some VMA sequential number that
> doesn't expose anything critical to user space but allows us to match
> stack entries to mmap records?

No such record exists.

I guess we could add an atomic ID to do_mmap() and have a 64-bit value
that would be unique and would follow vma splitting and movement
around.

But it would actually be worse than just using the 'struct file'
pointer, and the only advantage would be avoiding the hashing. And the
disadvantages would be many. In particular, it would be much better if
it was per-file, but we are *definitely* not adding some globally
unique value to each file, because we already have seen performance
issues with open/close loads just from the atomic sequence counters we
used to give to anonymous inodes.

(I say "used to give" - see get_next_ino() for what we do now, with
per-cpu counter sequences. We'd have to do similar tricks for some
kind of 'file ID', and I really don't want to do that with no reason).

And if the only reason is "hashing takes a hundred cycles" when
generating traces, that's just not a reason good enough to bloat core
kernel data structures like 'struct file' and make core ops like
open() more expensive.

          Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 18:11                                     ` Steven Rostedt
@ 2025-08-29 20:54                                       ` Linus Torvalds
  2025-08-29 21:18                                         ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 20:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 11:11, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The idea is this (pseudo code):
>
>  user_stack_trace() {
>    foreach vma in each stack frame:
>        key = hash(vma->vm_file);
>        if (!lookup(key)) {
>            trace_file_map(key, generate_path(vma), generate_buildid(vma));
>            add_into_hash(key);
>        }
>    }

I see *zero* advantage to this. It's only doing stupid things that
cost extra, and only because you don't want to do the smart thing that
I've explained extensively that has *NONE* of these overheads.

Just do the parsing at parse time. End of story.

Or don't do this at all. Just forget the whole thing entirely. Throw
the patch that started this all away, and just DON'T DO THIS.

              Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 20:54                                       ` Linus Torvalds
@ 2025-08-29 21:18                                         ` Steven Rostedt
  2025-08-29 22:40                                           ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 13:54:08 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 11:11, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > The idea is this (pseudo code):
> >
> >  user_stack_trace() {
> >    foreach vma in each stack frame:
> >        key = hash(vma->vm_file);
> >        if (!lookup(key)) {
> >            trace_file_map(key, generate_path(vma), generate_buildid(vma));
> >            add_into_hash(key);
> >        }
> >    }  
> 
> I see *zero* advantage to this. It's only doing stupid things that
> cost extra, and only because you don't want to do the smart thing that
> I've explained extensively that has *NONE* of these overheads.
> 
> Just do the parsing at parse time. End of story.

What does "parsing at parse time" mean?

> 
> Or don't do this at all. Just forget the whole thing entirely. Throw
> the patch that started this all away, and just DON'T DO THIS.

Maybe we are talking past each other.

When I get a user space stack trace, I get the virtual addresses of each of
the user space functions. This is saved into a user stack trace event in
the ring buffer that usually gets mapped right to a file for post
processing.

I still do the:

 user_stack_trace() {
   foreach addr in each stack frame {
      vma = vma_lookup(mm, addr);
      callchain[i++] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
   }
 }

Are you saying that this shouldn't be done either? And to just record the
virtual address in the chain and the vma->vm_start and
vma->vm_pgoff in another event, where the post processing could do the
math? This other event could also record the path and build id.
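
For reference, the arithmetic in that loop as a stand-alone sketch. The
struct only carries the two fields the calculation needs and is not the
kernel's vm_area_struct, the function name is made up, and PAGE_SHIFT is
assumed to be 12 (4 KiB pages):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Only the fields the calculation needs; not the kernel's struct. */
struct vma_sketch {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_pgoff;	/* offset of the mapping within the file, in pages */
};

/* Convert a virtual address inside the mapping to an offset in the file. */
static uint64_t addr_to_file_offset(const struct vma_sketch *vma, uint64_t addr)
{
	return (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
}
```

With a made-up mapping that starts at 0x7f0000028000 and maps the file from
page 0x28 onwards, the address 0x7f0000029ca8 becomes file offset 0x29ca8.
Whether this runs in the kernel or in post processing, the math is the same;
only the place where vm_start and vm_pgoff are available differs.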

The question is, when do I record this vma event? How do I know it's new?

I can't rely too much on other events (like mmap) and such as those events
may have occurred before the tracing started. I have to have some way to
know if the vma has been saved previously, which was why I had the hash
lookup, and only add vma's on new instances.

My main question is, when do I record the vma data event?

-- Steve



* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 21:18                                         ` Steven Rostedt
@ 2025-08-29 22:40                                           ` Linus Torvalds
  2025-08-29 23:09                                             ` Steven Rostedt
  0 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2025-08-29 22:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 14:18, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Just do the parsing at parse time. End of story.
>
> What does "parsing at parse time" mean?

In user space. When parsing the trace events.

Not in kernel space, when generating the events.

Arnaldo already said it was workable.

> When I get a user space stack trace, I get the virtual addresses of each of
> the user space functions. This is saved into an user stack trace event in
> the ring buffer that usually gets mapped right to a file for post
> processing.
>
> I still do the:
>
>  user_stack_trace() {
>    foreach addr each stack frame
>       vma = vma_lookup(mm, addr);
>       callchain[i++] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
>
> Are you saying that this shouldn't be done either?

I'm saying that's ALL that should be done. And then that *ONE* single thing:

     callchain_filehash[i++] = hash(vma->vm_file);

BUT NOTHING ELSE.

None of that trace_file_map() garbage.

None of that add_into_hash() garbage.

NOTHING like that. You don't look at the hash. You don't "register"
it. You don't touch it in any way. You literally just use it as a
value, and user space will figure it out later. At event parsing time.

At most, you could have some trivial optimization to avoid hashing the
same pointer twice, ie have some single-entry cache of "it's still the
same file pointer, I'll just use the same hash I calculated last
time".

And I mean *single*-level, because siphash is fast enough that doing
anything *more* than that is going to be slower than just
re-calculating the hash.

In fact, you should probably do that optimization at the whole
vma_lookup() level, and try to not look up the same vma multiple times
when a common situation is probably that you'll have multiple stack
frames all with entries pointing to the same executable (or library)
mapping. Because "vma_lookup()" is likely about as expensive as the
hashing is.

           Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 22:40                                           ` Linus Torvalds
@ 2025-08-29 23:09                                             ` Steven Rostedt
  2025-08-29 23:42                                               ` Steven Rostedt
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 23:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 15:40:07 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 14:18, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > Just do the parsing at parse time. End of story.  
> >
> > What does "parsing at parse time" mean?  
> 
> In user space. When parsing the trace events.
> 
> Not in kernel space, when generating the events.
> 
> Arnaldo already said it was workable.

Perf does do things differently, as I believe it processes the events as it
reads from the kernel (Arnaldo correct me if I'm wrong).

For the tracefs code, the raw data gets saved directly into a file, and the
processing happens after the fact. If a tool is recording, it still needs a
way to know what those hash values mean, after the tracing is complete.

Same for when the user cats the "trace" file. If the vma's have already
been freed, when this happens, how do we map these hashes from the vma? Do
we need to have trace events in the unmap to trigger them? If tracing is
not recording anymore, those events will be dropped too. Also, we only want
to record the vmas that are in the stack traces. Not just any vma.

> 
> > When I get a user space stack trace, I get the virtual addresses of each of
> > the user space functions. This is saved into an user stack trace event in
> > the ring buffer that usually gets mapped right to a file for post
> > processing.
> >
> > I still do the:
> >
> >  user_stack_trace() {
> >    foreach addr each stack frame
> >       vma = vma_lookup(mm, addr);
> >       callchain[i++] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
> >
> > Are you saying that this shouldn't be done either?  
> 
> I'm saying that's ALL that should be done. And then that *ONE* single thing:
> 
>      callchain_filehash[i++] = hash(vma->vm_file);
> 
> BUT NOTHING ELSE.
> 
> None of that trace_file_map() garbage.
> 
> None of that add_into_hash() garbage.
> 
> NOTHING like that. You don't look at the hash. You don't "register"
> it. You don't touch it in any way. You literally just use it as a
> value, and user space will figure it out later. At event parsing time.

I guess this is where I'm stuck. How does user space know what those hash
values mean? Where does it get the information from?

> 
> At most, you could have some trivial optimization to avoid hashing the
> same pointer twice, ie have some single-entry cache of "it's still the
> same file pointer, I'll just use the same hash I calculated last
> time".
> 
> And I mean *single*-level, because siphash is fast enough that doing
> anything *more* than that is going to be slower than just
> re-calculating the hash.
> 
> In fact, you should probably do that optimization at the whole
> vma_lookup() level, and try to not look up the same vma multiple times
> when a common situation is probably that you'll have multiple stack
> frames all with entries pointing to the same executable (or library)
> mapping. Because "vma_lookup()" is likely about as expensive as the
> hashing is.

Yeah, we could add an optimization to store vma's in the callchain walk to
see if the next call chain belongs to a previous one. Could even just cache
the previous vma, as it's not as common to have one library calling into
another and back again.

That is, this would likely be useful:

  vma = NULL;
  foreach addr in callchain
    if (!vma || addr not in range of vma)
      vma = vma_lookup(addr);
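
Fleshed out as a user-space sketch, with a counter to show how many of the
expensive lookups are avoided. The two mappings and all names are made up,
and slow_lookup() is only a linear-search stand-in for vma_lookup():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vma_sketch {
	uint64_t vm_start;	/* inclusive */
	uint64_t vm_end;	/* exclusive */
};

/* Two made-up mappings standing in for a process address space. */
static const struct vma_sketch mappings[] = {
	{ 0x400000, 0x450000 },			/* "application" */
	{ 0x7f0000000000, 0x7f0000200000 },	/* "libc" */
};

static unsigned int lookups;	/* counts the expensive lookups */

/* Stand-in for vma_lookup(); a linear search keeps the sketch short. */
static const struct vma_sketch *slow_lookup(uint64_t addr)
{
	lookups++;
	for (size_t i = 0; i < sizeof(mappings) / sizeof(mappings[0]); i++)
		if (addr >= mappings[i].vm_start && addr < mappings[i].vm_end)
			return &mappings[i];
	return NULL;
}

/* Walk a callchain, looking up only when addr leaves the previous vma. */
static unsigned int walk(const uint64_t *chain, size_t n)
{
	const struct vma_sketch *vma = NULL;
	unsigned int resolved = 0;

	for (size_t i = 0; i < n; i++) {
		if (!vma || chain[i] < vma->vm_start || chain[i] >= vma->vm_end)
			vma = slow_lookup(chain[i]);
		if (vma)
			resolved++;
	}
	return resolved;
}
```

For a seven-frame chain shaped like "libc, five application frames, libc",
this does three lookups instead of seven.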

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 23:09                                             ` Steven Rostedt
@ 2025-08-29 23:42                                               ` Steven Rostedt
  2025-08-30  0:36                                                 ` Steven Rostedt
  2025-08-30  0:44                                               ` Steven Rostedt
  2025-08-30  0:45                                               ` Linus Torvalds
  2 siblings, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-29 23:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 19:09:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > NOTHING like that. You don't look at the hash. You don't "register"
> > it. You don't touch it in any way. You literally just use it as a
> > value, and user space will figure it out later. At event parsing time.  
> 
> I guess this is where I'm stuck. How does user space know what those hash
> values mean? Where does it get the information from?

This is why I had the stored hash items in a "file_cache". It was the way
to know that a new vma is being used and needs an event to show it.

> 
> That is, this would likely be useful:
> 
>   vma = NULL;
>   foreach addr in callchain
>     if (!vma || addr not in range of vma)
>       vma = vma_lookup(addr);

You already stated that the vma_lookup() and the hash algorithm are very
expensive, but they need to be done anyway. A simple hash lookup is quick
and would be lost in the noise.


   vma = NULL;
   hash = 0;
   foreach addr in callchain {
     if (!vma || addr not in range of vma) {
       vma = vma_lookup(addr);
       hash = get_hash(vma);
     }
     callchain[i] = addr - offset;
     hashes[i] = hash;
   }


I had that get_hash(vma) have something like:


  u32 get_hash(vma) {
     unsigned long ptr = (unsigned long)vma->vm_file;
     u32 hash;

     /* Remove alignment */
     ptr >>= 3;
     hash = siphash_1u32((u32)ptr, &key);

     if (lookup_hash(hash))
        return hash; // already saved

     // The above is the most common case and is quick.
     // Especially compared to vma_lookup() and the hash algorithm

     /* Slow but only happens when a new vma is discovered */
     trigger_event_that_maps_hash_to_file_data(hash, vma);

     /* Doesn't happen again for this hash value */
     save_hash(hash);
     return hash;
  }


This is also where I would have a callback from munmap() to remove the
vmas from this hash table because they are no longer around. And if a new
vma came around with the same vm_file address, it would not be found in the
hash table and would trigger the print again with the hash and the new file
it represents.

This "garbage" was how I implemented the way to let user space know what
the hash values mean.

Otherwise, we need something else to expose to user space what those hashes
mean. And that's where I don't know what you are expecting.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 23:42                                               ` Steven Rostedt
@ 2025-08-30  0:36                                                 ` Steven Rostedt
  0 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-30  0:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 19:42:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>    vma = NULL;
>    hash = 0;
>    foreach addr in callchain
>      if (!vma || addr not in range of vma) {
>        vma = vma_lookup(addr);
>        hash = get_hash(vma);
>      }
>      callchain[i] = addr - offset;
>      hash[i] = hash;
> 
> 
> I had that get_hash(vma) have something like:
> 
> 
>   u32 get_hash(vma) {
>      unsigned long ptr = (unsigned long)vma->vm_file;
>      u32 hash;
> 
>      /* Remove alignment */
>      ptr >>= 3;
>      hash = siphash_1u32((u32)ptr, &key);

Oh, this hash isn't that great, as it did appear to have collisions. But I
saw in vsprintf() it has something like:

#ifdef CONFIG_64BIT
	return (u32)(unsigned long)siphash_1u64((u64)ptr, &key);
#else
	return (u32)siphash_1u32((u32)ptr, &key);
#endif

In the 64 bit version, it uses all the bits to calculate the hash,
and the bottom 32 bits of the result have a rather good spread.
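
A small user-space sketch of one way that truncating the pointer before
hashing can lose information. Note the assumptions: mix64() is the
splitmix64 finalizer used as an unkeyed stand-in, NOT siphash, and the two
pointer values are made up so that they differ only above bit 34:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in mixer (splitmix64 finalizer); NOT siphash and not keyed. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= UINT64_C(0xbf58476d1ce4e5b9);
	x ^= x >> 27;
	x *= UINT64_C(0x94d049bb133111eb);
	x ^= x >> 31;
	return x;
}

/* Shift out the alignment bits, truncate to 32 bits, then hash. */
static uint32_t hash_truncate_first(uint64_t ptr)
{
	return (uint32_t)mix64((uint32_t)(ptr >> 3));
}

/* Hash all 64 bits, then keep the low 32 bits of the result. */
static uint32_t hash_then_truncate(uint64_t ptr)
{
	return (uint32_t)mix64(ptr);
}
```

Two pointers such as 0xffff888000001000 and 0xffff888800001000 are
guaranteed to get the same value from hash_truncate_first(), because the
bits that differ are thrown away before hashing. With hash_then_truncate()
every input bit feeds the hash, so a collision there is only the usual
32 bit chance. This may or may not be what caused the collisions I saw.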

> 
>      if (lookup_hash(hash))
>         return hash; // already saved
> 
>      // The above is the most common case and is quick.
>      // Especially compared to vma_lookup() and the hash algorithm
> 
>      /* Slow but only happens when a new vma is discovered */
>      trigger_event_that_maps_hash_to_file_data(hash, vma);
> 
>      /* Doesn't happen again for this hash value */
>      save_hash(hash);

So this basically creates the output of:

       trace-cmd-1034    [003] .....   142.197674: <user stack unwind>
cookie=300000004
 =>  <000000000008f687> : 0x666220af
 =>  <0000000000014560> : 0x88512fee
 =>  <000000000001f94a> : 0x88512fee
 =>  <000000000001fc9e> : 0x88512fee
 =>  <000000000001fcfa> : 0x88512fee
 =>  <000000000000ebae> : 0x88512fee
 =>  <0000000000029ca8> : 0x666220af
       trace-cmd-1034    [003] ...1.   142.198063: file_cache: hash=0x666220af path=/usr/lib/x86_64-linux-gnu/libc.so.6 build_id={0x10bddb6d,0xf5234181,0xc2f72e26,0x1aa4f797,0x6aa19eda}
       trace-cmd-1034    [003] ...1.   142.198093: file_cache: hash=0x88512fee path=/usr/local/bin/trace-cmd build_id={0x3f399e26,0xf9eb2d4d,0x475fa369,0xf5bb7eeb,0x6244ae85}


Where the first instances of the vma with the values of 0x666220af and
0x88512fee get printed, but from then on, they are not. That is, from then
on, the lookup will return true, and no processing will take place.

And periodically, I could clear the hash cache, so that all vmas get
printed again. But this would be rate limited to not cause performance
issues.


-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 23:09                                             ` Steven Rostedt
  2025-08-29 23:42                                               ` Steven Rostedt
@ 2025-08-30  0:44                                               ` Steven Rostedt
  2025-08-30  0:45                                               ` Linus Torvalds
  2 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-30  0:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 19:09:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Yeah, we could add an optimization to store vma's in the callchain walk to
> see if the next call chain belongs to a previous one. Could even just cache
> the previous vma, as it's not as common to have one library calling into
> another and back again.

Although, it does happen with libc. :-p

cookie=300000004
 =>  <000000000008f687> : 0x666220af
 =>  <0000000000014560> : 0x88512fee
 =>  <000000000001f94a> : 0x88512fee
 =>  <000000000001fc9e> : 0x88512fee
 =>  <000000000001fcfa> : 0x88512fee
 =>  <000000000000ebae> : 0x88512fee
 =>  <0000000000029ca8> : 0x666220af

The 0x666220af is libc, where the first item is (according to objdump):

000000000008f570 <__libc_alloca_cutoff@@GLIBC_PRIVATE>:

And the last one (top of the stack) is:

0000000000029c20 <__libc_init_first@@GLIBC_2.2.5>:

Of course libc starts the application, and then the application will likely
call back into libc. We could optimize for this case with:

  first_vma = NULL;
  vma = NULL;
  foreach addr in callchain
    if (!first_vma)
      vma = first_vma = vma_lookup(addr);
    else if (addr in range of first_vma)
      vma = first_vma;
    else if (addr not in range of vma)
      vma = vma_lookup(addr);
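
As a user-space sketch, assuming the first branch is meant to be a lookup of
the first address. The mappings and names are made up, and slow_lookup() is
only a linear-search stand-in for vma_lookup():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vma_sketch {
	uint64_t vm_start;	/* inclusive */
	uint64_t vm_end;	/* exclusive */
};

/* Made-up mappings: an application and libc. */
static const struct vma_sketch mappings[] = {
	{ 0x400000, 0x450000 },
	{ 0x7f0000000000, 0x7f0000200000 },
};

static unsigned int lookups;	/* number of expensive lookups performed */

static bool in_range(const struct vma_sketch *vma, uint64_t addr)
{
	return vma && addr >= vma->vm_start && addr < vma->vm_end;
}

/* Stand-in for vma_lookup(); a linear search keeps the sketch short. */
static const struct vma_sketch *slow_lookup(uint64_t addr)
{
	lookups++;
	for (size_t i = 0; i < sizeof(mappings) / sizeof(mappings[0]); i++)
		if (in_range(&mappings[i], addr))
			return &mappings[i];
	return NULL;
}

/* Remember the first vma of the chain as well as the previous one. */
static unsigned int walk_first_vma(const uint64_t *chain, size_t n)
{
	const struct vma_sketch *first_vma = NULL, *vma = NULL;
	unsigned int resolved = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t addr = chain[i];

		if (!first_vma)
			vma = first_vma = slow_lookup(addr);
		else if (in_range(first_vma, addr))
			vma = first_vma;
		else if (!in_range(vma, addr))
			vma = slow_lookup(addr);
		if (vma)
			resolved++;
	}
	return resolved;
}
```

For a chain that starts and ends in libc with only application frames in
between, this needs two lookups: one for libc and one for the application.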

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-29 23:09                                             ` Steven Rostedt
  2025-08-29 23:42                                               ` Steven Rostedt
  2025-08-30  0:44                                               ` Steven Rostedt
@ 2025-08-30  0:45                                               ` Linus Torvalds
  2025-08-30  1:20                                                 ` Steven Rostedt
  2025-08-30 18:31                                                 ` Steven Rostedt
  2 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-30  0:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 at 16:09, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Perf does do things differently, as I believe it processes the events as it
> reads from the kernel (Arnaldo correct me if I'm wrong).
>
> For the tracefs code, the raw data gets saved directly into a file, and the
> processing happens after the fact. If a tool is recording, it still needs a
> way to know what those hash values mean, after the tracing is complete.

But the data IS ALL THERE.

Really. That's the point.

It's there in the same file, it just needs those mmap events, so that
whoever parses it - whether it be perf, or somebody reading some
tracefs code - sees the mmap data, sees the cookies (hash values) that
implies, and then matches those cookies with the subsequent trace
entry cookies.
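
A sketch of what that matching could look like on the parsing side. The
event shape (a hash plus a path) and every name below are assumptions made
for illustration; no such event format is defined in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_FILES 32	/* illustrative limit */

struct file_entry {
	uint32_t hash;
	const char *path;
};

static struct file_entry files[MAX_FILES];
static size_t nr_files;

/* Called when the parser reads an event that carries a hash and a path. */
static void on_map_event(uint32_t hash, const char *path)
{
	for (size_t i = 0; i < nr_files; i++) {
		if (files[i].hash == hash) {
			files[i].path = path;	/* a newer mapping replaces the old one */
			return;
		}
	}
	if (nr_files < MAX_FILES)
		files[nr_files++] = (struct file_entry){ hash, path };
}

/* Called for each stack frame; returns NULL if no event was seen for the hash. */
static const char *resolve(uint32_t hash)
{
	for (size_t i = 0; i < nr_files; i++)
		if (files[i].hash == hash)
			return files[i].path;
	return NULL;
}
```

A frame whose hash was never announced by any event resolves to NULL, which
is the case to watch for when the mapping predates the start of tracing.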

But what it does *NOT* need is munmap() events.

What it does *NOT* need is translating each hash value for each entry
by the kernel, when whoever reads the file can just remember and
re-create it in user space.

I'm done arguing. You're not listening, so I'll just let you know that
I'm not pulling garbage. I've had enough garbage in tracefs, I'm still
smarting from having to fix up the horrendous VFS interfaces, I'm not
going to pull anything that messes up this too.

        Linus


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-30  0:45                                               ` Linus Torvalds
@ 2025-08-30  1:20                                                 ` Steven Rostedt
  2025-08-30  1:26                                                   ` Steven Rostedt
  2025-08-30 18:31                                                 ` Steven Rostedt
  1 sibling, 1 reply; 59+ messages in thread
From: Steven Rostedt @ 2025-08-30  1:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 17:45:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 29 Aug 2025 at 16:09, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Perf does do things differently, as I believe it processes the events as it
> > reads from the kernel (Arnaldo correct me if I'm wrong).
> >
> > For the tracefs code, the raw data gets saved directly into a file, and the
> > processing happens after the fact. If a tool is recording, it still needs a
> > way to know what those hash values mean, after the tracing is complete.  
> 
> But the data IS ALL THERE.

But only in the kernel. How do I expose it?

> 
> Really. That's the point.
> 
> It's there in the same file, it just needs those mmap events that
> whoever parses it - whether it be perf, or somebody reading some

What mmap events are you talking about? Nothing happens to be tracing mmap
events. An interrupt triggered, we want a user space stack trace for that
interrupt, it records the kernel stack trace and a cookie that gets matched
to the user stack trace. It is then deferred until it goes back to user
space and the deferred infrastructure does a callback to the tracer with
the list of addresses that represent the user space call stack.

We do a vma_lookup() to get the vma of each of those addresses. Now we make
some hash that represents that vma for each address. But there has been no
event that maps this vma to what the file is. And the vma's in these
stack traces are a subset of all the vma's. When the user finally gets
around to reading them, the vmas could be long gone. How is user space
supposed to find out what files they belong to?

Do we need to record most events to grab all the vma's and the files they
belong to? Note, one of the constraints to tracing is the buffer size. We
don't want to be recording information that we don't care about.

> tracefs code - sees the mmap data, sees the cookies (hash values) that
> implies, and then matches those cookies with the subsequent trace
> entry cookies.

That was basically what I was doing with the vma hash table. To print out
the vmas as soon as a new one is referenced. It created the event needed,
and only for the vmas we care about.

> 
> But what it does *NOT* need is munmap() events.

This wouldn't be recording munmap events. It would use the unmap event to
callback and remove the vma from the hash when they happened, so that if
they get reused the new ones would be printed. It's no different if we use
munmap or mmap. I could hook into the mmap event instead and check if it is
in the vma hash and if so, either reprint it, or remove it so if the vma is
in a call stack it would get reprinted.

Writing the file for every mmap seems to be a waste of ring buffer space if
the majority of them are not going to be in a stack trace.

> 
> What it does *NOT* need is translating each hash value for each entry
> by the kernel, when whoever reads the file can just remember and
> re-create it in user space.

What's reading the files? The applications that are being traced?

> 
> I'm done arguing. You're not listening, so I'll just let you know that

I am listening. I'm just not understanding you.

> I'm not pulling garbage. I've had enough garbage in tracefs, I'm still
> smarting from having to fix up the horrendous VFS interfaces, I'm not
> going to pull anything that messes up this too.

I know you keep bringing up the tracefs eventfs issue. Hey, I asked for
help with that when I first started it. I was basically told by some of the
VFS folks (I'm not going to name names) that "don't worry, if it works it's
fine". I was very worried that I wasn't doing it right. And it wasn't until
you got involved where you were the first one to tell me that using dentry
outside of VFS was a bad idea. Most of our arguing then was because I
didn't understand that. That also led to the "garbage" code you had to fix
up.

So keep bringing that up. It just shows how much tribal knowledge is
needed to work in the kernel. Heck, the VFS folks are still arguing about
how to handle things like kernfs. Which is similar to the eventfs issue.
And that boils down to things like kernfs, eventfs and procfs having a
fundamental difference from all other file systems. And that is, they are a
file interface to the kernel itself, and not some external source. I realized
this during our arguments over eventfs. You do a write or read from a file,
and unlike other file systems, those actions trigger kernel functions
outside of vfs. But this is another topic altogether, and I only brought it
up because you did.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-30  1:20                                                 ` Steven Rostedt
@ 2025-08-30  1:26                                                   ` Steven Rostedt
  0 siblings, 0 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-30  1:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 21:20:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > I'm done arguing. You're not listening, so I'll just let you know that  
> 
> I am listening. I'm just not understanding you.

BTW, I'm not arguing with you. I'm really trying hard to figure out what it
is that you want me to do. I'm looking for that "don't use dentry outside
of VFS" moment.

I get we have in the call stack the offsets of the file and a magical hash
value that represents that vma.

What I don't get is what user space is supposed to match that magical hash
value to.

Do you want me to trace all mmaps and trigger an event for them that show
the hash value and the path names?

If that's the case, what do I do about the major use case of tracing an
application after it has mapped all its memory to files?

What about the ring buffer space wasted on recording every mmap when the
majority of them will not be used? That could risk dropping events for the
mmaps we care about.

Again, I'm not arguing with you. I'm trying to figure out what you are
suggesting.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-30  0:45                                               ` Linus Torvalds
  2025-08-30  1:20                                                 ` Steven Rostedt
@ 2025-08-30 18:31                                                 ` Steven Rostedt
  2025-08-30 19:03                                                   ` Arnaldo Carvalho de Melo
  2025-08-30 19:03                                                   ` Linus Torvalds
  1 sibling, 2 replies; 59+ messages in thread
From: Steven Rostedt @ 2025-08-30 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Fri, 29 Aug 2025 17:45:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But what it does *NOT* need is munmap() events.
> 
> What it does *NOT* need is translating each hash value for each entry
> by the kernel, when whoever reads the file can just remember and
> re-create it in user space.

If we are going to rely on mmap, then we might as well get rid of the
vma_lookup() altogether. The mmap event will have the mapping of the
file to the actual virtual address.

If we add a tracepoint at mmap that records the path and the address as
well as the permissions of the mapping, then the tracer could
trace only those addresses that are executable.

To handle missed events, on start of tracing, trigger the mmap event
for the executable sections of every currently running task, and
that will allow the tracer to see where the files are mapped.

After that, the stack traces can go back to just showing the virtual
addresses of the user space stack without doing anything else. Let the
tracer map the task's memory to all the mmaps that happened and translate
it that way.
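
As a rough illustration of that translation step, here is a minimal
user-space sketch in C. It assumes the tracer has collected mmap records
into a simple table; the struct, field and function names are hypothetical
and not part of any existing tracing API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record of one file-backed mapping, as a tracer might keep
 * it after seeing an mmap event. All names here are illustrative only.
 */
struct map_rec {
	uint64_t start;   /* first virtual address of the mapping */
	uint64_t end;     /* one past the last virtual address */
	uint64_t pgoff;   /* byte offset into the file where the mapping starts */
	const char *path; /* path recorded by the mmap event */
};

/* Return the record that covers addr, or NULL if no record does. */
static const struct map_rec *map_find(const struct map_rec *maps, size_t n,
				      uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= maps[i].start && addr < maps[i].end)
			return &maps[i];
	return NULL;
}

/*
 * Translate a user space virtual address into a path and file offset.
 * Returns 0 on success, -1 if the address is not covered by any record.
 */
static int addr_to_file_offset(const struct map_rec *maps, size_t n,
			       uint64_t addr, const char **path,
			       uint64_t *offset)
{
	const struct map_rec *m = map_find(maps, n, addr);

	if (!m)
		return -1;
	*path = m->path;
	*offset = addr - m->start + m->pgoff;
	return 0;
}
```

The result is only as good as the table: it holds only while the records
match the task's current address space.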

The downside is that there may be a lot of information to record. But
the tracer could choose which task maps to trace via filters and if it's
tracing all tasks, it just needs to make sure its buffer is big enough.

-- Steve


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-30 18:31                                                 ` Steven Rostedt
@ 2025-08-30 19:03                                                   ` Arnaldo Carvalho de Melo
  2025-08-30 19:03                                                   ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-08-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt, Linus Torvalds
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell



On August 30, 2025 3:31:14 PM GMT-03:00, Steven Rostedt <rostedt@goodmis.org> wrote:
>On Fri, 29 Aug 2025 17:45:39 -0700
>Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> But what it does *NOT* need is munmap() events.
>> 
>> What it does *NOT* need is translating each hash value for each entry
>> by the kernel, when whoever reads the file can just remember and
>> re-create it in user space.
>
>If we are going to rely on mmap, then we might as well get rid of the
>vma_lookup() altogether. The mmap event will have the mapping of the
>file to the actual virtual address.
>
>If we add a tracepoint at mmap that records the path and the address as
>well as the permissions of the mapping, then the tracer could
>trace only those addresses that are executable.
>

This is PERF_RECORD_MMAP2 (PERF_RECORD_MMAP had just the filename):

<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/perf_event.h#n1057>

>To handle missed events, on start of tracing, trigger the mmap event
>for the executable sections of every currently running task, and
>that will allow the tracer to see where the files are mapped.

Perf does synthesize the needed mmap events by traversing procfs, if needed.

Jiri at some point toyed with BPF iterators to do as you suggest: from the kernel iterate task structs and generate the PERF_RECORD_MMAP2 for preexisting processes.
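
To make the procfs traversal concrete, below is a minimal sketch in C of
parsing one line of /proc/<pid>/maps into the fields a synthesized mmap
record would need. This is not perf's code; the struct and function names
are made up for illustration, and paths longer than the buffer are
truncated.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one /proc/<pid>/maps line. */
struct maps_entry {
	unsigned long start, end;         /* virtual address range */
	unsigned long offset;             /* offset into the file */
	unsigned long inode;              /* inode number as shown in maps */
	unsigned int dev_major, dev_minor;/* device, printed in hex in maps */
	char perms[5];                    /* e.g. "r-xp" */
	char path[256];                   /* empty for anonymous mappings */
};

/* Parse "start-end perms offset maj:min inode path". Returns 0 or -1. */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
	int consumed = 0;
	size_t len;
	int n;

	n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %n",
		   &e->start, &e->end, e->perms, &e->offset,
		   &e->dev_major, &e->dev_minor, &e->inode, &consumed);
	if (n < 7)
		return -1;

	/* Whatever follows the inode (minus the newline) is the path. */
	e->path[0] = '\0';
	if (consumed > 0) {
		len = strcspn(line + consumed, "\n");
		if (len >= sizeof(e->path))
			len = sizeof(e->path) - 1;
		memcpy(e->path, line + consumed, len);
		e->path[len] = '\0';
	}
	return 0;
}

/* Only executable mappings matter for user space stack traces. */
static int maps_entry_is_exec(const struct maps_entry *e)
{
	return e->perms[2] == 'x';
}
```

A tool would call this for each line read from the maps file of every
preexisting process and keep only the entries where maps_entry_is_exec()
is true.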

>After that, the stack traces can go back to just showing the virtual
>addresses of the user space stack without doing anything else. Let the
>tracer map the task's memory to all the mmaps that happened and translate
>it that way.
>
>The downside is that there may be a lot of information to record.

It is, but for system wide cases, etc. Want to see it all? There's a cost...

>But
>the tracer could choose which task maps to trace via filters and if it's
>tracing all tasks, it just needs to make sure its buffer is big enough.
>
>-- Steve

- Arnaldo


* Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-08-30 18:31                                                 ` Steven Rostedt
  2025-08-30 19:03                                                   ` Arnaldo Carvalho de Melo
@ 2025-08-30 19:03                                                   ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2025-08-30 19:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, linux-kernel,
	linux-trace-kernel, bpf, x86, Masami Hiramatsu, Mathieu Desnoyers,
	Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

On Sat, 30 Aug 2025 at 11:31, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> If we are going to rely on mmap, then we might as well get rid of the
> vma_lookup() altogether. The mmap event will have the mapping of the
> file to the actual virtual address.

It actually won't - not unless you also track every mremap etc.

Which is certainly doable, but I'd argue that it's a lot of complexity.

All you really want is an ID for the file mapping, and yes, I agree
that it's very very annoying that we don't have anything that can then
be correlated to user space any other way than also having a stage
that tracks mmap.

I've slept on it and tried to come up with something, and I can't. As
mentioned, the inode->i_ino isn't actually exposed to user space as
such at all for some common filesystems, so while it's very
traditional, it really doesn't actually work. It's also almost
impossible to turn into a path, which is what you often would want for
many cases.

That said, having slept on it, I'm starting to come around to the
inode number model, not because I think it's a good model - it really
isn't - but because it's a very historical mistake.

And in particular, it's the same mistake we made in /proc/<xyz>/maps.

So I think it's very very wrong, but it does have the advantage that
it's a number that we already do export.

But the inode we expose that way isn't actually the
'vma->vm_file->f_inode' as you'd think, it's actually

        inode = file_user_inode(vma->vm_file);

which is subtly different for the backing inode case (ie overlayfs).

Oh, how I dislike that thing, but using the same thing as
/proc/<xyz>/maps does avoid some problems.
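
For reference, a minimal sketch in C of how a user space tool might build
the same kind of device major:minor plus inode key for a file it finds on
disk, so that it can be compared against the numbers in a trace or in
/proc/<xyz>/maps. The struct and helper names are hypothetical, and as
described above the numbers that stat() reports may not match on every
filesystem.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical lookup key: device major:minor plus inode number. */
struct file_key {
	unsigned int dev_major;
	unsigned int dev_minor;
	unsigned long long inode;
};

/* Fill in the key for path using stat(). Returns 0 on success, -1 on error. */
static int file_key_from_path(const char *path, struct file_key *key)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	key->dev_major = major(st.st_dev);
	key->dev_minor = minor(st.st_dev);
	key->inode = (unsigned long long)st.st_ino;
	return 0;
}

/* Two keys refer to the same file only if all three numbers agree. */
static int file_key_equal(const struct file_key *a, const struct file_key *b)
{
	return a->dev_major == b->dev_major &&
	       a->dev_minor == b->dev_minor &&
	       a->inode == b->inode;
}
```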

                Linus


end of thread (2025-08-30 19:04 UTC)

Thread overview: 59+ messages
2025-08-28 18:03 [PATCH v6 0/6] tracing: Deferred unwinding of user space stack traces Steven Rostedt
2025-08-28 18:03 ` [PATCH v6 1/6] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
2025-08-28 18:03 ` [PATCH v6 2/6] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
2025-08-28 18:03 ` [PATCH v6 3/6] tracing: Implement deferred user space stacktracing Steven Rostedt
2025-08-28 18:03 ` [PATCH v6 4/6] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
2025-08-28 18:03 ` [PATCH v6 5/6] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
2025-08-28 18:39   ` Linus Torvalds
2025-08-28 18:58     ` Arnaldo Carvalho de Melo
2025-08-28 19:02       ` Mathieu Desnoyers
2025-08-28 19:18       ` Linus Torvalds
2025-08-28 20:04         ` Arnaldo Carvalho de Melo
2025-08-28 20:37           ` Linus Torvalds
2025-08-28 20:17         ` Steven Rostedt
2025-08-28 20:27           ` Arnaldo Carvalho de Melo
2025-08-28 20:42             ` Linus Torvalds
2025-08-28 20:51             ` Steven Rostedt
2025-08-28 21:00               ` Arnaldo Carvalho de Melo
2025-08-28 21:27                 ` Steven Rostedt
2025-08-29 16:27                 ` Sam James
2025-08-28 20:38           ` Linus Torvalds
2025-08-28 20:48             ` Steven Rostedt
2025-08-28 21:06               ` Linus Torvalds
2025-08-28 21:17                 ` Steven Rostedt
2025-08-28 22:10                   ` Linus Torvalds
2025-08-28 22:44                     ` Steven Rostedt
2025-08-29 15:06                     ` Steven Rostedt
2025-08-29 15:47                       ` Linus Torvalds
2025-08-29 16:07                         ` Linus Torvalds
2025-08-29 16:33                           ` Steven Rostedt
2025-08-29 16:42                             ` Linus Torvalds
2025-08-29 16:50                               ` Linus Torvalds
2025-08-29 17:02                                 ` Steven Rostedt
2025-08-29 17:13                                   ` Linus Torvalds
2025-08-29 17:57                                     ` Arnaldo Carvalho de Melo
2025-08-29 20:51                                       ` Linus Torvalds
2025-08-29 16:57                               ` Steven Rostedt
2025-08-29 17:02                                 ` Linus Torvalds
2025-08-29 17:52                                   ` Steven Rostedt
2025-08-29 16:19                         ` Steven Rostedt
2025-08-29 16:28                           ` Linus Torvalds
2025-08-29 16:49                             ` Steven Rostedt
2025-08-29 16:59                               ` Linus Torvalds
2025-08-29 17:17                                 ` Arnaldo Carvalho de Melo
2025-08-29 17:33                                   ` Linus Torvalds
2025-08-29 18:11                                     ` Steven Rostedt
2025-08-29 20:54                                       ` Linus Torvalds
2025-08-29 21:18                                         ` Steven Rostedt
2025-08-29 22:40                                           ` Linus Torvalds
2025-08-29 23:09                                             ` Steven Rostedt
2025-08-29 23:42                                               ` Steven Rostedt
2025-08-30  0:36                                                 ` Steven Rostedt
2025-08-30  0:44                                               ` Steven Rostedt
2025-08-30  0:45                                               ` Linus Torvalds
2025-08-30  1:20                                                 ` Steven Rostedt
2025-08-30  1:26                                                   ` Steven Rostedt
2025-08-30 18:31                                                 ` Steven Rostedt
2025-08-30 19:03                                                   ` Arnaldo Carvalho de Melo
2025-08-30 19:03                                                   ` Linus Torvalds
2025-08-28 18:03 ` [PATCH v6 6/6] tracing: Add an event to map the inodes to their file names Steven Rostedt
