[PATCH 0/7] tracing: Show contents of syscall trace event user space fields

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] tracing: Show contents of syscall trace event user space fields
@ 2025-08-05 19:26 Steven Rostedt
  2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.

Introduce a way to read from user space addresses from the syscall trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.

When a syscall trace event needs to read user memory, it reads a per CPU
counter that gets updated every time a user space task is scheduled on the
CPU. It then enables preemption, copies the user space memory into this
buffer, then disables preemption again. If the counter is less than two
from its original value the buffer is valid. Otherwise it needs to try
again.

The reason to check for less than two and not equal to the previous value
is because scheduling kernel tasks is fine. Only user space tasks will
write to this buffer. If the task schedules out and only kernel tasks run
and the tasks schedules back in, the counter will be incremented by one.

A new file is created in the tracefs directory (and also per instance) that
allows the user to shorten the amount copied from user space. It can be
completely disabled if set to zero (it will only display "" or (, ...)
but no copying from user space will be performed). The max size to copy is
hard coded to 128, which should be enough for this purpose.

This allows the output to look like this:

 sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
 sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
 sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)
 sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
 sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)

Steven Rostedt (7):
      tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
      tracing: Have syscall trace events show "0x" for values greater than 10
      tracing: Have syscall trace events read user space string
      tracing: Have system call events record user array data
      tracing: Display some syscall arrays as strings
      tracing: Allow syscall trace events to read more than one user parameter
      tracing: Add syscall_user_buf_size to limit amount written

----
 Documentation/trace/ftrace.rst |   7 +
 include/trace/syscall.h        |   8 +-
 kernel/trace/Kconfig           |  13 +
 kernel/trace/trace.c           |  52 +++
 kernel/trace/trace.h           |   3 +
 kernel/trace/trace_syscalls.c  | 697 +++++++++++++++++++++++++++++++++++++++--
 6 files changed, 750 insertions(+), 30 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 18:39   ` Paul E. McKenney
  2025-08-06 22:05   ` kernel test robot
  2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard,
	Paul E. McKenney

From: Steven Rostedt <rostedt@goodmis.org>

The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.

The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.

Currently it uses an rcu_dereference_sched() to get this pointer and a
rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
as the file pointer will not go away outside the synchronization of the
tracepoint logic itself. And this code adds no extra RCU synchronization
that uses this.

Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_syscalls.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 46aab0ab9350..3a0b65f89130 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
-	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
-	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
+	trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
 	if (!trace_file)
 		return;
 
@@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
-	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
-	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
+	trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
 	if (!trace_file)
 		return;
 
@@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 	if (!tr->sys_refcount_enter)
 		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
 	if (!ret) {
-		rcu_assign_pointer(tr->enter_syscall_files[num], file);
+		WRITE_ONCE(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
 	}
 	mutex_unlock(&syscall_trace_lock);
@@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
 		return;
 	mutex_lock(&syscall_trace_lock);
 	tr->sys_refcount_enter--;
-	RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
+	WRITE_ONCE(tr->enter_syscall_files[num], NULL);
 	if (!tr->sys_refcount_enter)
 		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
 	mutex_unlock(&syscall_trace_lock);
@@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 	if (!tr->sys_refcount_exit)
 		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
 	if (!ret) {
-		rcu_assign_pointer(tr->exit_syscall_files[num], file);
+		WRITE_ONCE(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
 	}
 	mutex_unlock(&syscall_trace_lock);
@@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
 		return;
 	mutex_lock(&syscall_trace_lock);
 	tr->sys_refcount_exit--;
-	RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
+	WRITE_ONCE(tr->exit_syscall_files[num], NULL);
 	if (!tr->sys_refcount_exit)
 		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
 	mutex_unlock(&syscall_trace_lock);
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
  2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 10:24   ` Douglas Raillard
  2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

Currently the syscall trace events show each value as hexadecimal, but
without adding "0x" it can be confusing:

   sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)

Looks like the above write wrote 44 bytes, when in reality it wrote 68
bytes.

Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
For values less than 10, leave off the "0x" as that just adds noise to the
output.

Also change the iterator to check if "i" is nonzero and print the ", "
delimiter at the start, then adding the logic to the trace_seq_printf() at
the end.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_syscalls.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 3a0b65f89130..0f932b22f9ec 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 		if (trace_seq_has_overflowed(s))
 			goto end;
 
+		if (i)
+			trace_seq_puts(s, ", ");
+
 		/* parameter types */
 		if (tr && tr->trace_flags & TRACE_ITER_VERBOSE)
 			trace_seq_printf(s, "%s ", entry->types[i]);
 
 		/* parameter values */
-		trace_seq_printf(s, "%s: %lx%s", entry->args[i],
-				 trace->args[i],
-				 i == entry->nb_args - 1 ? "" : ", ");
+		if (trace->args[i] < 10)
+			trace_seq_printf(s, "%s: %lu", entry->args[i],
+					 trace->args[i]);
+		else
+			trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
+					 trace->args[i]);
 	}
 
 	trace_seq_putc(s, ')');
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/7] tracing: Have syscall trace events read user space string
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
  2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
  2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 14:39   ` kernel test robot
  2025-08-11  5:19   ` kernel test robot
  2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.

Introduce a way to read strings that are nul terminated into the trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.

When a syscall trace event needs to read user memory, it reads a per CPU
counter that gets updated every time a user space task is scheduled on the
CPU. It then enables preemption, copies the user space memory into this
buffer, then disables preemption again. If the counter is less than two
from its original value the buffer is valid. Otherwise it needs to try
again.

The reason to check for less than two and not equal to the previous value
is because scheduling kernel tasks is fine. Only user space tasks will
write to this buffer. If the task schedules out and only kernel tasks run
and the tasks schedules back in, the counter will be incremented by one.

The syscall event has its nb_args shorten from an int to a short (where
even u8 is plenty big enough) and the freed two bytes are used for
"user_mask".  The new "user_mask" field is used to store the index of the
"args" field array that has the address to read from user space. This
value is set to 0 if the system call event does not need to read user
space for a field. This mask can be used to know if the event may fault or
not. Only one bit set in user_mask is supported at this time.

This allows the output to look like this:

 sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
 sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/trace/syscall.h       |   4 +-
 kernel/trace/trace_syscalls.c | 496 ++++++++++++++++++++++++++++++++--
 2 files changed, 480 insertions(+), 20 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 8e193f3a33b3..85f21ca15a41 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
  * @name: name of the syscall
  * @syscall_nr: number of the syscall
  * @nb_args: number of parameters it takes
+ * @user_mask: mask of @args that will read user space
  * @types: list of types as strings
  * @args: list of args as strings (args[i] matches types[i])
  * @enter_fields: list of fields for syscall_enter trace event
@@ -25,7 +26,8 @@
 struct syscall_metadata {
 	const char	*name;
 	int		syscall_nr;
-	int		nb_args;
+	short		nb_args;
+	short		user_mask;
 	const char	**types;
 	const char	**args;
 	struct list_head enter_fields;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 0f932b22f9ec..3233319ce266 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -10,6 +10,8 @@
 #include <linux/xarray.h>
 #include <asm/syscall.h>
 
+#include <trace/events/sched.h>
+
 #include "trace_output.h"
 #include "trace.h"
 
@@ -123,6 +125,9 @@ const char *get_syscall_name(int syscall)
 	return entry->name;
 }
 
+/* Added to user strings when max limit is reached */
+#define EXTRA "..."
+
 static enum print_line_t
 print_syscall_enter(struct trace_iterator *iter, int flags,
 		    struct trace_event *event)
@@ -132,7 +137,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_enter *trace;
 	struct syscall_metadata *entry;
-	int i, syscall;
+	int i, syscall, val;
+	unsigned char *ptr;
 
 	trace = (typeof(trace))ent;
 	syscall = trace->nr;
@@ -167,6 +173,17 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 		else
 			trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
 					 trace->args[i]);
+
+		if (!(BIT(i) & entry->user_mask))
+			continue;
+
+		/* This arg points to a user space string */
+		ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+		val = *(int *)ptr;
+
+		ptr = (void *)ent + (val & 0xffff);
+
+		trace_seq_printf(s, " \"%s\"", ptr);
 	}
 
 	trace_seq_putc(s, ')');
@@ -223,15 +240,27 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 
 	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
 	for (i = 0; i < entry->nb_args; i++) {
-		pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx%s",
-				entry->args[i], sizeof(unsigned long),
-				i == entry->nb_args - 1 ? "" : ", ");
+		if (i)
+			pos += snprintf(buf + pos, LEN_OR_ZERO, ", ");
+		pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx",
+				entry->args[i], sizeof(unsigned long));
+
+		if (!(BIT(i) & entry->user_mask))
+			continue;
+
+		/* Add the format for the user space string */
+		pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
 	}
 	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
 
 	for (i = 0; i < entry->nb_args; i++) {
 		pos += snprintf(buf + pos, LEN_OR_ZERO,
 				", ((unsigned long)(REC->%s))", entry->args[i]);
+		if (!(BIT(i) & entry->user_mask))
+			continue;
+		/* The user space string for arg has name __<arg>_val */
+		pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+				entry->args[i]);
 	}
 
 #undef LEN_OR_ZERO
@@ -277,8 +306,12 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
 {
 	struct syscall_trace_enter trace;
 	struct syscall_metadata *meta = call->data;
+	unsigned long mask;
+	char *arg;
 	int offset = offsetof(typeof(trace), args);
+	int idx;
 	int ret = 0;
+	int len;
 	int i;
 
 	for (i = 0; i < meta->nb_args; i++) {
@@ -291,9 +324,252 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
 		offset += sizeof(unsigned long);
 	}
 
+	if (ret || !meta->user_mask)
+		return ret;
+
+	mask = meta->user_mask;
+	idx = ffs(mask) - 1;
+
+	/*
+	 * User space strings are faulted into a temporary buffer and then
+	 * added as a dynamic string to the end of the event.
+	 * The user space string name for the arg pointer is "__<arg>_val".
+	 */
+	len = strlen(meta->args[idx]) + sizeof("___val");
+	arg = kmalloc(len, GFP_KERNEL);
+	if (WARN_ON_ONCE(!arg)) {
+		meta->user_mask = 0;
+		return -ENOMEM;
+	}
+
+	snprintf(arg, len, "__%s_val", meta->args[idx]);
+
+	ret = trace_define_field(call, "__data_loc char[]",
+				 arg, offset, sizeof(int), 0,
+				 FILTER_OTHER);
+	if (ret)
+		kfree(arg);
 	return ret;
 }
 
+struct syscall_buf {
+	char		*buf;
+};
+
+struct syscall_buf_info {
+	struct rcu_head			rcu;
+	struct syscall_buf __percpu	*sbuf;
+};
+
+/* Create a per CPU temporary buffer to copy user space pointers into */
+#define SYSCALL_FAULT_BUF_SZ 512
+static struct syscall_buf_info *syscall_buffer;
+static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
+
+static int syscall_fault_buffer_cnt;
+
+static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
+{
+	char *buf;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		buf = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
+		kfree(buf);
+	}
+	kfree(sinfo);
+}
+
+static void rcu_free_syscall_buffer(struct rcu_head *rcu)
+{
+	struct syscall_buf_info *sinfo = container_of(rcu, struct syscall_buf_info, rcu);
+
+	syscall_fault_buffer_free(sinfo);
+}
+
+/*
+ * The per CPU buffer syscall_fault_buffer is written to optimstically.
+ * The counter sched_switch_cnt is taken, preemption is enabled,
+ * the copying of the user space memory is placed into the syscall_fault_buffer,
+ * Preeption is re-enabled and the count is read again. If the count is greater
+ * than one from its previous reading, it means that another user space
+ * task scheduled in and the buffer is unreliable for use.
+ */
+static void
+probe_sched_switch(void *ignore, bool preempt,
+		   struct task_struct *prev, struct task_struct *next,
+		   unsigned int prev_state)
+{
+	/*
+	 * The buffer can only be corrupted by another user space task.
+	 * Ignore kernel tasks that may be scheduled in order to process
+	 * the faulting memory.
+	 */
+	if (next->flags & (PF_KTHREAD | PF_USER_WORKER))
+		return;
+
+	this_cpu_inc(sched_switch_cnt);
+}
+
+static int syscall_fault_buffer_enable(void)
+{
+	struct syscall_buf_info *sinfo;
+	char *buf;
+	int cpu;
+	int ret;
+
+	lockdep_assert_held(&syscall_trace_lock);
+
+	if (syscall_fault_buffer_cnt++)
+		return 0;
+
+	sinfo = kmalloc(sizeof(sinfo), GFP_KERNEL);
+	if (!sinfo)
+		return -ENOMEM;
+
+	sinfo->sbuf = alloc_percpu(struct syscall_buf);
+	if (!sinfo->sbuf) {
+		kfree(sinfo);
+		return -ENOMEM;
+	}
+
+	/* Clear each buffer in case of error */
+	for_each_possible_cpu(cpu) {
+		per_cpu_ptr(sinfo->sbuf, cpu)->buf = NULL;
+	}
+
+	for_each_possible_cpu(cpu) {
+		buf = kmalloc_node(SYSCALL_FAULT_BUF_SZ, GFP_KERNEL,
+				   cpu_to_node(cpu));
+		if (!buf) {
+			syscall_fault_buffer_free(sinfo);
+			return -ENOMEM;
+		}
+		per_cpu_ptr(sinfo->sbuf, cpu)->buf = buf;
+	}
+
+	ret = register_trace_sched_switch(probe_sched_switch, NULL);
+	if (ret < 0) {
+		syscall_fault_buffer_free(sinfo);
+		return ret;
+	}
+	WRITE_ONCE(syscall_buffer, sinfo);
+	return 0;
+}
+
+static void syscall_fault_buffer_disable(void)
+{
+	struct syscall_buf_info *sinfo = syscall_buffer;
+
+	lockdep_assert_held(&syscall_trace_lock);
+
+	if (--syscall_fault_buffer_cnt)
+		return;
+
+	WRITE_ONCE(syscall_buffer, NULL);
+
+	unregister_trace_sched_switch(probe_sched_switch, NULL);
+	call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
+}
+
+static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
+			    unsigned long *args, unsigned int *data_size)
+{
+	char *buf = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
+	unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+	unsigned long mask = sys_data->user_mask;
+	unsigned int cnt;
+	int idx = ffs(mask) - 1;
+	char *ptr;
+	int trys = 0;
+	int ret;
+
+	/* Get the pointer to user space memory to read */
+	ptr = (char *)args[idx];
+	*data_size = 0;
+
+ again:
+	/*
+	 * If this task is preempted by another user space task, it
+	 * will cause this task to try again. But just in case something
+	 * changes where the copying from user space causes another task
+	 * to run, prevent this from going into an infinite loop.
+	 * 10 tries should be plenty.
+	 */
+	if (trys++ > 10) {
+		static bool once;
+		/*
+		 * Only print a message instead of a WARN_ON() as this could
+		 * theoretically trigger under real load.
+		 */
+		if (!once)
+			pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
+		once = true;
+		return buf;
+	}
+
+	/* Read the current sched switch count */
+	cnt = this_cpu_read(sched_switch_cnt);
+
+	/*
+	 * Preemption is going to be enabled, but this task must
+	 * remain on this CPU.
+	 */
+	migrate_disable();
+
+	/*
+	 * Now preemption is being enabed and another task can come in
+	 * and use the same buffer and corrupt our data.
+	 */
+	preempt_enable_notrace();
+
+	ret = strncpy_from_user(buf, ptr, size);
+
+	preempt_disable_notrace();
+	migrate_enable();
+
+	/* If it faulted, no use to try again */
+	if (ret < 0)
+		return buf;
+
+	/*
+	 * Preemption is disabled again, now check the sched_switch_cnt.
+	 * If it increased by two or more, then another user space process
+	 * may have schedule in and corrupted our buffer. In that case
+	 * the copying must be retried.
+	 *
+	 * Note, if this task was scheduled out and only kernel threads
+	 * were scheduled in (maybe to process the fault), then the
+	 * counter would increment again when this task scheduled in.
+	 * If this task scheduled out and another user task scheduled
+	 * in, this task would still need to be scheduled back in and
+	 * the counter would increment by at least two.
+	 */
+	if (this_cpu_read(sched_switch_cnt) > cnt + 1)
+		goto again;
+
+	/* Replace any non-printable characters with '.' */
+	for (int i = 0; i < ret; i++) {
+		if (!isprint(buf[i]))
+			buf[i] = '.';
+	}
+
+	/*
+	 * If the text was truncated due to our max limit, add "..." to
+	 * the string.
+	 */
+	if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+		strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+			EXTRA, sizeof(EXTRA));
+		ret = SYSCALL_FAULT_BUF_SZ;
+	} else {
+		buf[ret++] = '\0';
+	}
+
+	*data_size = ret;
+	return buf;
+}
+
 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 {
 	struct trace_array *tr = data;
@@ -302,15 +578,17 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	struct syscall_metadata *sys_data;
 	struct trace_event_buffer fbuffer;
 	unsigned long args[6];
+	char *user_ptr;
+	int user_size = 0;
 	int syscall_nr;
-	int size;
+	int size = 0;
+	bool mayfault;
 
 	/*
 	 * Syscall probe called with preemption enabled, but the ring
 	 * buffer and per-cpu data require preemption to be disabled.
 	 */
 	might_fault();
-	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
@@ -327,7 +605,32 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	if (!sys_data)
 		return;
 
-	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+	/* Check if this syscall event faults in user space memory */
+	mayfault = sys_data->user_mask != 0;
+
+	guard(preempt_notrace)();
+
+	syscall_get_arguments(current, regs, args);
+
+	if (mayfault) {
+		struct syscall_buf_info *sinfo;
+
+		/* If the syscall_buffer is NULL, tracing is being shutdown */
+		sinfo = READ_ONCE(syscall_buffer);
+		if (!sinfo)
+			return;
+
+		user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+		/*
+		 * user_size is the amount of data to append.
+		 * Need to add 4 for the meta field that points to
+		 * the user memory at the end of the event and also
+		 * stores its size.
+		 */
+		size = 4 + user_size;
+	}
+
+	size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
 	entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
 	if (!entry)
@@ -335,9 +638,36 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 	entry = ring_buffer_event_data(fbuffer.event);
 	entry->nr = syscall_nr;
-	syscall_get_arguments(current, regs, args);
+
 	memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
 
+	if (mayfault) {
+		void *ptr;
+		int val;
+
+		/*
+		 * Set the pointer to point to the meta data of the event
+		 * that has information about the stored user space memory.
+		 */
+		ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
+
+		/*
+		 * The meta data will store the offset of the user data from
+		 * the beginning of the event.
+		 */
+		val  = (ptr - (void *)entry) + 4;
+
+		/* Store the offset and the size into the meta data */
+		*(int *)ptr = val | (user_size << 16);
+
+		/* Nothing to do if the user space was empty or faulted */
+		if (user_size) {
+			/* Now store the user space data into the event */
+			ptr += 4;
+			memcpy(ptr, user_ptr, user_size);
+		}
+	}
+
 	trace_event_buffer_commit(&fbuffer);
 }
 
@@ -386,39 +716,50 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 static int reg_event_syscall_enter(struct trace_event_file *file,
 				   struct trace_event_call *call)
 {
+	struct syscall_metadata *sys_data = call->data;
 	struct trace_array *tr = file->tr;
 	int ret = 0;
 	int num;
 
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
+	num = sys_data->syscall_nr;
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return -ENOSYS;
-	mutex_lock(&syscall_trace_lock);
-	if (!tr->sys_refcount_enter)
+	guard(mutex)(&syscall_trace_lock);
+	if (sys_data->user_mask) {
+		ret = syscall_fault_buffer_enable();
+		if (ret)
+			return ret;
+	}
+	if (!tr->sys_refcount_enter) {
 		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
-	if (!ret) {
-		WRITE_ONCE(tr->enter_syscall_files[num], file);
-		tr->sys_refcount_enter++;
+		if (ret < 0) {
+			if (sys_data->user_mask)
+				syscall_fault_buffer_disable();
+			return ret;
+		}
 	}
-	mutex_unlock(&syscall_trace_lock);
-	return ret;
+	WRITE_ONCE(tr->enter_syscall_files[num], file);
+	tr->sys_refcount_enter++;
+	return 0;
 }
 
 static void unreg_event_syscall_enter(struct trace_event_file *file,
 				      struct trace_event_call *call)
 {
+	struct syscall_metadata *sys_data = call->data;
 	struct trace_array *tr = file->tr;
 	int num;
 
-	num = ((struct syscall_metadata *)call->data)->syscall_nr;
+	num = sys_data->syscall_nr;
 	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
 		return;
-	mutex_lock(&syscall_trace_lock);
+	guard(mutex)(&syscall_trace_lock);
 	tr->sys_refcount_enter--;
 	WRITE_ONCE(tr->enter_syscall_files[num], NULL);
 	if (!tr->sys_refcount_enter)
 		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
-	mutex_unlock(&syscall_trace_lock);
+	if (sys_data->user_mask)
+		syscall_fault_buffer_disable();
 }
 
 static int reg_event_syscall_exit(struct trace_event_file *file,
@@ -459,6 +800,121 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
 	mutex_unlock(&syscall_trace_lock);
 }
 
+/*
+ * For system calls that reference user space memory that can
+ * be recorded into the event, set the system call meta data's user_mask
+ * to the "args" index that points to the user space memory to retrieve.
+ */
+static void check_faultable_syscall(struct trace_event_call *call, int nr)
+{
+	struct syscall_metadata *sys_data = call->data;
+
+	/* Only work on entry */
+	if (sys_data->enter_event != call)
+		return;
+
+	switch (nr) {
+	/* user arg at position 0 */
+	case __NR_access:
+	case __NR_acct:
+	case __NR_add_key: /* Just _type. TODO add _description */
+	case __NR_chdir:
+	case __NR_chown:
+	case __NR_chmod:
+	case __NR_chroot:
+	case __NR_creat:
+	case __NR_delete_module:
+	case __NR_execve:
+	case __NR_fsopen:
+	case __NR_getxattr: /* Just pathname, TODO add name */
+	case __NR_lchown:
+	case __NR_lgetxattr: /* Just pathname, TODO add name */
+	case __NR_lremovexattr: /* Just pathname, TODO add name */
+	case __NR_link: /* Just oldname. TODO add newname */
+	case __NR_listxattr: /* Just pathname, TODO add list */
+	case __NR_llistxattr: /* Just pathname, TODO add list */
+	case __NR_lsetxattr: /* Just pathname, TODO add list */
+	case __NR_open:
+	case __NR_memfd_create:
+	case __NR_mount: /* Just dev_name, TODO add dir_name and type */
+	case __NR_mkdir:
+	case __NR_mknod:
+	case __NR_mq_open:
+	case __NR_mq_unlink:
+	case __NR_pivot_root: /* Just new_root, TODO add old_root */
+	case __NR_readlink:
+	case __NR_removexattr: /* Just pathname, TODO add name */
+	case __NR_rename: /* Just oldname. TODO add newname */
+	case __NR_request_key: /* Just _type. TODO add _description */
+	case __NR_rmdir:
+	case __NR_setxattr: /* Just pathname, TODO add list */
+	case __NR_shmdt:
+	case __NR_statfs:
+	case __NR_swapon:
+	case __NR_swapoff:
+	case __NR_symlink: /* Just oldname. TODO add newname */
+	case __NR_truncate:
+	case __NR_unlink:
+	case __NR_umount2:
+	case __NR_utime:
+	case __NR_utimes:
+		sys_data->user_mask = BIT(0);
+		break;
+	/* user arg at position 1 */
+	case __NR_execveat:
+	case __NR_faccessat:
+	case __NR_faccessat2:
+	case __NR_finit_module:
+	case __NR_fchmodat:
+	case __NR_fchmodat2:
+	case __NR_fchownat:
+	case __NR_fgetxattr:
+	case __NR_flistxattr:
+	case __NR_fsetxattr:
+	case __NR_fspick:
+	case __NR_fremovexattr:
+	case __NR_futimesat:
+	case __NR_getxattrat: /* Just pathname, TODO add name */
+	case __NR_inotify_add_watch:
+	case __NR_linkat: /* Just oldname. TODO add newname */
+	case __NR_listxattrat: /* Just pathname, TODO add list */
+	case __NR_mkdirat:
+	case __NR_mknodat:
+	case __NR_mount_setattr:
+	case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
+	case __NR_name_to_handle_at:
+	case __NR_newfstatat:
+	case __NR_openat:
+	case __NR_openat2:
+	case __NR_open_tree:
+	case __NR_open_tree_attr:
+	case __NR_readlinkat:
+	case __NR_renameat: /* Just oldname. TODO add newname */
+	case __NR_renameat2: /* Just oldname. TODO add newname */
+	case __NR_removexattrat: /* Just pathname, TODO add name */
+	case __NR_quotactl:
+	case __NR_setxattrat: /* Just pathname, TODO add list */
+	case __NR_syslog:
+	case __NR_symlinkat: /* Just oldname. TODO add newname */
+	case __NR_statx:
+	case __NR_unlinkat:
+	case __NR_utimensat:
+		sys_data->user_mask = BIT(1);
+		break;
+	/* user arg at position 2 */
+	case __NR_init_module:
+	case __NR_fsconfig:
+		sys_data->user_mask = BIT(2);
+		break;
+	/* user arg at position 4 */
+	case __NR_fanotify_mark:
+		sys_data->user_mask = BIT(4);
+		break;
+	default:
+		sys_data->user_mask = 0;
+	}
+}
+
 static int __init init_syscall_trace(struct trace_event_call *call)
 {
 	int id;
@@ -471,6 +927,8 @@ static int __init init_syscall_trace(struct trace_event_call *call)
 		return -ENOSYS;
 	}
 
+	check_faultable_syscall(call, num);
+
 	if (set_syscall_print_fmt(call) < 0)
 		return -ENOMEM;
 
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/7] tracing: Have system call events record user array data
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

For system call events that have a length field, add a "user_arg_size"
parameter to the system call meta data that denotes the index of the args
array that holds the size of arg that the user_mask field has a bit set
for.

The "user_mask" has a bit set that denotes the arg that points to an array
in the user space address space and if a system call event has the
user_mask field set and the user_arg_size set, it will then record the
content of that address into the trace event, up to the size defined by
SYSCALL_FAULT_BUF_SZ - 1.

This allows the output to look like:

  sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/trace/syscall.h       |   4 +-
 kernel/trace/trace_syscalls.c | 112 ++++++++++++++++++++++++++--------
 2 files changed, 88 insertions(+), 28 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 85f21ca15a41..9413c139da66 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
  * @name: name of the syscall
  * @syscall_nr: number of the syscall
  * @nb_args: number of parameters it takes
+ * @user_arg_size: holds @arg that has size of the user space to read
  * @user_mask: mask of @args that will read user space
  * @types: list of types as strings
  * @args: list of args as strings (args[i] matches types[i])
@@ -26,7 +27,8 @@
 struct syscall_metadata {
 	const char	*name;
 	int		syscall_nr;
-	short		nb_args;
+	u8		nb_args;
+	s8		user_arg_size;
 	short		user_mask;
 	const char	**types;
 	const char	**args;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 3233319ce266..b0a587f2e4b5 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -125,7 +125,7 @@ const char *get_syscall_name(int syscall)
 	return entry->name;
 }
 
-/* Added to user strings when max limit is reached */
+/* Added to user strings or arrays when max limit is reached */
 #define EXTRA "..."
 
 static enum print_line_t
@@ -137,7 +137,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_enter *trace;
 	struct syscall_metadata *entry;
-	int i, syscall, val;
+	int i, syscall, val, len;
 	unsigned char *ptr;
 
 	trace = (typeof(trace))ent;
@@ -183,7 +183,25 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 
 		ptr = (void *)ent + (val & 0xffff);
 
-		trace_seq_printf(s, " \"%s\"", ptr);
+		if (entry->user_arg_size < 0) {
+			trace_seq_printf(s, " \"%s\"", ptr);
+			continue;
+		}
+
+		len = val >> 16;
+
+		val = trace->args[entry->user_arg_size];
+
+		trace_seq_puts(s, " (");
+		for (int x = 0; x < len; x++, ptr++) {
+			if (x)
+				trace_seq_putc(s, ':');
+			trace_seq_printf(s, "%02x", *ptr);
+		}
+		if (len < val)
+			trace_seq_printf(s, ", %s", EXTRA);
+
+		trace_seq_putc(s, ')');
 	}
 
 	trace_seq_putc(s, ')');
@@ -248,8 +266,11 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 		if (!(BIT(i) & entry->user_mask))
 			continue;
 
-		/* Add the format for the user space string */
-		pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+		/* Add the format for the user space string or array */
+		if (entry->user_arg_size < 0)
+			pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+		else
+			pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
 	}
 	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
 
@@ -258,9 +279,14 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 				", ((unsigned long)(REC->%s))", entry->args[i]);
 		if (!(BIT(i) & entry->user_mask))
 			continue;
-		/* The user space string for arg has name __<arg>_val */
-		pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
-				entry->args[i]);
+		/* The user space data for arg has name __<arg>_val */
+		if (entry->user_arg_size < 0) {
+			pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+					entry->args[i]);
+		} else {
+			pos += snprintf(buf + pos, LEN_OR_ZERO, ", __print_dynamic_array(__%s_val, 1)",
+					entry->args[i]);
+		}
 	}
 
 #undef LEN_OR_ZERO
@@ -331,9 +357,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
 	idx = ffs(mask) - 1;
 
 	/*
-	 * User space strings are faulted into a temporary buffer and then
-	 * added as a dynamic string to the end of the event.
-	 * The user space string name for the arg pointer is "__<arg>_val".
+	 * User space data is faulted into a temporary buffer and then
+	 * added as a dynamic string or array to the end of the event.
+	 * The user space data name for the arg pointer is "__<arg>_val".
 	 */
 	len = strlen(meta->args[idx]) + sizeof("___val");
 	arg = kmalloc(len, GFP_KERNEL);
@@ -480,6 +506,7 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	unsigned long mask = sys_data->user_mask;
 	unsigned int cnt;
 	int idx = ffs(mask) - 1;
+	bool array = false;
 	char *ptr;
 	int trys = 0;
 	int ret;
@@ -511,6 +538,18 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	/* Read the current sched switch count */
 	cnt = this_cpu_read(sched_switch_cnt);
 
+	/*
+	 * If this system call event has a size argument, use
+	 * it to define how much of user space memory to read,
+	 * and read it as an array and not a string.
+	 */
+	if (sys_data->user_arg_size >= 0) {
+		array = true;
+		size = args[sys_data->user_arg_size];
+		if (size > SYSCALL_FAULT_BUF_SZ - 1)
+			size = SYSCALL_FAULT_BUF_SZ - 1;
+	}
+
 	/*
 	 * Preemption is going to be enabled, but this task must
 	 * remain on this CPU.
@@ -523,7 +562,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	 */
 	preempt_enable_notrace();
 
-	ret = strncpy_from_user(buf, ptr, size);
+	if (array) {
+		ret = __copy_from_user(buf, ptr, size);
+		ret = ret ? -1 : size;
+	} else {
+		ret = strncpy_from_user(buf, ptr, size);
+	}
 
 	preempt_disable_notrace();
 	migrate_enable();
@@ -548,22 +592,24 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	if (this_cpu_read(sched_switch_cnt) > cnt + 1)
 		goto again;
 
-	/* Replace any non-printable characters with '.' */
-	for (int i = 0; i < ret; i++) {
-		if (!isprint(buf[i]))
-			buf[i] = '.';
-	}
+	/* For strings, replace any non-printable characters with '.' */
+	if (!array) {
+		for (int i = 0; i < ret; i++) {
+			if (!isprint(buf[i]))
+				buf[i] = '.';
+		}
 
-	/*
-	 * If the text was truncated due to our max limit, add "..." to
-	 * the string.
-	 */
-	if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
-		strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
-			EXTRA, sizeof(EXTRA));
-		ret = SYSCALL_FAULT_BUF_SZ;
-	} else {
-		buf[ret++] = '\0';
+		/*
+		 * If the text was truncated due to our max limit, add "..." to
+		 * the string.
+		 */
+		if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+			strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+				EXTRA, sizeof(EXTRA));
+			ret = SYSCALL_FAULT_BUF_SZ;
+		} else {
+			buf[ret++] = '\0';
+		}
 	}
 
 	*data_size = ret;
@@ -660,6 +706,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 		/* Store the offset and the size into the meta data */
 		*(int *)ptr = val | (user_size << 16);
 
+		if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
+			user_size = 0;
+
 		/* Nothing to do if the user space was empty or faulted */
 		if (user_size) {
 			/* Now store the user space data into the event */
@@ -813,7 +862,16 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	if (sys_data->enter_event != call)
 		return;
 
+	sys_data->user_arg_size = -1;
+
 	switch (nr) {
+	/* user arg 1 with size arg at 2 */
+	case __NR_write:
+	case __NR_mq_timedsend:
+	case __NR_pwrite64:
+		sys_data->user_mask = BIT(1);
+		sys_data->user_arg_size = 2;
+		break;
 	/* user arg at position 0 */
 	case __NR_access:
 	case __NR_acct:
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/7] tracing: Display some syscall arrays as strings
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 15:12   ` kernel test robot
  2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
  2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
  6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

Some of the system calls that read a fixed length of memory from the user
space address are not arrays but strings. Take a bit away from the nb_args
field in the syscall meta data to use as a flag to denote that the system
call's user_arg_size is being used as a string. The nb_args should never
be more than 6, so 7 bits is plenty to hold that number. When the
user_arg_is_str flag that, when set, will display the data array from the
user space address as a string and not an array.

This will allow the output to look like this:

  sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/trace/syscall.h       |  4 +++-
 kernel/trace/trace_syscalls.c | 25 +++++++++++++++++++------
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 9413c139da66..0dd7f2b33431 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
  * @name: name of the syscall
  * @syscall_nr: number of the syscall
  * @nb_args: number of parameters it takes
+ * @user_arg_is_str: set if the arg for @user_arg_size is a string
  * @user_arg_size: holds @arg that has size of the user space to read
  * @user_mask: mask of @args that will read user space
  * @types: list of types as strings
@@ -27,7 +28,8 @@
 struct syscall_metadata {
 	const char	*name;
 	int		syscall_nr;
-	u8		nb_args;
+	u8		nb_args:7;
+	u8		user_arg_is_str:1;
 	s8		user_arg_size;
 	short		user_mask;
 	const char	**types;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b0a587f2e4b5..8c0142eea898 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -182,14 +182,13 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 		val = *(int *)ptr;
 
 		ptr = (void *)ent + (val & 0xffff);
+		len = val >> 16;
 
-		if (entry->user_arg_size < 0) {
-			trace_seq_printf(s, " \"%s\"", ptr);
+		if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
+			trace_seq_printf(s, " \"%.*s\"", len, ptr);
 			continue;
 		}
 
-		len = val >> 16;
-
 		val = trace->args[entry->user_arg_size];
 
 		trace_seq_puts(s, " (");
@@ -250,6 +249,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
 static int __init
 __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 {
+	bool is_string = entry->user_arg_is_str;
 	int i;
 	int pos = 0;
 
@@ -267,7 +267,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 			continue;
 
 		/* Add the format for the user space string or array */
-		if (entry->user_arg_size < 0)
+		if (entry->user_arg_size < 0 || is_string)
 			pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
 		else
 			pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
@@ -280,7 +280,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
 		if (!(BIT(i) & entry->user_mask))
 			continue;
 		/* The user space data for arg has name __<arg>_val */
-		if (entry->user_arg_size < 0) {
+		if (entry->user_arg_size < 0 || is_string) {
 			pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
 					entry->args[i]);
 		} else {
@@ -872,6 +872,19 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 		sys_data->user_mask = BIT(1);
 		sys_data->user_arg_size = 2;
 		break;
+	/* user arg 0 with size arg at 1 as string */
+	case __NR_setdomainname:
+	case __NR_sethostname:
+		sys_data->user_mask = BIT(0);
+		sys_data->user_arg_size = 1;
+		sys_data->user_arg_is_str = 1;
+		break;
+	/* user arg 4 with size arg at 3 as string */
+	case __NR_kexec_file_load:
+		sys_data->user_mask = BIT(4);
+		sys_data->user_arg_size = 3;
+		sys_data->user_arg_is_str = 1;
+		break;
 	/* user arg at position 0 */
 	case __NR_access:
 	case __NR_acct:
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 23:52   ` kernel test robot
  2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
  6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

Allow more than one field of a syscall trace event to read user space.
Build on top of the user_mask by allowing more than one bit to be set that
corresponds to the @args array of the syscall metadata. For each argument
in the @args array that is to be read, it will have a dynamic array/string
field associated to it.

Note that multiple fields to be read from user space is not supported if
the user_arg_size field is set in the syscall metada. That field can only
be used if only one field is being read from user space as that field is a
number representing the size field of the syscall event that holds the
size of the data to read from user space. It becomes ambiguous if the
system call reads more than one field. Currently this is not an issue.

If a syscall event happens to enable two events to read user space and
sets the user_arg_size field, it will trigger a warning at boot and the
user_arg_size field will be cleared.

The per CPU buffer that is used to read the user space addresses is now
broken up into 3 sections, each of 168 bytes. The reason for 168 is that
it is the biggest portion of 512 bytes divided by 3 that is 8 byte aligned.

The max amount copied into the ring buffer from user space is now only 128
bytes, which is plenty. When reading user space, it still reads 167
(168-1) bytes and uses the remaining to know if it should append the extra
"..." to the end or not.

This will allow the event to look like this:

  sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_syscalls.c | 312 ++++++++++++++++++++++------------
 1 file changed, 207 insertions(+), 105 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 8c0142eea898..b39fa9dd1067 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -139,6 +139,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct syscall_metadata *entry;
 	int i, syscall, val, len;
 	unsigned char *ptr;
+	int offset = 0;
 
 	trace = (typeof(trace))ent;
 	syscall = trace->nr;
@@ -178,11 +179,12 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 			continue;
 
 		/* This arg points to a user space string */
-		ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+		ptr = (void *)trace->args + sizeof(long) * entry->nb_args + offset;
 		val = *(int *)ptr;
 
 		ptr = (void *)ent + (val & 0xffff);
 		len = val >> 16;
+		offset += 4;
 
 		if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
 			trace_seq_printf(s, " \"%.*s\"", len, ptr);
@@ -335,7 +337,6 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
 	unsigned long mask;
 	char *arg;
 	int offset = offsetof(typeof(trace), args);
-	int idx;
 	int ret = 0;
 	int len;
 	int i;
@@ -354,27 +355,35 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
 		return ret;
 
 	mask = meta->user_mask;
-	idx = ffs(mask) - 1;
 
-	/*
-	 * User space data is faulted into a temporary buffer and then
-	 * added as a dynamic string or array to the end of the event.
-	 * The user space data name for the arg pointer is "__<arg>_val".
-	 */
-	len = strlen(meta->args[idx]) + sizeof("___val");
-	arg = kmalloc(len, GFP_KERNEL);
-	if (WARN_ON_ONCE(!arg)) {
-		meta->user_mask = 0;
-		return -ENOMEM;
-	}
+	while (mask) {
+		int idx = ffs(mask) - 1;
+		mask &= ~BIT(idx);
+
+		/*
+		 * User space data is faulted into a temporary buffer and then
+		 * added as a dynamic string or array to the end of the event.
+		 * The user space data name for the arg pointer is
+		 * "__<arg>_val".
+		 */
+		len = strlen(meta->args[idx]) + sizeof("___val");
+		arg = kmalloc(len, GFP_KERNEL);
+		if (WARN_ON_ONCE(!arg)) {
+			meta->user_mask = 0;
+			return -ENOMEM;
+		}
 
-	snprintf(arg, len, "__%s_val", meta->args[idx]);
+		snprintf(arg, len, "__%s_val", meta->args[idx]);
 
-	ret = trace_define_field(call, "__data_loc char[]",
-				 arg, offset, sizeof(int), 0,
-				 FILTER_OTHER);
-	if (ret)
-		kfree(arg);
+		ret = trace_define_field(call, "__data_loc char[]",
+					 arg, offset, sizeof(int), 0,
+					 FILTER_OTHER);
+		if (ret) {
+			kfree(arg);
+			break;
+		}
+		offset += 4;
+	}
 	return ret;
 }
 
@@ -387,8 +396,25 @@ struct syscall_buf_info {
 	struct syscall_buf __percpu	*sbuf;
 };
 
-/* Create a per CPU temporary buffer to copy user space pointers into */
+/*
+ * Create a per CPU temporary buffer to copy user space pointers into.
+ *
+ * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use
+ * to copy memory from user space addresses into.
+ *
+ * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space.
+ *
+ * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
+ *  It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
+ *  needs to append the EXTRA or not.
+ *
+ * This only allows up to 3 args from system calls.
+ */
 #define SYSCALL_FAULT_BUF_SZ 512
+#define SYSCALL_FAULT_ARG_SZ 168
+#define SYSCALL_FAULT_USER_MAX 128
+#define SYSCALL_FAULT_MAX_CNT 3
+
 static struct syscall_buf_info *syscall_buffer;
 static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
 
@@ -498,22 +524,57 @@ static void syscall_fault_buffer_disable(void)
 	call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
 }
 
-static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
-			    unsigned long *args, unsigned int *data_size)
+static char *sys_fault_user(struct syscall_metadata *sys_data,
+			    struct syscall_buf_info *sinfo,
+			    unsigned long *args,
+			    unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
 {
-	char *buf = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
-	unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+	char *buffer = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
 	unsigned long mask = sys_data->user_mask;
+	unsigned long size = SYSCALL_FAULT_ARG_SZ - 1;
 	unsigned int cnt;
-	int idx = ffs(mask) - 1;
 	bool array = false;
-	char *ptr;
+	char *ptr_array[SYSCALL_FAULT_MAX_CNT];
+	char *buf;
+	int read[SYSCALL_FAULT_MAX_CNT];
 	int trys = 0;
+	int uargs;
 	int ret;
+	int i = 0;
+
+	/* The extra is appended to the user data in the buffer */
+	BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=
+		     SYSCALL_FAULT_ARG_SZ);
+
+	/*
+	 * If this system call event has a size argument, use
+	 * it to define how much of user space memory to read,
+	 * and read it as an array and not a string.
+	 */
+	if (sys_data->user_arg_size >= 0) {
+		array = true;
+		size = args[sys_data->user_arg_size];
+		if (size > SYSCALL_FAULT_ARG_SZ - 1)
+			size = SYSCALL_FAULT_ARG_SZ - 1;
+	}
+
+	while (mask) {
+		int idx = ffs(mask) - 1;
+		mask &= ~BIT(idx);
+
+		if (WARN_ON_ONCE(i == SYSCALL_FAULT_MAX_CNT))
+			break;
+
+		/* Get the pointer to user space memory to read */
+		ptr_array[i++] = (char *)args[idx];
+	}
 
-	/* Get the pointer to user space memory to read */
-	ptr = (char *)args[idx];
-	*data_size = 0;
+	uargs = i;
+
+	/* Clear the values that are not used */
+	for (; i < SYSCALL_FAULT_MAX_CNT; i++) {
+		data_size[i] = -1; /* Denotes no pointer */
+	}
 
  again:
 	/*
@@ -532,24 +593,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 		if (!once)
 			pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
 		once = true;
-		return buf;
+		return buffer;
 	}
 
 	/* Read the current sched switch count */
 	cnt = this_cpu_read(sched_switch_cnt);
 
-	/*
-	 * If this system call event has a size argument, use
-	 * it to define how much of user space memory to read,
-	 * and read it as an array and not a string.
-	 */
-	if (sys_data->user_arg_size >= 0) {
-		array = true;
-		size = args[sys_data->user_arg_size];
-		if (size > SYSCALL_FAULT_BUF_SZ - 1)
-			size = SYSCALL_FAULT_BUF_SZ - 1;
-	}
-
 	/*
 	 * Preemption is going to be enabled, but this task must
 	 * remain on this CPU.
@@ -562,20 +611,23 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	 */
 	preempt_enable_notrace();
 
-	if (array) {
-		ret = __copy_from_user(buf, ptr, size);
-		ret = ret ? -1 : size;
-	} else {
-		ret = strncpy_from_user(buf, ptr, size);
+	buf = buffer;
+
+	for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+		char *ptr = ptr_array[i];
+
+		if (array) {
+			ret = __copy_from_user(buf, ptr, size);
+			ret = ret ? -1 : size;
+		} else {
+			ret = strncpy_from_user(buf, ptr, size);
+		}
+		read[i] = ret;
 	}
 
 	preempt_disable_notrace();
 	migrate_enable();
 
-	/* If it faulted, no use to try again */
-	if (ret < 0)
-		return buf;
-
 	/*
 	 * Preemption is disabled again, now check the sched_switch_cnt.
 	 * If it increased by two or more, then another user space process
@@ -592,28 +644,39 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
 	if (this_cpu_read(sched_switch_cnt) > cnt + 1)
 		goto again;
 
-	/* For strings, replace any non-printable characters with '.' */
-	if (!array) {
-		for (int i = 0; i < ret; i++) {
-			if (!isprint(buf[i]))
-				buf[i] = '.';
-		}
+	buf = buffer;
+	for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
 
-		/*
-		 * If the text was truncated due to our max limit, add "..." to
-		 * the string.
-		 */
-		if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
-			strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
-				EXTRA, sizeof(EXTRA));
-			ret = SYSCALL_FAULT_BUF_SZ;
+		ret = read[i];
+		if (ret < 0)
+			continue;
+		buf[ret] = '\0';
+
+		/* For strings, replace any non-printable characters with '.' */
+		if (!array) {
+			for (int x = 0; x < ret; x++) {
+				if (!isprint(buf[x]))
+					buf[x] = '.';
+			}
+
+			/*
+			 * If the text was truncated due to our max limit,
+			 * add "..." to the string.
+			 */
+			if (ret > SYSCALL_FAULT_USER_MAX) {
+				strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
+					sizeof(EXTRA));
+				ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+			} else {
+				buf[ret++] = '\0';
+			}
 		} else {
-			buf[ret++] = '\0';
+			ret = min(ret, SYSCALL_FAULT_USER_MAX);
 		}
+		data_size[i] = ret;
 	}
 
-	*data_size = ret;
-	return buf;
+	return buffer;
 }
 
 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
@@ -625,9 +688,10 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	struct trace_event_buffer fbuffer;
 	unsigned long args[6];
 	char *user_ptr;
-	int user_size = 0;
+	int user_sizes[SYSCALL_FAULT_MAX_CNT] = {};
 	int syscall_nr;
 	int size = 0;
+	int uargs = 0;
 	bool mayfault;
 
 	/*
@@ -660,20 +724,27 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 	if (mayfault) {
 		struct syscall_buf_info *sinfo;
+		int i;
 
 		/* If the syscall_buffer is NULL, tracing is being shutdown */
 		sinfo = READ_ONCE(syscall_buffer);
 		if (!sinfo)
 			return;
 
-		user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+		user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
 		/*
 		 * user_size is the amount of data to append.
 		 * Need to add 4 for the meta field that points to
 		 * the user memory at the end of the event and also
 		 * stores its size.
 		 */
-		size = 4 + user_size;
+		for (i = 0; i < SYSCALL_FAULT_MAX_CNT; i++) {
+			if (user_sizes[i] < 0)
+				break;
+			size += user_sizes[i] + 4;
+		}
+		/* Save the number of user read arguments of this syscall */
+		uargs = i;
 	}
 
 	size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
@@ -688,6 +759,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
 
 	if (mayfault) {
+		char *buf = user_ptr;
 		void *ptr;
 		int val;
 
@@ -699,21 +771,30 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 
 		/*
 		 * The meta data will store the offset of the user data from
-		 * the beginning of the event.
+		 * the beginning of the event. That is after the static arguments
+		 * and the meta data fields.
 		 */
-		val  = (ptr - (void *)entry) + 4;
+		val = (ptr - (void *)entry) + 4 * uargs;
+
+		for (int i = 0; i < uargs; i++) {
 
-		/* Store the offset and the size into the meta data */
-		*(int *)ptr = val | (user_size << 16);
+			if (i)
+				val += user_sizes[i - 1];
 
-		if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
-			user_size = 0;
+			/* Store the offset and the size into the meta data */
+			*(int *)ptr = val | (user_sizes[i] << 16);
 
-		/* Nothing to do if the user space was empty or faulted */
-		if (user_size) {
-			/* Now store the user space data into the event */
+			/* Skip the meta data */
 			ptr += 4;
-			memcpy(ptr, user_ptr, user_size);
+		}
+
+		for (int i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+			/* Nothing to do if the user space was empty or faulted */
+			if (!user_sizes[i])
+				continue;
+
+			memcpy(ptr, buf, user_sizes[i]);
+			ptr += user_sizes[i];
 		}
 	}
 
@@ -857,6 +938,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
 static void check_faultable_syscall(struct trace_event_call *call, int nr)
 {
 	struct syscall_metadata *sys_data = call->data;
+	unsigned long mask;
 
 	/* Only work on entry */
 	if (sys_data->enter_event != call)
@@ -888,7 +970,6 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	/* user arg at position 0 */
 	case __NR_access:
 	case __NR_acct:
-	case __NR_add_key: /* Just _type. TODO add _description */
 	case __NR_chdir:
 	case __NR_chown:
 	case __NR_chmod:
@@ -897,28 +978,15 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	case __NR_delete_module:
 	case __NR_execve:
 	case __NR_fsopen:
-	case __NR_getxattr: /* Just pathname, TODO add name */
 	case __NR_lchown:
-	case __NR_lgetxattr: /* Just pathname, TODO add name */
-	case __NR_lremovexattr: /* Just pathname, TODO add name */
-	case __NR_link: /* Just oldname. TODO add newname */
-	case __NR_listxattr: /* Just pathname, TODO add list */
-	case __NR_llistxattr: /* Just pathname, TODO add list */
-	case __NR_lsetxattr: /* Just pathname, TODO add list */
 	case __NR_open:
 	case __NR_memfd_create:
-	case __NR_mount: /* Just dev_name, TODO add dir_name and type */
 	case __NR_mkdir:
 	case __NR_mknod:
 	case __NR_mq_open:
 	case __NR_mq_unlink:
-	case __NR_pivot_root: /* Just new_root, TODO add old_root */
 	case __NR_readlink:
-	case __NR_removexattr: /* Just pathname, TODO add name */
-	case __NR_rename: /* Just oldname. TODO add newname */
-	case __NR_request_key: /* Just _type. TODO add _description */
 	case __NR_rmdir:
-	case __NR_setxattr: /* Just pathname, TODO add list */
 	case __NR_shmdt:
 	case __NR_statfs:
 	case __NR_swapon:
@@ -945,14 +1013,10 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	case __NR_fspick:
 	case __NR_fremovexattr:
 	case __NR_futimesat:
-	case __NR_getxattrat: /* Just pathname, TODO add name */
 	case __NR_inotify_add_watch:
-	case __NR_linkat: /* Just oldname. TODO add newname */
-	case __NR_listxattrat: /* Just pathname, TODO add list */
 	case __NR_mkdirat:
 	case __NR_mknodat:
 	case __NR_mount_setattr:
-	case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
 	case __NR_name_to_handle_at:
 	case __NR_newfstatat:
 	case __NR_openat:
@@ -960,13 +1024,8 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	case __NR_open_tree:
 	case __NR_open_tree_attr:
 	case __NR_readlinkat:
-	case __NR_renameat: /* Just oldname. TODO add newname */
-	case __NR_renameat2: /* Just oldname. TODO add newname */
-	case __NR_removexattrat: /* Just pathname, TODO add name */
 	case __NR_quotactl:
-	case __NR_setxattrat: /* Just pathname, TODO add list */
 	case __NR_syslog:
-	case __NR_symlinkat: /* Just oldname. TODO add newname */
 	case __NR_statx:
 	case __NR_unlinkat:
 	case __NR_utimensat:
@@ -981,9 +1040,52 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
 	case __NR_fanotify_mark:
 		sys_data->user_mask = BIT(4);
 		break;
+	/* 2 user args, 0 and 1 */
+	case __NR_add_key:
+	case __NR_getxattr:
+	case __NR_lgetxattr:
+	case __NR_lremovexattr:
+	case __NR_link:
+	case __NR_listxattr:
+	case __NR_llistxattr:
+	case __NR_lsetxattr:
+	case __NR_pivot_root:
+	case __NR_removexattr:
+	case __NR_rename:
+	case __NR_request_key:
+	case __NR_setxattr:
+	case __NR_symlinkat:
+		sys_data->user_mask = BIT(0) | BIT(1);
+		break;
+	/* 2 user args, 1 and 3 */
+	case __NR_getxattrat:
+	case __NR_linkat:
+	case __NR_listxattrat:
+	case __NR_move_mount:
+	case __NR_renameat:
+	case __NR_renameat2:
+	case __NR_removexattrat:
+	case __NR_setxattrat:
+		sys_data->user_mask = BIT(1) | BIT(3);
+		break;
+	case __NR_mount: /* Just dev_name and dir_name, TODO add type */
+		sys_data->user_mask = BIT(0) | BIT(1) | BIT(2);
+		break;
 	default:
 		sys_data->user_mask = 0;
+		return;
 	}
+
+	if (sys_data->user_arg_size < 0)
+		return;
+
+	/*
+	 * The user_arg_size can only be used when the system call
+	 * is reading only a single address from user space.
+	 */
+	mask = sys_data->user_mask;
+	if (WARN_ON(mask & (mask - 1)))
+		sys_data->user_arg_size = -1;
 }
 
 static int __init init_syscall_trace(struct trace_event_call *call)
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
  2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
                   ` (5 preceding siblings ...)
  2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
  2025-08-06 10:50   ` Douglas Raillard
  2025-08-06 14:50   ` kernel test robot
  6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

From: Steven Rostedt <rostedt@goodmis.org>

When a system call that reads user space addresses copy it to the ring
buffer, it can copy up to 511 bytes of data. This can waste precious ring
buffer space if the user isn't interested in the output. Add a new file
"syscall_user_buf_size" that gets initialized to a new config
CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.rst |  7 +++++
 kernel/trace/Kconfig           | 13 +++++++++
 kernel/trace/trace.c           | 52 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h           |  3 ++
 kernel/trace/trace_syscalls.c  | 22 ++++++++------
 5 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index af66a05e18cc..4712bbfcfd08 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -366,6 +366,13 @@ of ftrace. Here is a list of some of the key files:
 	for each function. The displayed address is the patch-site address
 	and can differ from /proc/kallsyms address.
 
+  syscall_user_buf_size:
+
+	Some system call trace events will record the data from a user
+	space address that one of the parameters point to. The amount of
+	data per event is limited. This file holds the max number of bytes
+	that will be recorded into the ring buffer to hold this data.
+
   dyn_ftrace_total_info:
 
 	This file is for debugging purposes. The number of functions that
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index f80298e6aa16..aa28d7ca3e31 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -574,6 +574,19 @@ config FTRACE_SYSCALLS
 	help
 	  Basic tracer to catch the syscall entry and exit events.
 
+config TRACE_SYSCALL_BUF_SIZE_DEFAULT
+	int "System call user read max size"
+	range 0 128
+	default 128
+	depends on FTRACE_SYSCALLS
+	help
+	 Some system call trace events will record the data from a user
+	 space address that one of the parameters point to. The amount of
+	 data per event is limited. It may be further limited by this
+	 config and later changed by writing an ASCII number into:
+
+	   /sys/kernel/tracing/syscall_user_buf_size
+
 config TRACER_SNAPSHOT
 	bool "Create a snapshot trace buffer"
 	select TRACER_MAX_TRACE
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d0b1964648c1..1db708ed0625 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
 	goto out;
 }
 
+static ssize_t
+tracing_syscall_buf_read(struct file *filp, char __user *ubuf,
+			 size_t cnt, loff_t *ppos)
+{
+	struct inode *inode = file_inode(filp);
+	struct trace_array *tr = inode->i_private;
+	char buf[64];
+	int r;
+
+	r = snprintf(buf, 64, "%d\n", tr->syscall_buf_sz);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+tracing_syscall_buf_write(struct file *filp, const char __user *ubuf,
+			  size_t cnt, loff_t *ppos)
+{
+	struct inode *inode = file_inode(filp);
+	struct trace_array *tr = inode->i_private;
+	unsigned long val;
+	int ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	if (val > SYSCALL_FAULT_USER_MAX)
+		val = SYSCALL_FAULT_USER_MAX;
+
+	tr->syscall_buf_sz = val;
+
+	*ppos += cnt;
+
+	return cnt;
+}
+
 static ssize_t
 tracing_entries_read(struct file *filp, char __user *ubuf,
 		     size_t cnt, loff_t *ppos)
@@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_fops = {
 	.release	= tracing_release_generic_tr,
 };
 
+static const struct file_operations tracing_syscall_buf_fops = {
+	.open		= tracing_open_generic_tr,
+	.read		= tracing_syscall_buf_read,
+	.write		= tracing_syscall_buf_write,
+	.llseek		= generic_file_llseek,
+	.release	= tracing_release_generic_tr,
+};
+
 static const struct file_operations tracing_buffer_meta_fops = {
 	.open		= tracing_buffer_meta_open,
 	.read		= seq_read,
@@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const char *systems,
 
 	raw_spin_lock_init(&tr->start_lock);
 
+	tr->syscall_buf_sz = global_trace.syscall_buf_sz;
+
 	tr->max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 #ifdef CONFIG_TRACER_MAX_TRACE
 	spin_lock_init(&tr->snapshot_trigger_lock);
@@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
 			  tr, &buffer_subbuf_size_fops);
 
+	trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
+			 tr, &tracing_syscall_buf_fops);
+
 	create_trace_options_dir(tr);
 
 #ifdef CONFIG_TRACER_MAX_TRACE
@@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void)
 
 	global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
 
+	global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
+
 	INIT_LIST_HEAD(&global_trace.systems);
 	INIT_LIST_HEAD(&global_trace.events);
 	INIT_LIST_HEAD(&global_trace.hist_vars);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 1dbf1d3cf2f1..1b3e464619f0 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -131,6 +131,8 @@ enum trace_type {
 #define HIST_STACKTRACE_SIZE	(HIST_STACKTRACE_DEPTH * sizeof(unsigned long))
 #define HIST_STACKTRACE_SKIP	5
 
+#define SYSCALL_FAULT_USER_MAX 128
+
 /*
  * syscalls are special, and need special handling, this is why
  * they are not included in trace_entries.h
@@ -430,6 +432,7 @@ struct trace_array {
 	int			function_enabled;
 #endif
 	int			no_filter_buffering_ref;
+	unsigned int		syscall_buf_sz;
 	struct list_head	hist_vars;
 #ifdef CONFIG_TRACER_SNAPSHOT
 	struct cond_snapshot	*cond_snapshot;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b39fa9dd1067..e9162165c4d2 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -407,17 +407,16 @@ struct syscall_buf_info {
  * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
  *  It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
  *  needs to append the EXTRA or not.
+ *  (defined in kernel/trace/trace.h)
  *
  * This only allows up to 3 args from system calls.
  */
 #define SYSCALL_FAULT_BUF_SZ 512
 #define SYSCALL_FAULT_ARG_SZ 168
-#define SYSCALL_FAULT_USER_MAX 128
 #define SYSCALL_FAULT_MAX_CNT 3
 
 static struct syscall_buf_info *syscall_buffer;
 static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
-
 static int syscall_fault_buffer_cnt;
 
 static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
@@ -524,7 +523,7 @@ static void syscall_fault_buffer_disable(void)
 	call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
 }
 
-static char *sys_fault_user(struct syscall_metadata *sys_data,
+static char *sys_fault_user(struct trace_array *tr, struct syscall_metadata *sys_data,
 			    struct syscall_buf_info *sinfo,
 			    unsigned long *args,
 			    unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
@@ -576,6 +575,10 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
 		data_size[i] = -1; /* Denotes no pointer */
 	}
 
+	/* A zero size means do not even try */
+	if (!tr->syscall_buf_sz)
+		return buffer;
+
  again:
 	/*
 	 * If this task is preempted by another user space task, it
@@ -659,19 +662,20 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
 					buf[x] = '.';
 			}
 
+			size = min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX);
+
 			/*
 			 * If the text was truncated due to our max limit,
 			 * add "..." to the string.
 			 */
-			if (ret > SYSCALL_FAULT_USER_MAX) {
-				strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
-					sizeof(EXTRA));
-				ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+			if (ret > size) {
+				strscpy(buf + size, EXTRA, sizeof(EXTRA));
+				ret = size + sizeof(EXTRA);
 			} else {
 				buf[ret++] = '\0';
 			}
 		} else {
-			ret = min(ret, SYSCALL_FAULT_USER_MAX);
+			ret = min((unsigned int)ret, tr->syscall_buf_sz);
 		}
 		data_size[i] = ret;
 	}
@@ -731,7 +735,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 		if (!sinfo)
 			return;
 
-		user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
+		user_ptr = sys_fault_user(tr, sys_data, sinfo, args, user_sizes);
 		/*
 		 * user_size is the amount of data to append.
 		 * Need to add 4 for the meta field that points to
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
  2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
@ 2025-08-06 10:24   ` Douglas Raillard
  2025-08-06 12:39     ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 10:24 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo

On 05-08-2025 20:26, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Currently the syscall trace events show each value as hexadecimal, but
> without adding "0x" it can be confusing:
> 
>     sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
> 
> Looks like the above write wrote 44 bytes, when in reality it wrote 68
> bytes.
> 
> Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
> For values less than 10, leave off the "0x" as that just adds noise to the
> output.

I'm on the fence for the value-dependent format. This looks like
it could easily make life harder for quick&dirty scripts, but then both
awk's strtonum() and Python's int(x, base=16) seem to handle
the presence/absence of the 0x prefix so maybe it's a non-issue.

OTH, a hand-crafted regex designed after a small set of input may start
to randomly fail if one field unexpectedly goes beyond 10 ...

Just using explicit hex may be the best here, as the actual proper fix
(type-level display hints) is harder. It could probably be implemented
using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.

> 
> Also change the iterator to check if "i" is nonzero and print the ", "
> delimiter at the start, then adding the logic to the trace_seq_printf() at
> the end.
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>   kernel/trace/trace_syscalls.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 3a0b65f89130..0f932b22f9ec 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
>   		if (trace_seq_has_overflowed(s))
>   			goto end;
>   
> +		if (i)
> +			trace_seq_puts(s, ", ");
> +
>   		/* parameter types */
>   		if (tr && tr->trace_flags & TRACE_ITER_VERBOSE)
>   			trace_seq_printf(s, "%s ", entry->types[i]);
>   
>   		/* parameter values */
> -		trace_seq_printf(s, "%s: %lx%s", entry->args[i],
> -				 trace->args[i],
> -				 i == entry->nb_args - 1 ? "" : ", ");
> +		if (trace->args[i] < 10)
> +			trace_seq_printf(s, "%s: %lu", entry->args[i],
> +					 trace->args[i]);
> +		else
> +			trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
> +					 trace->args[i]);
>   	}
>   
>   	trace_seq_putc(s, ')');

-- 
Douglas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
  2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
@ 2025-08-06 10:50   ` Douglas Raillard
  2025-08-06 12:43     ` Steven Rostedt
  2025-08-06 14:50   ` kernel test robot
  1 sibling, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 10:50 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo

On 05-08-2025 20:26, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When a system call that reads user space addresses copy it to the ring
> buffer, it can copy up to 511 bytes of data. This can waste precious ring
> buffer space if the user isn't interested in the output. Add a new file
> "syscall_user_buf_size" that gets initialized to a new config
> CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.

Have you considered dynamically removing some event fields ? We routinely hit
the same problem with some of our events that have rarely-used large fields.

If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
that allowed selecting what field is needed that would be amazing. I had plans
to build something like that in our kernel module based on the synthetic events API,
but did not proceed as that API is not exported in a useful way.

> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>   Documentation/trace/ftrace.rst |  7 +++++
>   kernel/trace/Kconfig           | 13 +++++++++
>   kernel/trace/trace.c           | 52 ++++++++++++++++++++++++++++++++++
>   kernel/trace/trace.h           |  3 ++
>   kernel/trace/trace_syscalls.c  | 22 ++++++++------
>   5 files changed, 88 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
> index af66a05e18cc..4712bbfcfd08 100644
> --- a/Documentation/trace/ftrace.rst
> +++ b/Documentation/trace/ftrace.rst
> @@ -366,6 +366,13 @@ of ftrace. Here is a list of some of the key files:
>   	for each function. The displayed address is the patch-site address
>   	and can differ from /proc/kallsyms address.
>   
> +  syscall_user_buf_size:
> +
> +	Some system call trace events will record the data from a user
> +	space address that one of the parameters point to. The amount of
> +	data per event is limited. This file holds the max number of bytes
> +	that will be recorded into the ring buffer to hold this data.
> +
>     dyn_ftrace_total_info:
>   
>   	This file is for debugging purposes. The number of functions that
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index f80298e6aa16..aa28d7ca3e31 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -574,6 +574,19 @@ config FTRACE_SYSCALLS
>   	help
>   	  Basic tracer to catch the syscall entry and exit events.
>   
> +config TRACE_SYSCALL_BUF_SIZE_DEFAULT
> +	int "System call user read max size"
> +	range 0 128
> +	default 128
> +	depends on FTRACE_SYSCALLS
> +	help
> +	 Some system call trace events will record the data from a user
> +	 space address that one of the parameters point to. The amount of
> +	 data per event is limited. It may be further limited by this
> +	 config and later changed by writing an ASCII number into:
> +
> +	   /sys/kernel/tracing/syscall_user_buf_size
> +
>   config TRACER_SNAPSHOT
>   	bool "Create a snapshot trace buffer"
>   	select TRACER_MAX_TRACE
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d0b1964648c1..1db708ed0625 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>   	goto out;
>   }
>   
> +static ssize_t
> +tracing_syscall_buf_read(struct file *filp, char __user *ubuf,
> +			 size_t cnt, loff_t *ppos)
> +{
> +	struct inode *inode = file_inode(filp);
> +	struct trace_array *tr = inode->i_private;
> +	char buf[64];
> +	int r;
> +
> +	r = snprintf(buf, 64, "%d\n", tr->syscall_buf_sz);
> +
> +	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> +}
> +
> +static ssize_t
> +tracing_syscall_buf_write(struct file *filp, const char __user *ubuf,
> +			  size_t cnt, loff_t *ppos)
> +{
> +	struct inode *inode = file_inode(filp);
> +	struct trace_array *tr = inode->i_private;
> +	unsigned long val;
> +	int ret;
> +
> +	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (val > SYSCALL_FAULT_USER_MAX)
> +		val = SYSCALL_FAULT_USER_MAX;
> +
> +	tr->syscall_buf_sz = val;
> +
> +	*ppos += cnt;
> +
> +	return cnt;
> +}
> +
>   static ssize_t
>   tracing_entries_read(struct file *filp, char __user *ubuf,
>   		     size_t cnt, loff_t *ppos)
> @@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_fops = {
>   	.release	= tracing_release_generic_tr,
>   };
>   
> +static const struct file_operations tracing_syscall_buf_fops = {
> +	.open		= tracing_open_generic_tr,
> +	.read		= tracing_syscall_buf_read,
> +	.write		= tracing_syscall_buf_write,
> +	.llseek		= generic_file_llseek,
> +	.release	= tracing_release_generic_tr,
> +};
> +
>   static const struct file_operations tracing_buffer_meta_fops = {
>   	.open		= tracing_buffer_meta_open,
>   	.read		= seq_read,
> @@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const char *systems,
>   
>   	raw_spin_lock_init(&tr->start_lock);
>   
> +	tr->syscall_buf_sz = global_trace.syscall_buf_sz;
> +
>   	tr->max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
>   #ifdef CONFIG_TRACER_MAX_TRACE
>   	spin_lock_init(&tr->snapshot_trigger_lock);
> @@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
>   	trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
>   			  tr, &buffer_subbuf_size_fops);
>   
> +	trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
> +			 tr, &tracing_syscall_buf_fops);
> +
>   	create_trace_options_dir(tr);
>   
>   #ifdef CONFIG_TRACER_MAX_TRACE
> @@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void)
>   
>   	global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
>   
> +	global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
> +
>   	INIT_LIST_HEAD(&global_trace.systems);
>   	INIT_LIST_HEAD(&global_trace.events);
>   	INIT_LIST_HEAD(&global_trace.hist_vars);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 1dbf1d3cf2f1..1b3e464619f0 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -131,6 +131,8 @@ enum trace_type {
>   #define HIST_STACKTRACE_SIZE	(HIST_STACKTRACE_DEPTH * sizeof(unsigned long))
>   #define HIST_STACKTRACE_SKIP	5
>   
> +#define SYSCALL_FAULT_USER_MAX 128
> +
>   /*
>    * syscalls are special, and need special handling, this is why
>    * they are not included in trace_entries.h
> @@ -430,6 +432,7 @@ struct trace_array {
>   	int			function_enabled;
>   #endif
>   	int			no_filter_buffering_ref;
> +	unsigned int		syscall_buf_sz;
>   	struct list_head	hist_vars;
>   #ifdef CONFIG_TRACER_SNAPSHOT
>   	struct cond_snapshot	*cond_snapshot;
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index b39fa9dd1067..e9162165c4d2 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -407,17 +407,16 @@ struct syscall_buf_info {
>    * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
>    *  It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
>    *  needs to append the EXTRA or not.
> + *  (defined in kernel/trace/trace.h)
>    *
>    * This only allows up to 3 args from system calls.
>    */
>   #define SYSCALL_FAULT_BUF_SZ 512
>   #define SYSCALL_FAULT_ARG_SZ 168
> -#define SYSCALL_FAULT_USER_MAX 128
>   #define SYSCALL_FAULT_MAX_CNT 3
>   
>   static struct syscall_buf_info *syscall_buffer;
>   static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
> -
>   static int syscall_fault_buffer_cnt;
>   
>   static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
> @@ -524,7 +523,7 @@ static void syscall_fault_buffer_disable(void)
>   	call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
>   }
>   
> -static char *sys_fault_user(struct syscall_metadata *sys_data,
> +static char *sys_fault_user(struct trace_array *tr, struct syscall_metadata *sys_data,
>   			    struct syscall_buf_info *sinfo,
>   			    unsigned long *args,
>   			    unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
> @@ -576,6 +575,10 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
>   		data_size[i] = -1; /* Denotes no pointer */
>   	}
>   
> +	/* A zero size means do not even try */
> +	if (!tr->syscall_buf_sz)
> +		return buffer;
> +
>    again:
>   	/*
>   	 * If this task is preempted by another user space task, it
> @@ -659,19 +662,20 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
>   					buf[x] = '.';
>   			}
>   
> +			size = min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX);
> +
>   			/*
>   			 * If the text was truncated due to our max limit,
>   			 * add "..." to the string.
>   			 */
> -			if (ret > SYSCALL_FAULT_USER_MAX) {
> -				strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
> -					sizeof(EXTRA));
> -				ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
> +			if (ret > size) {
> +				strscpy(buf + size, EXTRA, sizeof(EXTRA));
> +				ret = size + sizeof(EXTRA);
>   			} else {
>   				buf[ret++] = '\0';
>   			}
>   		} else {
> -			ret = min(ret, SYSCALL_FAULT_USER_MAX);
> +			ret = min((unsigned int)ret, tr->syscall_buf_sz);
>   		}
>   		data_size[i] = ret;
>   	}
> @@ -731,7 +735,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>   		if (!sinfo)
>   			return;
>   
> -		user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
> +		user_ptr = sys_fault_user(tr, sys_data, sinfo, args, user_sizes);
>   		/*
>   		 * user_size is the amount of data to append.
>   		 * Need to add 4 for the meta field that points to

--
Douglas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
  2025-08-06 10:24   ` Douglas Raillard
@ 2025-08-06 12:39     ` Steven Rostedt
  2025-08-06 16:42       ` Douglas Raillard
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-06 12:39 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo

On Wed, 6 Aug 2025 11:24:33 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:

> On 05-08-2025 20:26, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > Currently the syscall trace events show each value as hexadecimal, but
> > without adding "0x" it can be confusing:
> > 
> >     sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
> > 
> > Looks like the above write wrote 44 bytes, when in reality it wrote 68
> > bytes.
> > 
> > Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
> > For values less than 10, leave off the "0x" as that just adds noise to the
> > output.  
> 
> I'm on the fence for the value-dependent format. This looks like
> it could easily make life harder for quick&dirty scripts, but then both
> awk's strtonum() and Python's int(x, base=16) seem to handle
> the presence/absence of the 0x prefix so maybe it's a non-issue.

Yes, but the trace file is more for humans than scripts, and the 0x1 is
just noise.

> 
> OTH, a hand-crafted regex designed after a small set of input may start
> to randomly fail if one field unexpectedly goes beyond 10 ...

If you are doing hand crafted scripts, I'd suggest to use tracing
filters or other tooling that can handle this easily ;-)

> 
> Just using explicit hex may be the best here, as the actual proper fix
> (type-level display hints) is harder. It could probably be implemented
> using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.

That would add a dependency on BTF, which I would like to avoid.

Thanks!

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
  2025-08-06 10:50   ` Douglas Raillard
@ 2025-08-06 12:43     ` Steven Rostedt
  2025-08-06 16:21       ` Douglas Raillard
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-06 12:43 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo

On Wed, 6 Aug 2025 11:50:06 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:

> On 05-08-2025 20:26, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > When a system call that reads user space addresses copy it to the ring
> > buffer, it can copy up to 511 bytes of data. This can waste precious ring
> > buffer space if the user isn't interested in the output. Add a new file
> > "syscall_user_buf_size" that gets initialized to a new config
> > CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.  
> 
> Have you considered dynamically removing some event fields ? We routinely hit
> the same problem with some of our events that have rarely-used large fields.

We do that already with eprobes. Note, syscall events are pseudo events
hooked on the raw_syscall events. Thus modifying what is displayed is
trivial as it's done manually anyway. For normal events, it's all in
the TRACE_EVENT() macro which defines the fields at boot. Trying to
modify it later is very difficult.

> 
> If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
> that allowed selecting what field is needed that would be amazing. I had plans
> to build something like that in our kernel module based on the synthetic events API,
> but did not proceed as that API is not exported in a useful way.

Take a look at eprobes. You can make a new event based from an existing
event (including other dynamic events and syscalls).

I finally got around to adding documentation about it:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/7] tracing: Have syscall trace events read user space string
  2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-08-06 14:39   ` kernel test robot
  2025-08-11  5:19   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 14:39 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Linux Memory Management List, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
	Ian Rogers, aahringo, Douglas Raillard

Hi Steven,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193235.080757106%40kernel.org
patch subject: [PATCH 3/7] tracing: Have syscall trace events read user space string
config: parisc-randconfig-r071-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062230.puMRaDdE-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062230.puMRaDdE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062230.puMRaDdE-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/trace/trace_syscalls.c: In function 'check_faultable_syscall':
>> kernel/trace/trace_syscalls.c:886:14: error: '__NR_newfstatat' undeclared (first use in this function); did you mean 'sys_newfstatat'?
     886 |         case __NR_newfstatat:
         |              ^~~~~~~~~~~~~~~
         |              sys_newfstatat
   kernel/trace/trace_syscalls.c:886:14: note: each undeclared identifier is reported only once for each function it appears in


vim +886 kernel/trace/trace_syscalls.c

   802	
   803	/*
   804	 * For system calls that reference user space memory that can
   805	 * be recorded into the event, set the system call meta data's user_mask
   806	 * to the "args" index that points to the user space memory to retrieve.
   807	 */
   808	static void check_faultable_syscall(struct trace_event_call *call, int nr)
   809	{
   810		struct syscall_metadata *sys_data = call->data;
   811	
   812		/* Only work on entry */
   813		if (sys_data->enter_event != call)
   814			return;
   815	
   816		switch (nr) {
   817		/* user arg at position 0 */
   818		case __NR_access:
   819		case __NR_acct:
   820		case __NR_add_key: /* Just _type. TODO add _description */
   821		case __NR_chdir:
   822		case __NR_chown:
   823		case __NR_chmod:
   824		case __NR_chroot:
   825		case __NR_creat:
   826		case __NR_delete_module:
   827		case __NR_execve:
   828		case __NR_fsopen:
   829		case __NR_getxattr: /* Just pathname, TODO add name */
   830		case __NR_lchown:
   831		case __NR_lgetxattr: /* Just pathname, TODO add name */
   832		case __NR_lremovexattr: /* Just pathname, TODO add name */
   833		case __NR_link: /* Just oldname. TODO add newname */
   834		case __NR_listxattr: /* Just pathname, TODO add list */
   835		case __NR_llistxattr: /* Just pathname, TODO add list */
   836		case __NR_lsetxattr: /* Just pathname, TODO add list */
   837		case __NR_open:
   838		case __NR_memfd_create:
   839		case __NR_mount: /* Just dev_name, TODO add dir_name and type */
   840		case __NR_mkdir:
   841		case __NR_mknod:
   842		case __NR_mq_open:
   843		case __NR_mq_unlink:
   844		case __NR_pivot_root: /* Just new_root, TODO add old_root */
   845		case __NR_readlink:
   846		case __NR_removexattr: /* Just pathname, TODO add name */
   847		case __NR_rename: /* Just oldname. TODO add newname */
   848		case __NR_request_key: /* Just _type. TODO add _description */
   849		case __NR_rmdir:
   850		case __NR_setxattr: /* Just pathname, TODO add list */
   851		case __NR_shmdt:
   852		case __NR_statfs:
   853		case __NR_swapon:
   854		case __NR_swapoff:
   855		case __NR_symlink: /* Just oldname. TODO add newname */
   856		case __NR_truncate:
   857		case __NR_unlink:
   858		case __NR_umount2:
   859		case __NR_utime:
   860		case __NR_utimes:
   861			sys_data->user_mask = BIT(0);
   862			break;
   863		/* user arg at position 1 */
   864		case __NR_execveat:
   865		case __NR_faccessat:
   866		case __NR_faccessat2:
   867		case __NR_finit_module:
   868		case __NR_fchmodat:
   869		case __NR_fchmodat2:
   870		case __NR_fchownat:
   871		case __NR_fgetxattr:
   872		case __NR_flistxattr:
   873		case __NR_fsetxattr:
   874		case __NR_fspick:
   875		case __NR_fremovexattr:
   876		case __NR_futimesat:
   877		case __NR_getxattrat: /* Just pathname, TODO add name */
   878		case __NR_inotify_add_watch:
   879		case __NR_linkat: /* Just oldname. TODO add newname */
   880		case __NR_listxattrat: /* Just pathname, TODO add list */
   881		case __NR_mkdirat:
   882		case __NR_mknodat:
   883		case __NR_mount_setattr:
   884		case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
   885		case __NR_name_to_handle_at:
 > 886		case __NR_newfstatat:
   887		case __NR_openat:
   888		case __NR_openat2:
   889		case __NR_open_tree:
   890		case __NR_open_tree_attr:
   891		case __NR_readlinkat:
   892		case __NR_renameat: /* Just oldname. TODO add newname */
   893		case __NR_renameat2: /* Just oldname. TODO add newname */
   894		case __NR_removexattrat: /* Just pathname, TODO add name */
   895		case __NR_quotactl:
   896		case __NR_setxattrat: /* Just pathname, TODO add list */
   897		case __NR_syslog:
   898		case __NR_symlinkat: /* Just oldname. TODO add newname */
   899		case __NR_statx:
   900		case __NR_unlinkat:
   901		case __NR_utimensat:
   902			sys_data->user_mask = BIT(1);
   903			break;
   904		/* user arg at position 2 */
   905		case __NR_init_module:
   906		case __NR_fsconfig:
   907			sys_data->user_mask = BIT(2);
   908			break;
   909		/* user arg at position 4 */
   910		case __NR_fanotify_mark:
   911			sys_data->user_mask = BIT(4);
   912			break;
   913		default:
   914			sys_data->user_mask = 0;
   915		}
   916	}
   917	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
  2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
  2025-08-06 10:50   ` Douglas Raillard
@ 2025-08-06 14:50   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 14:50 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: llvm, oe-kbuild-all, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Linux Memory Management List,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard

Hi Steven,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193235.747004484%40kernel.org
patch subject: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
config: hexagon-randconfig-002-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062211.cwYqtLu0-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 7b8dea265e72c3037b6b1e54d5ab51b7e14f328b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062211.cwYqtLu0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062211.cwYqtLu0-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/trace/trace.c:11128:32: error: use of undeclared identifier 'CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT'
    11128 |         global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
          |                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.


vim +/CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT +11128 kernel/trace/trace.c

 11110	
 11111		init_trace_flags_index(&global_trace);
 11112	
 11113		register_tracer(&nop_trace);
 11114	
 11115		/* Function tracing may start here (via kernel command line) */
 11116		init_function_trace();
 11117	
 11118		/* All seems OK, enable tracing */
 11119		tracing_disabled = 0;
 11120	
 11121		atomic_notifier_chain_register(&panic_notifier_list,
 11122					       &trace_panic_notifier);
 11123	
 11124		register_die_notifier(&trace_die_notifier);
 11125	
 11126		global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
 11127	
 11128		global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
 11129	
 11130		INIT_LIST_HEAD(&global_trace.systems);
 11131		INIT_LIST_HEAD(&global_trace.events);
 11132		INIT_LIST_HEAD(&global_trace.hist_vars);
 11133		INIT_LIST_HEAD(&global_trace.err_log);
 11134		list_add(&global_trace.marker_list, &marker_copies);
 11135		list_add(&global_trace.list, &ftrace_trace_arrays);
 11136	
 11137		apply_trace_boot_options();
 11138	
 11139		register_snapshot_cmd();
 11140	
 11141		return 0;
 11142	
 11143	out_free_pipe_cpumask:
 11144		free_cpumask_var(global_trace.pipe_cpumask);
 11145	out_free_savedcmd:
 11146		trace_free_saved_cmdlines_buffer();
 11147	out_free_temp_buffer:
 11148		ring_buffer_free(temp_buffer);
 11149	out_rm_hp_state:
 11150		cpuhp_remove_multi_state(CPUHP_TRACE_RB_PREPARE);
 11151	out_free_cpumask:
 11152		free_cpumask_var(global_trace.tracing_cpumask);
 11153	out_free_buffer_mask:
 11154		free_cpumask_var(tracing_buffer_mask);
 11155		return ret;
 11156	}
 11157	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/7] tracing: Display some syscall arrays as strings
  2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
@ 2025-08-06 15:12   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 15:12 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Linux Memory Management List, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
	Ian Rogers, aahringo, Douglas Raillard

Hi Steven,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193235.416382557%40kernel.org
patch subject: [PATCH 5/7] tracing: Display some syscall arrays as strings
config: i386-randconfig-141-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062215.aRvDXrBA-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062215.aRvDXrBA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062215.aRvDXrBA-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/trace/trace_syscalls.c: In function 'check_faultable_syscall':
>> kernel/trace/trace_syscalls.c:883:14: error: '__NR_kexec_file_load' undeclared (first use in this function)
     883 |         case __NR_kexec_file_load:
         |              ^~~~~~~~~~~~~~~~~~~~
   kernel/trace/trace_syscalls.c:883:14: note: each undeclared identifier is reported only once for each function it appears in
   kernel/trace/trace_syscalls.c:957:14: error: '__NR_newfstatat' undeclared (first use in this function)
     957 |         case __NR_newfstatat:
         |              ^~~~~~~~~~~~~~~


vim +/__NR_kexec_file_load +883 kernel/trace/trace_syscalls.c

   851	
   852	/*
   853	 * For system calls that reference user space memory that can
   854	 * be recorded into the event, set the system call meta data's user_mask
   855	 * to the "args" index that points to the user space memory to retrieve.
   856	 */
   857	static void check_faultable_syscall(struct trace_event_call *call, int nr)
   858	{
   859		struct syscall_metadata *sys_data = call->data;
   860	
   861		/* Only work on entry */
   862		if (sys_data->enter_event != call)
   863			return;
   864	
   865		sys_data->user_arg_size = -1;
   866	
   867		switch (nr) {
   868		/* user arg 1 with size arg at 2 */
   869		case __NR_write:
   870		case __NR_mq_timedsend:
   871		case __NR_pwrite64:
   872			sys_data->user_mask = BIT(1);
   873			sys_data->user_arg_size = 2;
   874			break;
   875		/* user arg 0 with size arg at 1 as string */
   876		case __NR_setdomainname:
   877		case __NR_sethostname:
   878			sys_data->user_mask = BIT(0);
   879			sys_data->user_arg_size = 1;
   880			sys_data->user_arg_is_str = 1;
   881			break;
   882		/* user arg 4 with size arg at 3 as string */
 > 883		case __NR_kexec_file_load:
   884			sys_data->user_mask = BIT(4);
   885			sys_data->user_arg_size = 3;
   886			sys_data->user_arg_is_str = 1;
   887			break;
   888		/* user arg at position 0 */
   889		case __NR_access:
   890		case __NR_acct:
   891		case __NR_add_key: /* Just _type. TODO add _description */
   892		case __NR_chdir:
   893		case __NR_chown:
   894		case __NR_chmod:
   895		case __NR_chroot:
   896		case __NR_creat:
   897		case __NR_delete_module:
   898		case __NR_execve:
   899		case __NR_fsopen:
   900		case __NR_getxattr: /* Just pathname, TODO add name */
   901		case __NR_lchown:
   902		case __NR_lgetxattr: /* Just pathname, TODO add name */
   903		case __NR_lremovexattr: /* Just pathname, TODO add name */
   904		case __NR_link: /* Just oldname. TODO add newname */
   905		case __NR_listxattr: /* Just pathname, TODO add list */
   906		case __NR_llistxattr: /* Just pathname, TODO add list */
   907		case __NR_lsetxattr: /* Just pathname, TODO add list */
   908		case __NR_open:
   909		case __NR_memfd_create:
   910		case __NR_mount: /* Just dev_name, TODO add dir_name and type */
   911		case __NR_mkdir:
   912		case __NR_mknod:
   913		case __NR_mq_open:
   914		case __NR_mq_unlink:
   915		case __NR_pivot_root: /* Just new_root, TODO add old_root */
   916		case __NR_readlink:
   917		case __NR_removexattr: /* Just pathname, TODO add name */
   918		case __NR_rename: /* Just oldname. TODO add newname */
   919		case __NR_request_key: /* Just _type. TODO add _description */
   920		case __NR_rmdir:
   921		case __NR_setxattr: /* Just pathname, TODO add list */
   922		case __NR_shmdt:
   923		case __NR_statfs:
   924		case __NR_swapon:
   925		case __NR_swapoff:
   926		case __NR_symlink: /* Just oldname. TODO add newname */
   927		case __NR_truncate:
   928		case __NR_unlink:
   929		case __NR_umount2:
   930		case __NR_utime:
   931		case __NR_utimes:
   932			sys_data->user_mask = BIT(0);
   933			break;
   934		/* user arg at position 1 */
   935		case __NR_execveat:
   936		case __NR_faccessat:
   937		case __NR_faccessat2:
   938		case __NR_finit_module:
   939		case __NR_fchmodat:
   940		case __NR_fchmodat2:
   941		case __NR_fchownat:
   942		case __NR_fgetxattr:
   943		case __NR_flistxattr:
   944		case __NR_fsetxattr:
   945		case __NR_fspick:
   946		case __NR_fremovexattr:
   947		case __NR_futimesat:
   948		case __NR_getxattrat: /* Just pathname, TODO add name */
   949		case __NR_inotify_add_watch:
   950		case __NR_linkat: /* Just oldname. TODO add newname */
   951		case __NR_listxattrat: /* Just pathname, TODO add list */
   952		case __NR_mkdirat:
   953		case __NR_mknodat:
   954		case __NR_mount_setattr:
   955		case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
   956		case __NR_name_to_handle_at:
   957		case __NR_newfstatat:
   958		case __NR_openat:
   959		case __NR_openat2:
   960		case __NR_open_tree:
   961		case __NR_open_tree_attr:
   962		case __NR_readlinkat:
   963		case __NR_renameat: /* Just oldname. TODO add newname */
   964		case __NR_renameat2: /* Just oldname. TODO add newname */
   965		case __NR_removexattrat: /* Just pathname, TODO add name */
   966		case __NR_quotactl:
   967		case __NR_setxattrat: /* Just pathname, TODO add list */
   968		case __NR_syslog:
   969		case __NR_symlinkat: /* Just oldname. TODO add newname */
   970		case __NR_statx:
   971		case __NR_unlinkat:
   972		case __NR_utimensat:
   973			sys_data->user_mask = BIT(1);
   974			break;
   975		/* user arg at position 2 */
   976		case __NR_init_module:
   977		case __NR_fsconfig:
   978			sys_data->user_mask = BIT(2);
   979			break;
   980		/* user arg at position 4 */
   981		case __NR_fanotify_mark:
   982			sys_data->user_mask = BIT(4);
   983			break;
   984		default:
   985			sys_data->user_mask = 0;
   986		}
   987	}
   988	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
  2025-08-06 12:43     ` Steven Rostedt
@ 2025-08-06 16:21       ` Douglas Raillard
  0 siblings, 0 replies; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 16:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo

On 06-08-2025 13:43, Steven Rostedt wrote:
> On Wed, 6 Aug 2025 11:50:06 +0100
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
>> On 05-08-2025 20:26, Steven Rostedt wrote:
>>> From: Steven Rostedt <rostedt@goodmis.org>
>>>
>>> When a system call that reads user space addresses copy it to the ring
>>> buffer, it can copy up to 511 bytes of data. This can waste precious ring
>>> buffer space if the user isn't interested in the output. Add a new file
>>> "syscall_user_buf_size" that gets initialized to a new config
>>> CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.
>>
>> Have you considered dynamically removing some event fields ? We routinely hit
>> the same problem with some of our events that have rarely-used large fields.
> 
> We do that already with eprobes. Note, syscall events are pseudo events
> hooked on the raw_syscall events. Thus modifying what is displayed is
> trivial as it's done manually anyway. For normal events, it's all in
> the TRACE_EVENT() macro which defines the fields at boot. Trying to
> modify it later is very difficult.

I was thinking at a filtering step between assigning to an event struct
with TP_fast_assign and actually writing it to the buffer. An array of (offset, size)
would allow selecting which field is to be copied to the buffer, the rest would
be left out (a bit like in some parts of the synthetic event API). The format
file would be impacted to remove some fields, but hopefully not too many other
corners of ftrace.

The advantage of that over eprobe would be:
1. full support of all field types
2. probably lower overhead than the fetch_op interpreter, but maybe not by much.
3. less moving pieces for the user (e.g. no need to have BTF for by-name field access,
    no new event name to come up with etc.)

> 
>>
>> If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
>> that allowed selecting what field is needed that would be amazing. I had plans
>> to build something like that in our kernel module based on the synthetic events API,
>> but did not proceed as that API is not exported in a useful way.
> 
> Take a look at eprobes. You can make a new event based from an existing
> event (including other dynamic events and syscalls).
> I finally got around to adding documentation about it:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst
> 

That's very interesting, I did not realize that you could access the actual event fields
and not just the tracepoint args. With your recent BTF patch, there is now little limits
on how deep you can drill down in the structs which is great (and actually more powerful
than the original event itself).

Before userspace tooling could make use of that as a field filtering system, a few friction
points would need to be addressed:

1. Getting the field list programmatically is currently not really possible as dealing with
    the format file is very tricky. We could just pass on the user-requested field
    to the kernel but that would prevent userspace validation with usable error reporting
    (the 6.15 kernel I tried it on gave me EINVAL and not even a dmesg error when trying to use
    a field that does not exist)

2. The type of the field is not inferred, e.g. an explicit ":string" is needed here:

      e:my/sched_switch sched.sched_switch prev_comm=$prev_comm:string

    The only place a tool can get this info from is the format file, which means you have to
    parse it and apply some conversions (e.g. "__data_loc char[]" becomes "string").

3. Only a restricted subset of field types is supported, e.g. no cpumask, buffers other
    than strings etc. In practice, this means the userspace tooling will have to either:
      * pass on the restriction to the users (can easily lead to a terrible UX by misleading
        the user to think filtering is generally available when in fact it's not).
      * or only treat that as a hint and use the unfiltered original event if the user asks
        for a field with an unsupported type.

On the bright side, creating a new event like "e:my/sched_switch" gives the event name "sched_switch" but
trace-cmd start -e my/sched_switch will only enable the new event which is exactly what we need.
This way, the trace can look like a normal one except less fields, so downstream data processing
is not impacted and only the data-gathering step needs to know about it.

Depending on whether we want/can deal with those friction point, it could either become a high-level
layer usable like the base event system with extra low-level abilities, or stay as a tool only suitable for
hand-crafted use cases where the user has deeper knowledge of layout on all involved kernels.

On a related note, if we wanted to make something that allowed reducing the amount of stored data and
that could deeply integrate with the userspace tooling in charge of collecting the data to run a user-defined query,
the best bet is to target SQL-like systems. That family is very established and virtually all trace-processing system
will use it as first stage (e.g. Perfetto with sqlite, or LISA with Polars dataframes).
In those systems, some important information can typically be extracted from the user query [1]:

1. Projection: which tables and columns the query needs. In ftrace, that's the list of events and what fields
    are needed. Other events/fields can be discarded as they won't be read by the query.

2. Row limit: how many rows the query will read (not always available obviously). In ftrace, that would allow
    automatically stopping the tracing when the event count reaches a limit, or set the buffer size based on
    the event size for a flight-recorder approach. Additional event occurrences would be discarded by the query
    anyway.

3. Predicate filtering: If the query contains a filter to only select rows with a column equal to a specific
    value. Other rows don't need to be collected as the query will discard them anyway.

Currently:
1. is partially implemented as you can select specific events, but not what field you want.
2. is partially implemented (buffer size, but AFAIK there is no way of telling ftrace to stop tracing after N events).
3. is fully implemented with /sys/kernel/debug/tracing/events/*/*/filter

If all those are implemented, ftrace would be able to make use of the most important implicit info available
in the user query to limit the collected data size, without the user having to tune anything manually
and without turning the kernel into a full-blown SQL interpreter.

[1] In the Polars dataframe library, data sources such as a parquet file served over HTTP are called "scans".
     When Polars executes an expression, it will get the data from the scans the expression refers to,
     and will pass the 3 pieces of info to the scan implementation so that processed data size can be minimized
     as early as possible in the pipeline. This is referred to as "projection pushdown", "slice pushdown" and "predicate pushdown":
     https://docs.pola.rs/user-guide/lazy/optimizations/
     If some filtering condition is too complex to express in the limited scan predicate language, filtering will happen
     later in the pipeline. If the scan does not have a smart way to apply the filter (e.g. projection pushdown for a row-oriented file format
     will probably not bring massive speed improvements) then more data than necessary will be fetched and filtering will happen
     later in the pipeline.

> -- Steve

--
Douglas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
  2025-08-06 12:39     ` Steven Rostedt
@ 2025-08-06 16:42       ` Douglas Raillard
  2025-08-12 19:22         ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 16:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo

On 06-08-2025 13:39, Steven Rostedt wrote:
> On Wed, 6 Aug 2025 11:24:33 +0100
> Douglas Raillard <douglas.raillard@arm.com> wrote:
> 
>> On 05-08-2025 20:26, Steven Rostedt wrote:
>>> From: Steven Rostedt <rostedt@goodmis.org>
>>>
>>> Currently the syscall trace events show each value as hexadecimal, but
>>> without adding "0x" it can be confusing:
>>>
>>>      sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
>>>
>>> Looks like the above write wrote 44 bytes, when in reality it wrote 68
>>> bytes.
>>>
>>> Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
>>> For values less than 10, leave off the "0x" as that just adds noise to the
>>> output.
>>
>> I'm on the fence for the value-dependent format. This looks like
>> it could easily make life harder for quick&dirty scripts, but then both
>> awk's strtonum() and Python's int(x, base=16) seem to handle
>> the presence/absence of the 0x prefix so maybe it's a non-issue.
> 
> Yes, but the trace file is more for humans than scripts, and the 0x1 is
> just noise.
> 
>>
>> OTH, a hand-crafted regex designed after a small set of input may start
>> to randomly fail if one field unexpectedly goes beyond 10 ...
> 
> If you are doing hand crafted scripts, I'd suggest to use tracing
> filters or other tooling that can handle this easily ;-)

Well, we both know I don't do this sort of things (or do I ? :) ) but for
the first 5+ years of its existence, LISA was actually parsing the text
output of trace-cmd, and I'd be surprised if it was the only place this was done.

AFAIK to this day, there is no tool providing a simple script/SQL interface
to a trace.dat file beyond basic filtering like trace-cmd report. I've had this
PR [1] opened for a while in LISA that does exactly that but got distracted with
other things:

   # Show all the CPUs that emitted a cpu_frequency event
   lisa-trace sql trace.dat --query 'SELECT DISTINCT cpu_id FROM cpu_frequency'

   shape: (8, 1)
   ┌────────┐
   │ cpu_id │
   │ ---    │
   │ u32    │
   ╞════════╡
   │ 0      │
   │ 1      │
   │ 2      │
   │ 3      │
   │ 7      │
   │ 4      │
   │ 5      │
   │ 6      │
   └────────┘


[1] https://gitlab.arm.com/tooling/lisa/-/merge_requests/2444

> 
>>
>> Just using explicit hex may be the best here, as the actual proper fix
>> (type-level display hints) is harder. It could probably be implemented
>> using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.
> 
> That would add a dependency on BTF, which I would like to avoid.
> 
> Thanks!
> 
> -- Steve
> 

--
Douglas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
  2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
@ 2025-08-06 18:39   ` Paul E. McKenney
  2025-08-06 22:05   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2025-08-06 18:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo,
	Douglas Raillard

On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> The syscall events are pseudo events that hook to the raw syscalls. The
> ftrace_syscall_enter/exit() callback is called by the raw_syscall
> enter/exit tracepoints respectively whenever any of the syscall events are
> enabled.
> 
> The trace_array has an array of syscall "files" that correspond to the
> system calls based on their __NR_SYSCALL number. The array is read and if
> there's a pointer to a trace_event_file then it is considered enabled and
> if it is NULL that syscall event is considered disabled.
> 
> Currently it uses an rcu_dereference_sched() to get this pointer and a
> rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
> as the file pointer will not go away outside the synchronization of the
> tracepoint logic itself. And this code adds no extra RCU synchronization
> that uses this.
> 
> Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
> is all they need. This will also allow this code to not depend on
> preemption being disabled as system call tracepoints are now allowed to
> fault.
> 
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

From an RCU-removal viewpoint:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

But is it possible to give some sort of warning just in case some creative
future developer figures out how to make the file pointer go away outside
of the synchronization of the tracepoint logic itself?

							Thanx, Paul

> ---
>  kernel/trace/trace_syscalls.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 46aab0ab9350..3a0b65f89130 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;
>  
> -	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> -	trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> +	trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
>  	if (!trace_file)
>  		return;
>  
> @@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;
>  
> -	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
> -	trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
> +	trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
>  	if (!trace_file)
>  		return;
>  
> @@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
>  	if (!tr->sys_refcount_enter)
>  		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
>  	if (!ret) {
> -		rcu_assign_pointer(tr->enter_syscall_files[num], file);
> +		WRITE_ONCE(tr->enter_syscall_files[num], file);
>  		tr->sys_refcount_enter++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
> @@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
>  		return;
>  	mutex_lock(&syscall_trace_lock);
>  	tr->sys_refcount_enter--;
> -	RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
> +	WRITE_ONCE(tr->enter_syscall_files[num], NULL);
>  	if (!tr->sys_refcount_enter)
>  		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
>  	mutex_unlock(&syscall_trace_lock);
> @@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
>  	if (!tr->sys_refcount_exit)
>  		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
>  	if (!ret) {
> -		rcu_assign_pointer(tr->exit_syscall_files[num], file);
> +		WRITE_ONCE(tr->exit_syscall_files[num], file);
>  		tr->sys_refcount_exit++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
> @@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
>  		return;
>  	mutex_lock(&syscall_trace_lock);
>  	tr->sys_refcount_exit--;
> -	RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
> +	WRITE_ONCE(tr->exit_syscall_files[num], NULL);
>  	if (!tr->sys_refcount_exit)
>  		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
>  	mutex_unlock(&syscall_trace_lock);
> -- 
> 2.47.2
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
  2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
  2025-08-06 18:39   ` Paul E. McKenney
@ 2025-08-06 22:05   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 22:05 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Linux Memory Management List, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
	Ian Rogers, aahringo, Douglas Raillard, Paul E. McKenney

Hi Steven,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193234.745705874%40kernel.org
patch subject: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070546.KmEoxWkg-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/trace/trace_syscalls.c:313:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:313:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:313:20: sparse:     got struct trace_event_file [noderef] __rcu *
   kernel/trace/trace_syscalls.c:358:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:358:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:358:20: sparse:     got struct trace_event_file [noderef] __rcu *
>> kernel/trace/trace_syscalls.c:394:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:394:17: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:394:17: sparse:     got struct trace_event_file *file
   kernel/trace/trace_syscalls.c:432:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:432:17: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:432:17: sparse:     got struct trace_event_file *file

vim +313 kernel/trace/trace_syscalls.c

   290	
   291	static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
   292	{
   293		struct trace_array *tr = data;
   294		struct trace_event_file *trace_file;
   295		struct syscall_trace_enter *entry;
   296		struct syscall_metadata *sys_data;
   297		struct trace_event_buffer fbuffer;
   298		unsigned long args[6];
   299		int syscall_nr;
   300		int size;
   301	
   302		/*
   303		 * Syscall probe called with preemption enabled, but the ring
   304		 * buffer and per-cpu data require preemption to be disabled.
   305		 */
   306		might_fault();
   307		guard(preempt_notrace)();
   308	
   309		syscall_nr = trace_get_syscall_nr(current, regs);
   310		if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
   311			return;
   312	
 > 313		trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
   314		if (!trace_file)
   315			return;
   316	
   317		if (trace_trigger_soft_disabled(trace_file))
   318			return;
   319	
   320		sys_data = syscall_nr_to_meta(syscall_nr);
   321		if (!sys_data)
   322			return;
   323	
   324		size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
   325	
   326		entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
   327		if (!entry)
   328			return;
   329	
   330		entry = ring_buffer_event_data(fbuffer.event);
   331		entry->nr = syscall_nr;
   332		syscall_get_arguments(current, regs, args);
   333		memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
   334	
   335		trace_event_buffer_commit(&fbuffer);
   336	}
   337	
   338	static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
   339	{
   340		struct trace_array *tr = data;
   341		struct trace_event_file *trace_file;
   342		struct syscall_trace_exit *entry;
   343		struct syscall_metadata *sys_data;
   344		struct trace_event_buffer fbuffer;
   345		int syscall_nr;
   346	
   347		/*
   348		 * Syscall probe called with preemption enabled, but the ring
   349		 * buffer and per-cpu data require preemption to be disabled.
   350		 */
   351		might_fault();
   352		guard(preempt_notrace)();
   353	
   354		syscall_nr = trace_get_syscall_nr(current, regs);
   355		if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
   356			return;
   357	
   358		trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
   359		if (!trace_file)
   360			return;
   361	
   362		if (trace_trigger_soft_disabled(trace_file))
   363			return;
   364	
   365		sys_data = syscall_nr_to_meta(syscall_nr);
   366		if (!sys_data)
   367			return;
   368	
   369		entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
   370		if (!entry)
   371			return;
   372	
   373		entry = ring_buffer_event_data(fbuffer.event);
   374		entry->nr = syscall_nr;
   375		entry->ret = syscall_get_return_value(current, regs);
   376	
   377		trace_event_buffer_commit(&fbuffer);
   378	}
   379	
   380	static int reg_event_syscall_enter(struct trace_event_file *file,
   381					   struct trace_event_call *call)
   382	{
   383		struct trace_array *tr = file->tr;
   384		int ret = 0;
   385		int num;
   386	
   387		num = ((struct syscall_metadata *)call->data)->syscall_nr;
   388		if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
   389			return -ENOSYS;
   390		mutex_lock(&syscall_trace_lock);
   391		if (!tr->sys_refcount_enter)
   392			ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
   393		if (!ret) {
 > 394			WRITE_ONCE(tr->enter_syscall_files[num], file);
   395			tr->sys_refcount_enter++;
   396		}
   397		mutex_unlock(&syscall_trace_lock);
   398		return ret;
   399	}
   400	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
  2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
@ 2025-08-06 23:52   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 23:52 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Linux Memory Management List, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
	Ian Rogers, aahringo, Douglas Raillard

Hi Steven,

kernel test robot noticed the following build warnings:

[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/20250805193235.582013098%40kernel.org
patch subject: [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070706.TiTQY0Ne-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070706.TiTQY0Ne-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070706.TiTQY0Ne-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/trace/trace_syscalls.c:620:53: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char *ptr @@
   kernel/trace/trace_syscalls.c:620:53: sparse:     expected void const [noderef] __user *from
   kernel/trace/trace_syscalls.c:620:53: sparse:     got char *ptr
>> kernel/trace/trace_syscalls.c:623:54: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected char const [noderef] __user *src @@     got char *ptr @@
   kernel/trace/trace_syscalls.c:623:54: sparse:     expected char const [noderef] __user *src
   kernel/trace/trace_syscalls.c:623:54: sparse:     got char *ptr
   kernel/trace/trace_syscalls.c:707:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:707:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:707:20: sparse:     got struct trace_event_file [noderef] __rcu *
   kernel/trace/trace_syscalls.c:824:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file *trace_file @@     got struct trace_event_file [noderef] __rcu * @@
   kernel/trace/trace_syscalls.c:824:20: sparse:     expected struct trace_event_file *trace_file
   kernel/trace/trace_syscalls.c:824:20: sparse:     got struct trace_event_file [noderef] __rcu *
   kernel/trace/trace_syscalls.c:871:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:871:9: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:871:9: sparse:     got struct trace_event_file *file
   kernel/trace/trace_syscalls.c:909:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct trace_event_file [noderef] __rcu *volatile @@     got struct trace_event_file *file @@
   kernel/trace/trace_syscalls.c:909:17: sparse:     expected struct trace_event_file [noderef] __rcu *volatile
   kernel/trace/trace_syscalls.c:909:17: sparse:     got struct trace_event_file *file

vim +620 kernel/trace/trace_syscalls.c

6bc850d6f8f730 Steven Rostedt 2025-08-05  526  
623bd9e046f95c Steven Rostedt 2025-08-05  527  static char *sys_fault_user(struct syscall_metadata *sys_data,
623bd9e046f95c Steven Rostedt 2025-08-05  528  			    struct syscall_buf_info *sinfo,
623bd9e046f95c Steven Rostedt 2025-08-05  529  			    unsigned long *args,
623bd9e046f95c Steven Rostedt 2025-08-05  530  			    unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
6bc850d6f8f730 Steven Rostedt 2025-08-05  531  {
623bd9e046f95c Steven Rostedt 2025-08-05  532  	char *buffer = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
6bc850d6f8f730 Steven Rostedt 2025-08-05  533  	unsigned long mask = sys_data->user_mask;
623bd9e046f95c Steven Rostedt 2025-08-05  534  	unsigned long size = SYSCALL_FAULT_ARG_SZ - 1;
6bc850d6f8f730 Steven Rostedt 2025-08-05  535  	unsigned int cnt;
b979d33ec48bbd Steven Rostedt 2025-08-05  536  	bool array = false;
623bd9e046f95c Steven Rostedt 2025-08-05  537  	char *ptr_array[SYSCALL_FAULT_MAX_CNT];
623bd9e046f95c Steven Rostedt 2025-08-05  538  	char *buf;
623bd9e046f95c Steven Rostedt 2025-08-05  539  	int read[SYSCALL_FAULT_MAX_CNT];
6bc850d6f8f730 Steven Rostedt 2025-08-05  540  	int trys = 0;
623bd9e046f95c Steven Rostedt 2025-08-05  541  	int uargs;
6bc850d6f8f730 Steven Rostedt 2025-08-05  542  	int ret;
623bd9e046f95c Steven Rostedt 2025-08-05  543  	int i = 0;
623bd9e046f95c Steven Rostedt 2025-08-05  544  
623bd9e046f95c Steven Rostedt 2025-08-05  545  	/* The extra is appended to the user data in the buffer */
623bd9e046f95c Steven Rostedt 2025-08-05  546  	BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=
623bd9e046f95c Steven Rostedt 2025-08-05  547  		     SYSCALL_FAULT_ARG_SZ);
623bd9e046f95c Steven Rostedt 2025-08-05  548  
623bd9e046f95c Steven Rostedt 2025-08-05  549  	/*
623bd9e046f95c Steven Rostedt 2025-08-05  550  	 * If this system call event has a size argument, use
623bd9e046f95c Steven Rostedt 2025-08-05  551  	 * it to define how much of user space memory to read,
623bd9e046f95c Steven Rostedt 2025-08-05  552  	 * and read it as an array and not a string.
623bd9e046f95c Steven Rostedt 2025-08-05  553  	 */
623bd9e046f95c Steven Rostedt 2025-08-05  554  	if (sys_data->user_arg_size >= 0) {
623bd9e046f95c Steven Rostedt 2025-08-05  555  		array = true;
623bd9e046f95c Steven Rostedt 2025-08-05  556  		size = args[sys_data->user_arg_size];
623bd9e046f95c Steven Rostedt 2025-08-05  557  		if (size > SYSCALL_FAULT_ARG_SZ - 1)
623bd9e046f95c Steven Rostedt 2025-08-05  558  			size = SYSCALL_FAULT_ARG_SZ - 1;
623bd9e046f95c Steven Rostedt 2025-08-05  559  	}
623bd9e046f95c Steven Rostedt 2025-08-05  560  
623bd9e046f95c Steven Rostedt 2025-08-05  561  	while (mask) {
623bd9e046f95c Steven Rostedt 2025-08-05  562  		int idx = ffs(mask) - 1;
623bd9e046f95c Steven Rostedt 2025-08-05  563  		mask &= ~BIT(idx);
623bd9e046f95c Steven Rostedt 2025-08-05  564  
623bd9e046f95c Steven Rostedt 2025-08-05  565  		if (WARN_ON_ONCE(i == SYSCALL_FAULT_MAX_CNT))
623bd9e046f95c Steven Rostedt 2025-08-05  566  			break;
6bc850d6f8f730 Steven Rostedt 2025-08-05  567  
6bc850d6f8f730 Steven Rostedt 2025-08-05  568  		/* Get the pointer to user space memory to read */
623bd9e046f95c Steven Rostedt 2025-08-05  569  		ptr_array[i++] = (char *)args[idx];
623bd9e046f95c Steven Rostedt 2025-08-05  570  	}
623bd9e046f95c Steven Rostedt 2025-08-05  571  
623bd9e046f95c Steven Rostedt 2025-08-05  572  	uargs = i;
623bd9e046f95c Steven Rostedt 2025-08-05  573  
623bd9e046f95c Steven Rostedt 2025-08-05  574  	/* Clear the values that are not used */
623bd9e046f95c Steven Rostedt 2025-08-05  575  	for (; i < SYSCALL_FAULT_MAX_CNT; i++) {
623bd9e046f95c Steven Rostedt 2025-08-05  576  		data_size[i] = -1; /* Denotes no pointer */
623bd9e046f95c Steven Rostedt 2025-08-05  577  	}
6bc850d6f8f730 Steven Rostedt 2025-08-05  578  
6bc850d6f8f730 Steven Rostedt 2025-08-05  579   again:
6bc850d6f8f730 Steven Rostedt 2025-08-05  580  	/*
6bc850d6f8f730 Steven Rostedt 2025-08-05  581  	 * If this task is preempted by another user space task, it
6bc850d6f8f730 Steven Rostedt 2025-08-05  582  	 * will cause this task to try again. But just in case something
6bc850d6f8f730 Steven Rostedt 2025-08-05  583  	 * changes where the copying from user space causes another task
6bc850d6f8f730 Steven Rostedt 2025-08-05  584  	 * to run, prevent this from going into an infinite loop.
6bc850d6f8f730 Steven Rostedt 2025-08-05  585  	 * 10 tries should be plenty.
6bc850d6f8f730 Steven Rostedt 2025-08-05  586  	 */
6bc850d6f8f730 Steven Rostedt 2025-08-05  587  	if (trys++ > 10) {
6bc850d6f8f730 Steven Rostedt 2025-08-05  588  		static bool once;
6bc850d6f8f730 Steven Rostedt 2025-08-05  589  		/*
6bc850d6f8f730 Steven Rostedt 2025-08-05  590  		 * Only print a message instead of a WARN_ON() as this could
6bc850d6f8f730 Steven Rostedt 2025-08-05  591  		 * theoretically trigger under real load.
6bc850d6f8f730 Steven Rostedt 2025-08-05  592  		 */
6bc850d6f8f730 Steven Rostedt 2025-08-05  593  		if (!once)
6bc850d6f8f730 Steven Rostedt 2025-08-05  594  			pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
6bc850d6f8f730 Steven Rostedt 2025-08-05  595  		once = true;
623bd9e046f95c Steven Rostedt 2025-08-05  596  		return buffer;
6bc850d6f8f730 Steven Rostedt 2025-08-05  597  	}
6bc850d6f8f730 Steven Rostedt 2025-08-05  598  
6bc850d6f8f730 Steven Rostedt 2025-08-05  599  	/* Read the current sched switch count */
6bc850d6f8f730 Steven Rostedt 2025-08-05  600  	cnt = this_cpu_read(sched_switch_cnt);
6bc850d6f8f730 Steven Rostedt 2025-08-05  601  
6bc850d6f8f730 Steven Rostedt 2025-08-05  602  	/*
6bc850d6f8f730 Steven Rostedt 2025-08-05  603  	 * Preemption is going to be enabled, but this task must
6bc850d6f8f730 Steven Rostedt 2025-08-05  604  	 * remain on this CPU.
6bc850d6f8f730 Steven Rostedt 2025-08-05  605  	 */
6bc850d6f8f730 Steven Rostedt 2025-08-05  606  	migrate_disable();
6bc850d6f8f730 Steven Rostedt 2025-08-05  607  
6bc850d6f8f730 Steven Rostedt 2025-08-05  608  	/*
6bc850d6f8f730 Steven Rostedt 2025-08-05  609  	 * Now preemption is being enabed and another task can come in
6bc850d6f8f730 Steven Rostedt 2025-08-05  610  	 * and use the same buffer and corrupt our data.
6bc850d6f8f730 Steven Rostedt 2025-08-05  611  	 */
6bc850d6f8f730 Steven Rostedt 2025-08-05  612  	preempt_enable_notrace();
6bc850d6f8f730 Steven Rostedt 2025-08-05  613  
623bd9e046f95c Steven Rostedt 2025-08-05  614  	buf = buffer;
623bd9e046f95c Steven Rostedt 2025-08-05  615  
623bd9e046f95c Steven Rostedt 2025-08-05  616  	for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
623bd9e046f95c Steven Rostedt 2025-08-05  617  		char *ptr = ptr_array[i];
623bd9e046f95c Steven Rostedt 2025-08-05  618  
b979d33ec48bbd Steven Rostedt 2025-08-05  619  		if (array) {
b979d33ec48bbd Steven Rostedt 2025-08-05 @620  			ret = __copy_from_user(buf, ptr, size);
b979d33ec48bbd Steven Rostedt 2025-08-05  621  			ret = ret ? -1 : size;
b979d33ec48bbd Steven Rostedt 2025-08-05  622  		} else {
6bc850d6f8f730 Steven Rostedt 2025-08-05 @623  			ret = strncpy_from_user(buf, ptr, size);
b979d33ec48bbd Steven Rostedt 2025-08-05  624  		}
623bd9e046f95c Steven Rostedt 2025-08-05  625  		read[i] = ret;
623bd9e046f95c Steven Rostedt 2025-08-05  626  	}
6bc850d6f8f730 Steven Rostedt 2025-08-05  627  
6bc850d6f8f730 Steven Rostedt 2025-08-05  628  	preempt_disable_notrace();
6bc850d6f8f730 Steven Rostedt 2025-08-05  629  	migrate_enable();
6bc850d6f8f730 Steven Rostedt 2025-08-05  630  
6bc850d6f8f730 Steven Rostedt 2025-08-05  631  	/*
6bc850d6f8f730 Steven Rostedt 2025-08-05  632  	 * Preemption is disabled again, now check the sched_switch_cnt.
6bc850d6f8f730 Steven Rostedt 2025-08-05  633  	 * If it increased by two or more, then another user space process
6bc850d6f8f730 Steven Rostedt 2025-08-05  634  	 * may have schedule in and corrupted our buffer. In that case
6bc850d6f8f730 Steven Rostedt 2025-08-05  635  	 * the copying must be retried.
6bc850d6f8f730 Steven Rostedt 2025-08-05  636  	 *
6bc850d6f8f730 Steven Rostedt 2025-08-05  637  	 * Note, if this task was scheduled out and only kernel threads
6bc850d6f8f730 Steven Rostedt 2025-08-05  638  	 * were scheduled in (maybe to process the fault), then the
6bc850d6f8f730 Steven Rostedt 2025-08-05  639  	 * counter would increment again when this task scheduled in.
6bc850d6f8f730 Steven Rostedt 2025-08-05  640  	 * If this task scheduled out and another user task scheduled
6bc850d6f8f730 Steven Rostedt 2025-08-05  641  	 * in, this task would still need to be scheduled back in and
6bc850d6f8f730 Steven Rostedt 2025-08-05  642  	 * the counter would increment by at least two.
6bc850d6f8f730 Steven Rostedt 2025-08-05  643  	 */
6bc850d6f8f730 Steven Rostedt 2025-08-05  644  	if (this_cpu_read(sched_switch_cnt) > cnt + 1)
6bc850d6f8f730 Steven Rostedt 2025-08-05  645  		goto again;
6bc850d6f8f730 Steven Rostedt 2025-08-05  646  
623bd9e046f95c Steven Rostedt 2025-08-05  647  	buf = buffer;
623bd9e046f95c Steven Rostedt 2025-08-05  648  	for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
623bd9e046f95c Steven Rostedt 2025-08-05  649  
623bd9e046f95c Steven Rostedt 2025-08-05  650  		ret = read[i];
623bd9e046f95c Steven Rostedt 2025-08-05  651  		if (ret < 0)
623bd9e046f95c Steven Rostedt 2025-08-05  652  			continue;
623bd9e046f95c Steven Rostedt 2025-08-05  653  		buf[ret] = '\0';
623bd9e046f95c Steven Rostedt 2025-08-05  654  
b979d33ec48bbd Steven Rostedt 2025-08-05  655  		/* For strings, replace any non-printable characters with '.' */
b979d33ec48bbd Steven Rostedt 2025-08-05  656  		if (!array) {
623bd9e046f95c Steven Rostedt 2025-08-05  657  			for (int x = 0; x < ret; x++) {
623bd9e046f95c Steven Rostedt 2025-08-05  658  				if (!isprint(buf[x]))
623bd9e046f95c Steven Rostedt 2025-08-05  659  					buf[x] = '.';
6bc850d6f8f730 Steven Rostedt 2025-08-05  660  			}
6bc850d6f8f730 Steven Rostedt 2025-08-05  661  
6bc850d6f8f730 Steven Rostedt 2025-08-05  662  			/*
623bd9e046f95c Steven Rostedt 2025-08-05  663  			 * If the text was truncated due to our max limit,
623bd9e046f95c Steven Rostedt 2025-08-05  664  			 * add "..." to the string.
6bc850d6f8f730 Steven Rostedt 2025-08-05  665  			 */
623bd9e046f95c Steven Rostedt 2025-08-05  666  			if (ret > SYSCALL_FAULT_USER_MAX) {
623bd9e046f95c Steven Rostedt 2025-08-05  667  				strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
623bd9e046f95c Steven Rostedt 2025-08-05  668  					sizeof(EXTRA));
623bd9e046f95c Steven Rostedt 2025-08-05  669  				ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
6bc850d6f8f730 Steven Rostedt 2025-08-05  670  			} else {
6bc850d6f8f730 Steven Rostedt 2025-08-05  671  				buf[ret++] = '\0';
6bc850d6f8f730 Steven Rostedt 2025-08-05  672  			}
623bd9e046f95c Steven Rostedt 2025-08-05  673  		} else {
623bd9e046f95c Steven Rostedt 2025-08-05  674  			ret = min(ret, SYSCALL_FAULT_USER_MAX);
623bd9e046f95c Steven Rostedt 2025-08-05  675  		}
623bd9e046f95c Steven Rostedt 2025-08-05  676  		data_size[i] = ret;
b979d33ec48bbd Steven Rostedt 2025-08-05  677  	}
6bc850d6f8f730 Steven Rostedt 2025-08-05  678  
623bd9e046f95c Steven Rostedt 2025-08-05  679  	return buffer;
6bc850d6f8f730 Steven Rostedt 2025-08-05  680  }
6bc850d6f8f730 Steven Rostedt 2025-08-05  681  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/7] tracing: Have syscall trace events read user space string
  2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
  2025-08-06 14:39   ` kernel test robot
@ 2025-08-11  5:19   ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-11  5:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: oe-lkp, lkp, linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mark Rutland, Mathieu Desnoyers, Andrew Morton, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
	Ian Rogers, aahringo, Douglas Raillard, oliver.sang



Hello,

kernel test robot noticed "BUG:KASAN:slab-out-of-bounds_in_syscall_fault_buffer_enable" on:

commit: 6bc850d6f8f7308a184edfd60ee1acdd89ced128 ("[PATCH 3/7] tracing: Have syscall trace events read user space string")
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/cgit/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/all/20250805193235.080757106@kernel.org/
patch subject: [PATCH 3/7] tracing: Have syscall trace events read user space string

in testcase: boot

config: x86_64-rhel-9.4-kunit
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------------------------+------------+------------+
|                                                               | 63f89ba6a0 | 6bc850d6f8 |
+---------------------------------------------------------------+------------+------------+
| BUG:KASAN:slab-out-of-bounds_in_syscall_fault_buffer_enable   | 0          | 24         |
+---------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508110616.33657b6c-lkp@intel.com


[ 47.226292][ T1] BUG: KASAN: slab-out-of-bounds in syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[   47.227603][    T1] Write of size 8 at addr ffff8881baea5f10 by task swapper/0/1
[   47.228735][    T1]
[   47.229107][    T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc7-00138-g6bc850d6f8f7 #1 PREEMPT(voluntary)
[   47.229114][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   47.229117][    T1] Call Trace:
[   47.229121][    T1]  <TASK>
[ 47.229124][ T1] dump_stack_lvl (lib/dump_stack.c:123 (discriminator 1)) 
[ 47.229134][ T1] print_address_description+0x2c/0x380 
[ 47.229142][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[ 47.229146][ T1] print_report (mm/kasan/report.c:481) 
[ 47.229150][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[ 47.229153][ T1] ? kasan_addr_to_slab (mm/kasan/common.c:37) 
[ 47.229156][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[ 47.229159][ T1] kasan_report (mm/kasan/report.c:595) 
[ 47.229164][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[ 47.229167][ T1] syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430) 
[ 47.229171][ T1] ? mutex_unlock (arch/x86/include/asm/atomic64_64.h:101 include/linux/atomic/atomic-arch-fallback.h:4329 include/linux/atomic/atomic-long.h:1506 include/linux/atomic/atomic-instrumented.h:4481 kernel/locking/mutex.c:167 kernel/locking/mutex.c:537) 
[ 47.229177][ T1] syscall_enter_register (kernel/trace/trace_syscalls.c:729 kernel/trace/trace_syscalls.c:1259) 
[ 47.229181][ T1] __ftrace_event_enable_disable (kernel/trace/trace_events.c:860) 
[ 47.229186][ T1] ? __pfx__printk (kernel/printk/printk.c:2470) 
[ 47.229192][ T1] __ftrace_set_clr_event_nolock (kernel/trace/trace_events.c:890 kernel/trace/trace_events.c:1353) 
[ 47.229197][ T1] event_trace_self_tests (kernel/trace/trace_events.c:1384 (discriminator 1) kernel/trace/trace_events.c:4779 (discriminator 1)) 
[ 47.229203][ T1] ? __pfx_event_trace_self_tests_init (kernel/trace/trace_events.c:4892) 
[ 47.229208][ T1] event_trace_self_tests_init (include/linux/list.h:373 kernel/trace/trace.h:487 kernel/trace/trace_events.c:4871 kernel/trace/trace_events.c:4894) 
[ 47.229212][ T1] do_one_initcall (init/main.c:1274) 
[ 47.229216][ T1] ? __pfx_do_one_initcall (init/main.c:1265) 
[ 47.229219][ T1] ? __pfx_parse_args (kernel/params.c:168) 
[ 47.229223][ T1] ? __kasan_kmalloc (include/linux/kfence.h:58 mm/kasan/common.c:390) 
[ 47.229227][ T1] ? do_initcalls (include/linux/slab.h:909 include/linux/slab.h:1039 init/main.c:1345) 
[ 47.229232][ T1] do_initcalls (init/main.c:1335 init/main.c:1352) 
[ 47.229236][ T1] kernel_init_freeable (init/main.c:1586) 
[ 47.229241][ T1] ? __pfx_kernel_init (init/main.c:1466) 
[ 47.229247][ T1] kernel_init (init/main.c:1476) 
[ 47.229251][ T1] ? calculate_sigpending (kernel/signal.c:194) 
[ 47.229256][ T1] ? __pfx_kernel_init (init/main.c:1466) 
[ 47.229259][ T1] ret_from_fork (arch/x86/kernel/process.c:154) 
[ 47.229265][ T1] ? __pfx_kernel_init (init/main.c:1466) 
[ 47.229269][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:258) 
[   47.229275][    T1]  </TASK>
[   47.229276][    T1]
[   47.262431][    T1] Allocated by task 1:
[ 47.263075][ T1] kasan_save_stack (mm/kasan/common.c:48) 
[ 47.263810][ T1] kasan_save_track (arch/x86/include/asm/current.h:25 mm/kasan/common.c:60 mm/kasan/common.c:69) 
[ 47.264530][ T1] __kasan_kmalloc (mm/kasan/common.c:377 mm/kasan/common.c:394) 
[ 47.265210][ T1] syscall_fault_buffer_enable (include/linux/slab.h:905 kernel/trace/trace_syscalls.c:426) 
[ 47.266034][ T1] syscall_enter_register (kernel/trace/trace_syscalls.c:729 kernel/trace/trace_syscalls.c:1259) 
[ 47.266825][ T1] __ftrace_event_enable_disable (kernel/trace/trace_events.c:860) 
[ 47.267743][ T1] __ftrace_set_clr_event_nolock (kernel/trace/trace_events.c:890 kernel/trace/trace_events.c:1353) 
[ 47.268609][ T1] event_trace_self_tests (kernel/trace/trace_events.c:1384 (discriminator 1) kernel/trace/trace_events.c:4779 (discriminator 1)) 
[ 47.269392][ T1] event_trace_self_tests_init (include/linux/list.h:373 kernel/trace/trace.h:487 kernel/trace/trace_events.c:4871 kernel/trace/trace_events.c:4894) 
[ 47.270229][ T1] do_one_initcall (init/main.c:1274) 
[ 47.270971][ T1] do_initcalls (init/main.c:1335 init/main.c:1352) 
[ 47.271669][ T1] kernel_init_freeable (init/main.c:1586) 
[ 47.272429][ T1] kernel_init (init/main.c:1476) 
[ 47.273127][ T1] ret_from_fork (arch/x86/kernel/process.c:154) 
[ 47.273817][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:258) 
[   47.274547][    T1]
[   47.274977][    T1] The buggy address belongs to the object at ffff8881baea5f00
[   47.274977][    T1]  which belongs to the cache kmalloc-8 of size 8
[   47.277002][    T1] The buggy address is located 8 bytes to the right of
[   47.277002][    T1]  allocated 8-byte region [ffff8881baea5f00, ffff8881baea5f08)
[   47.279110][    T1]
[   47.279551][    T1] The buggy address belongs to the physical page:
[   47.280447][    T1] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1baea5
[   47.281773][    T1] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[   47.282819][    T1] page_type: f5(slab)
[   47.283449][    T1] raw: 0017ffffc0000000 ffff888100041500 dead000000000122 0000000000000000
[   47.284709][    T1] raw: 0000000000000000 0000000080800080 00000000f5000000 0000000000000000
[   47.286003][    T1] page dumped because: kasan: bad access detected
[   47.286854][    T1] page_owner tracks the page as allocated
[   47.287710][    T1] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 1, tgid 1 (swapper/0), ts 35802055637, free_ts 0
[ 47.290101][ T1] post_alloc_hook (include/linux/page_owner.h:32 mm/page_alloc.c:1704) 
[ 47.290799][ T1] get_page_from_freelist (mm/page_alloc.c:1714 mm/page_alloc.c:3669) 
[ 47.291594][ T1] __alloc_frozen_pages_noprof (mm/page_alloc.c:4959) 
[ 47.292391][ T1] alloc_pages_mpol (mm/mempolicy.c:2421) 
[ 47.293117][ T1] allocate_slab (mm/slub.c:2451 mm/slub.c:2619) 
[ 47.293768][ T1] ___slab_alloc (mm/slub.c:3859 (discriminator 3)) 
[ 47.294447][ T1] __kmalloc_node_track_caller_noprof (mm/slub.c:3949 mm/slub.c:4024 mm/slub.c:4185 mm/slub.c:4327 mm/slub.c:4347) 
[ 47.295322][ T1] kstrdup (mm/util.c:63 mm/util.c:83) 
[ 47.295898][ T1] __kernfs_new_node (fs/kernfs/dir.c:634) 
[ 47.296626][ T1] kernfs_new_node (fs/kernfs/dir.c:713) 
[ 47.297380][ T1] kernfs_create_dir_ns (fs/kernfs/dir.c:1085) 
[ 47.298159][ T1] sysfs_create_dir_ns (fs/sysfs/dir.c:61) 
[ 47.298933][ T1] kobject_add_internal (lib/kobject.c:73 lib/kobject.c:240) 
[ 47.299712][ T1] kobject_init_and_add (lib/kobject.c:374 lib/kobject.c:457) 
[ 47.300448][ T1] net_rx_queue_update_kobjects (net/core/net-sysfs.c:1239 net/core/net-sysfs.c:1301) 
[ 47.301292][ T1] netdev_register_kobject (net/core/net-sysfs.c:2093 net/core/net-sysfs.c:2340) 
[   47.302015][    T1] page_owner free stack trace missing
[   47.302748][    T1]
[   47.303126][    T1] Memory state around the buggy address:
[   47.303914][    T1]  ffff8881baea5e00: 06 fc fc fc 06 fc fc fc 04 fc fc fc 06 fc fc fc
[   47.305085][    T1]  ffff8881baea5e80: 05 fc fc fc 05 fc fc fc 06 fc fc fc fc fc fc fc
[   47.306239][    T1] >ffff8881baea5f00: 00 fc fc fc 07 fc fc fc 00 fc fc fc fa fc fc fc
[   47.307378][    T1]                          ^
[   47.308014][    T1]  ffff8881baea5f80: fa fc fc fc fc fc fc fc 06 fc fc fc 06 fc fc fc
[   47.309220][    T1]  ffff8881baea6000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   47.310929][    T1] ==================================================================
[   47.312186][    T1] Disabling lock debugging due to kernel taint
[   47.318329][    T1] OK
[   47.318896][    T1] Testing event system hyperv: OK


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250811/202508110616.33657b6c-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
  2025-08-06 16:42       ` Douglas Raillard
@ 2025-08-12 19:22         ` Steven Rostedt
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-12 19:22 UTC (permalink / raw)
  To: Douglas Raillard
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
	Thomas Gleixner, Ian Rogers, aahringo

On Wed, 6 Aug 2025 17:42:38 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:

> 
> AFAIK to this day, there is no tool providing a simple script/SQL interface
> to a trace.dat file beyond basic filtering like trace-cmd report. I've had this


Well, I did have this patch that I never applied:

https://lore.kernel.org/all/20200116104804.5d2f71e2@gandalf.local.home/

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-08-12 19:22 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-08-06 18:39   ` Paul E. McKenney
2025-08-06 22:05   ` kernel test robot
2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
2025-08-06 10:24   ` Douglas Raillard
2025-08-06 12:39     ` Steven Rostedt
2025-08-06 16:42       ` Douglas Raillard
2025-08-12 19:22         ` Steven Rostedt
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
2025-08-06 14:39   ` kernel test robot
2025-08-11  5:19   ` kernel test robot
2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
2025-08-06 15:12   ` kernel test robot
2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
2025-08-06 23:52   ` kernel test robot
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
2025-08-06 10:50   ` Douglas Raillard
2025-08-06 12:43     ` Steven Rostedt
2025-08-06 16:21       ` Douglas Raillard
2025-08-06 14:50   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).