* [PATCH 0/7] tracing: Show contents of syscall trace event user space fields
@ 2025-08-05 19:26 Steven Rostedt
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.
Introduce a way to read from user space addresses from the syscall trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.
When a syscall trace event needs to read user memory, it reads a per CPU
counter that gets updated every time a user space task is scheduled on the
CPU. It then enables preemption, copies the user space memory into this
buffer, then disables preemption again. If the counter is less than two
from its original value the buffer is valid. Otherwise it needs to try
again.
The reason to check for less than two and not equal to the previous value
is because scheduling kernel tasks is fine. Only user space tasks will
write to this buffer. If the task schedules out and only kernel tasks run
and the tasks schedules back in, the counter will be incremented by one.
A new file is created in the tracefs directory (and also per instance) that
allows the user to shorten the amount copied from user space. It can be
completely disabled if set to zero (it will only display "" or (, ...)
but no copying from user space will be performed). The max size to copy is
hard coded to 128, which should be enough for this purpose.
This allows the output to look like this:
sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)
sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)
Steven Rostedt (7):
tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
tracing: Have syscall trace events show "0x" for values greater than 10
tracing: Have syscall trace events read user space string
tracing: Have system call events record user array data
tracing: Display some syscall arrays as strings
tracing: Allow syscall trace events to read more than one user parameter
tracing: Add syscall_user_buf_size to limit amount written
----
Documentation/trace/ftrace.rst | 7 +
include/trace/syscall.h | 8 +-
kernel/trace/Kconfig | 13 +
kernel/trace/trace.c | 52 +++
kernel/trace/trace.h | 3 +
kernel/trace/trace_syscalls.c | 697 +++++++++++++++++++++++++++++++++++++++--
6 files changed, 750 insertions(+), 30 deletions(-)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 18:39 ` Paul E. McKenney
2025-08-06 22:05 ` kernel test robot
2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
` (5 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard,
Paul E. McKenney
From: Steven Rostedt <rostedt@goodmis.org>
The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.
The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.
Currently it uses an rcu_dereference_sched() to get this pointer and a
rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
as the file pointer will not go away outside the synchronization of the
tracepoint logic itself. And this code adds no extra RCU synchronization
that uses this.
Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 46aab0ab9350..3a0b65f89130 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
- trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
- trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
if (!tr->sys_refcount_enter)
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
- rcu_assign_pointer(tr->enter_syscall_files[num], file);
+ WRITE_ONCE(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
}
mutex_unlock(&syscall_trace_lock);
@@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_enter--;
- RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
+ WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
mutex_unlock(&syscall_trace_lock);
@@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
if (!tr->sys_refcount_exit)
ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
- rcu_assign_pointer(tr->exit_syscall_files[num], file);
+ WRITE_ONCE(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
}
mutex_unlock(&syscall_trace_lock);
@@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_exit--;
- RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
+ WRITE_ONCE(tr->exit_syscall_files[num], NULL);
if (!tr->sys_refcount_exit)
unregister_trace_sys_exit(ftrace_syscall_exit, tr);
mutex_unlock(&syscall_trace_lock);
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 10:24 ` Douglas Raillard
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Currently the syscall trace events show each value as hexadecimal, but
without adding "0x" it can be confusing:
sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
Looks like the above write wrote 44 bytes, when in reality it wrote 68
bytes.
Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
For values less than 10, leave off the "0x" as that just adds noise to the
output.
Also change the iterator to check if "i" is nonzero and print the ", "
delimiter at the start, then adding the logic to the trace_seq_printf() at
the end.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 3a0b65f89130..0f932b22f9ec 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
if (trace_seq_has_overflowed(s))
goto end;
+ if (i)
+ trace_seq_puts(s, ", ");
+
/* parameter types */
if (tr && tr->trace_flags & TRACE_ITER_VERBOSE)
trace_seq_printf(s, "%s ", entry->types[i]);
/* parameter values */
- trace_seq_printf(s, "%s: %lx%s", entry->args[i],
- trace->args[i],
- i == entry->nb_args - 1 ? "" : ", ");
+ if (trace->args[i] < 10)
+ trace_seq_printf(s, "%s: %lu", entry->args[i],
+ trace->args[i]);
+ else
+ trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
+ trace->args[i]);
}
trace_seq_putc(s, ')');
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 3/7] tracing: Have syscall trace events read user space string
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 14:39 ` kernel test robot
2025-08-11 5:19 ` kernel test robot
2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
` (3 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.
Introduce a way to read strings that are nul terminated into the trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.
When a syscall trace event needs to read user memory, it reads a per CPU
counter that gets updated every time a user space task is scheduled on the
CPU. It then enables preemption, copies the user space memory into this
buffer, then disables preemption again. If the counter is less than two
from its original value the buffer is valid. Otherwise it needs to try
again.
The reason to check for less than two and not equal to the previous value
is because scheduling kernel tasks is fine. Only user space tasks will
write to this buffer. If the task schedules out and only kernel tasks run
and the tasks schedules back in, the counter will be incremented by one.
The syscall event has its nb_args shorten from an int to a short (where
even u8 is plenty big enough) and the freed two bytes are used for
"user_mask". The new "user_mask" field is used to store the index of the
"args" field array that has the address to read from user space. This
value is set to 0 if the system call event does not need to read user
space for a field. This mask can be used to know if the event may fault or
not. Only one bit set in user_mask is supported at this time.
This allows the output to look like this:
sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/trace/syscall.h | 4 +-
kernel/trace/trace_syscalls.c | 496 ++++++++++++++++++++++++++++++++--
2 files changed, 480 insertions(+), 20 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 8e193f3a33b3..85f21ca15a41 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_mask: mask of @args that will read user space
* @types: list of types as strings
* @args: list of args as strings (args[i] matches types[i])
* @enter_fields: list of fields for syscall_enter trace event
@@ -25,7 +26,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- int nb_args;
+ short nb_args;
+ short user_mask;
const char **types;
const char **args;
struct list_head enter_fields;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 0f932b22f9ec..3233319ce266 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -10,6 +10,8 @@
#include <linux/xarray.h>
#include <asm/syscall.h>
+#include <trace/events/sched.h>
+
#include "trace_output.h"
#include "trace.h"
@@ -123,6 +125,9 @@ const char *get_syscall_name(int syscall)
return entry->name;
}
+/* Added to user strings when max limit is reached */
+#define EXTRA "..."
+
static enum print_line_t
print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event)
@@ -132,7 +137,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_entry *ent = iter->ent;
struct syscall_trace_enter *trace;
struct syscall_metadata *entry;
- int i, syscall;
+ int i, syscall, val;
+ unsigned char *ptr;
trace = (typeof(trace))ent;
syscall = trace->nr;
@@ -167,6 +173,17 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
else
trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
trace->args[i]);
+
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+
+ /* This arg points to a user space string */
+ ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+ val = *(int *)ptr;
+
+ ptr = (void *)ent + (val & 0xffff);
+
+ trace_seq_printf(s, " \"%s\"", ptr);
}
trace_seq_putc(s, ')');
@@ -223,15 +240,27 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
- pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx%s",
- entry->args[i], sizeof(unsigned long),
- i == entry->nb_args - 1 ? "" : ", ");
+ if (i)
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", ");
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx",
+ entry->args[i], sizeof(unsigned long));
+
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+
+ /* Add the format for the user space string */
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
}
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
pos += snprintf(buf + pos, LEN_OR_ZERO,
", ((unsigned long)(REC->%s))", entry->args[i]);
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+ /* The user space string for arg has name __<arg>_val */
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+ entry->args[i]);
}
#undef LEN_OR_ZERO
@@ -277,8 +306,12 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
{
struct syscall_trace_enter trace;
struct syscall_metadata *meta = call->data;
+ unsigned long mask;
+ char *arg;
int offset = offsetof(typeof(trace), args);
+ int idx;
int ret = 0;
+ int len;
int i;
for (i = 0; i < meta->nb_args; i++) {
@@ -291,9 +324,252 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
offset += sizeof(unsigned long);
}
+ if (ret || !meta->user_mask)
+ return ret;
+
+ mask = meta->user_mask;
+ idx = ffs(mask) - 1;
+
+ /*
+ * User space strings are faulted into a temporary buffer and then
+ * added as a dynamic string to the end of the event.
+ * The user space string name for the arg pointer is "__<arg>_val".
+ */
+ len = strlen(meta->args[idx]) + sizeof("___val");
+ arg = kmalloc(len, GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg)) {
+ meta->user_mask = 0;
+ return -ENOMEM;
+ }
+
+ snprintf(arg, len, "__%s_val", meta->args[idx]);
+
+ ret = trace_define_field(call, "__data_loc char[]",
+ arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret)
+ kfree(arg);
return ret;
}
+struct syscall_buf {
+ char *buf;
+};
+
+struct syscall_buf_info {
+ struct rcu_head rcu;
+ struct syscall_buf __percpu *sbuf;
+};
+
+/* Create a per CPU temporary buffer to copy user space pointers into */
+#define SYSCALL_FAULT_BUF_SZ 512
+static struct syscall_buf_info *syscall_buffer;
+static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
+
+static int syscall_fault_buffer_cnt;
+
+static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
+{
+ char *buf;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ buf = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
+ kfree(buf);
+ }
+ kfree(sinfo);
+}
+
+static void rcu_free_syscall_buffer(struct rcu_head *rcu)
+{
+ struct syscall_buf_info *sinfo = container_of(rcu, struct syscall_buf_info, rcu);
+
+ syscall_fault_buffer_free(sinfo);
+}
+
+/*
+ * The per CPU buffer syscall_fault_buffer is written to optimstically.
+ * The counter sched_switch_cnt is taken, preemption is enabled,
+ * the copying of the user space memory is placed into the syscall_fault_buffer,
+ * Preeption is re-enabled and the count is read again. If the count is greater
+ * than one from its previous reading, it means that another user space
+ * task scheduled in and the buffer is unreliable for use.
+ */
+static void
+probe_sched_switch(void *ignore, bool preempt,
+ struct task_struct *prev, struct task_struct *next,
+ unsigned int prev_state)
+{
+ /*
+ * The buffer can only be corrupted by another user space task.
+ * Ignore kernel tasks that may be scheduled in order to process
+ * the faulting memory.
+ */
+ if (next->flags & (PF_KTHREAD | PF_USER_WORKER))
+ return;
+
+ this_cpu_inc(sched_switch_cnt);
+}
+
+static int syscall_fault_buffer_enable(void)
+{
+ struct syscall_buf_info *sinfo;
+ char *buf;
+ int cpu;
+ int ret;
+
+ lockdep_assert_held(&syscall_trace_lock);
+
+ if (syscall_fault_buffer_cnt++)
+ return 0;
+
+ sinfo = kmalloc(sizeof(sinfo), GFP_KERNEL);
+ if (!sinfo)
+ return -ENOMEM;
+
+ sinfo->sbuf = alloc_percpu(struct syscall_buf);
+ if (!sinfo->sbuf) {
+ kfree(sinfo);
+ return -ENOMEM;
+ }
+
+ /* Clear each buffer in case of error */
+ for_each_possible_cpu(cpu) {
+ per_cpu_ptr(sinfo->sbuf, cpu)->buf = NULL;
+ }
+
+ for_each_possible_cpu(cpu) {
+ buf = kmalloc_node(SYSCALL_FAULT_BUF_SZ, GFP_KERNEL,
+ cpu_to_node(cpu));
+ if (!buf) {
+ syscall_fault_buffer_free(sinfo);
+ return -ENOMEM;
+ }
+ per_cpu_ptr(sinfo->sbuf, cpu)->buf = buf;
+ }
+
+ ret = register_trace_sched_switch(probe_sched_switch, NULL);
+ if (ret < 0) {
+ syscall_fault_buffer_free(sinfo);
+ return ret;
+ }
+ WRITE_ONCE(syscall_buffer, sinfo);
+ return 0;
+}
+
+static void syscall_fault_buffer_disable(void)
+{
+ struct syscall_buf_info *sinfo = syscall_buffer;
+
+ lockdep_assert_held(&syscall_trace_lock);
+
+ if (--syscall_fault_buffer_cnt)
+ return;
+
+ WRITE_ONCE(syscall_buffer, NULL);
+
+ unregister_trace_sched_switch(probe_sched_switch, NULL);
+ call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
+}
+
+static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
+ unsigned long *args, unsigned int *data_size)
+{
+ char *buf = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
+ unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+ unsigned long mask = sys_data->user_mask;
+ unsigned int cnt;
+ int idx = ffs(mask) - 1;
+ char *ptr;
+ int trys = 0;
+ int ret;
+
+ /* Get the pointer to user space memory to read */
+ ptr = (char *)args[idx];
+ *data_size = 0;
+
+ again:
+ /*
+ * If this task is preempted by another user space task, it
+ * will cause this task to try again. But just in case something
+ * changes where the copying from user space causes another task
+ * to run, prevent this from going into an infinite loop.
+ * 10 tries should be plenty.
+ */
+ if (trys++ > 10) {
+ static bool once;
+ /*
+ * Only print a message instead of a WARN_ON() as this could
+ * theoretically trigger under real load.
+ */
+ if (!once)
+ pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
+ once = true;
+ return buf;
+ }
+
+ /* Read the current sched switch count */
+ cnt = this_cpu_read(sched_switch_cnt);
+
+ /*
+ * Preemption is going to be enabled, but this task must
+ * remain on this CPU.
+ */
+ migrate_disable();
+
+ /*
+ * Now preemption is being enabed and another task can come in
+ * and use the same buffer and corrupt our data.
+ */
+ preempt_enable_notrace();
+
+ ret = strncpy_from_user(buf, ptr, size);
+
+ preempt_disable_notrace();
+ migrate_enable();
+
+ /* If it faulted, no use to try again */
+ if (ret < 0)
+ return buf;
+
+ /*
+ * Preemption is disabled again, now check the sched_switch_cnt.
+ * If it increased by two or more, then another user space process
+ * may have schedule in and corrupted our buffer. In that case
+ * the copying must be retried.
+ *
+ * Note, if this task was scheduled out and only kernel threads
+ * were scheduled in (maybe to process the fault), then the
+ * counter would increment again when this task scheduled in.
+ * If this task scheduled out and another user task scheduled
+ * in, this task would still need to be scheduled back in and
+ * the counter would increment by at least two.
+ */
+ if (this_cpu_read(sched_switch_cnt) > cnt + 1)
+ goto again;
+
+ /* Replace any non-printable characters with '.' */
+ for (int i = 0; i < ret; i++) {
+ if (!isprint(buf[i]))
+ buf[i] = '.';
+ }
+
+ /*
+ * If the text was truncated due to our max limit, add "..." to
+ * the string.
+ */
+ if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+ strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+ EXTRA, sizeof(EXTRA));
+ ret = SYSCALL_FAULT_BUF_SZ;
+ } else {
+ buf[ret++] = '\0';
+ }
+
+ *data_size = ret;
+ return buf;
+}
+
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
struct trace_array *tr = data;
@@ -302,15 +578,17 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
struct syscall_metadata *sys_data;
struct trace_event_buffer fbuffer;
unsigned long args[6];
+ char *user_ptr;
+ int user_size = 0;
int syscall_nr;
- int size;
+ int size = 0;
+ bool mayfault;
/*
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
might_fault();
- guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
@@ -327,7 +605,32 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (!sys_data)
return;
- size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+ /* Check if this syscall event faults in user space memory */
+ mayfault = sys_data->user_mask != 0;
+
+ guard(preempt_notrace)();
+
+ syscall_get_arguments(current, regs, args);
+
+ if (mayfault) {
+ struct syscall_buf_info *sinfo;
+
+ /* If the syscall_buffer is NULL, tracing is being shutdown */
+ sinfo = READ_ONCE(syscall_buffer);
+ if (!sinfo)
+ return;
+
+ user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+ /*
+ * user_size is the amount of data to append.
+ * Need to add 4 for the meta field that points to
+ * the user memory at the end of the event and also
+ * stores its size.
+ */
+ size = 4 + user_size;
+ }
+
+ size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
@@ -335,9 +638,36 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
entry = ring_buffer_event_data(fbuffer.event);
entry->nr = syscall_nr;
- syscall_get_arguments(current, regs, args);
+
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
+ if (mayfault) {
+ void *ptr;
+ int val;
+
+ /*
+ * Set the pointer to point to the meta data of the event
+ * that has information about the stored user space memory.
+ */
+ ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
+
+ /*
+ * The meta data will store the offset of the user data from
+ * the beginning of the event.
+ */
+ val = (ptr - (void *)entry) + 4;
+
+ /* Store the offset and the size into the meta data */
+ *(int *)ptr = val | (user_size << 16);
+
+ /* Nothing to do if the user space was empty or faulted */
+ if (user_size) {
+ /* Now store the user space data into the event */
+ ptr += 4;
+ memcpy(ptr, user_ptr, user_size);
+ }
+ }
+
trace_event_buffer_commit(&fbuffer);
}
@@ -386,39 +716,50 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
static int reg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
+ struct syscall_metadata *sys_data = call->data;
struct trace_array *tr = file->tr;
int ret = 0;
int num;
- num = ((struct syscall_metadata *)call->data)->syscall_nr;
+ num = sys_data->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return -ENOSYS;
- mutex_lock(&syscall_trace_lock);
- if (!tr->sys_refcount_enter)
+ guard(mutex)(&syscall_trace_lock);
+ if (sys_data->user_mask) {
+ ret = syscall_fault_buffer_enable();
+ if (ret)
+ return ret;
+ }
+ if (!tr->sys_refcount_enter) {
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
- if (!ret) {
- WRITE_ONCE(tr->enter_syscall_files[num], file);
- tr->sys_refcount_enter++;
+ if (ret < 0) {
+ if (sys_data->user_mask)
+ syscall_fault_buffer_disable();
+ return ret;
+ }
}
- mutex_unlock(&syscall_trace_lock);
- return ret;
+ WRITE_ONCE(tr->enter_syscall_files[num], file);
+ tr->sys_refcount_enter++;
+ return 0;
}
static void unreg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
+ struct syscall_metadata *sys_data = call->data;
struct trace_array *tr = file->tr;
int num;
- num = ((struct syscall_metadata *)call->data)->syscall_nr;
+ num = sys_data->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return;
- mutex_lock(&syscall_trace_lock);
+ guard(mutex)(&syscall_trace_lock);
tr->sys_refcount_enter--;
WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
- mutex_unlock(&syscall_trace_lock);
+ if (sys_data->user_mask)
+ syscall_fault_buffer_disable();
}
static int reg_event_syscall_exit(struct trace_event_file *file,
@@ -459,6 +800,121 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
mutex_unlock(&syscall_trace_lock);
}
+/*
+ * For system calls that reference user space memory that can
+ * be recorded into the event, set the system call meta data's user_mask
+ * to the "args" index that points to the user space memory to retrieve.
+ */
+static void check_faultable_syscall(struct trace_event_call *call, int nr)
+{
+ struct syscall_metadata *sys_data = call->data;
+
+ /* Only work on entry */
+ if (sys_data->enter_event != call)
+ return;
+
+ switch (nr) {
+ /* user arg at position 0 */
+ case __NR_access:
+ case __NR_acct:
+ case __NR_add_key: /* Just _type. TODO add _description */
+ case __NR_chdir:
+ case __NR_chown:
+ case __NR_chmod:
+ case __NR_chroot:
+ case __NR_creat:
+ case __NR_delete_module:
+ case __NR_execve:
+ case __NR_fsopen:
+ case __NR_getxattr: /* Just pathname, TODO add name */
+ case __NR_lchown:
+ case __NR_lgetxattr: /* Just pathname, TODO add name */
+ case __NR_lremovexattr: /* Just pathname, TODO add name */
+ case __NR_link: /* Just oldname. TODO add newname */
+ case __NR_listxattr: /* Just pathname, TODO add list */
+ case __NR_llistxattr: /* Just pathname, TODO add list */
+ case __NR_lsetxattr: /* Just pathname, TODO add list */
+ case __NR_open:
+ case __NR_memfd_create:
+ case __NR_mount: /* Just dev_name, TODO add dir_name and type */
+ case __NR_mkdir:
+ case __NR_mknod:
+ case __NR_mq_open:
+ case __NR_mq_unlink:
+ case __NR_pivot_root: /* Just new_root, TODO add old_root */
+ case __NR_readlink:
+ case __NR_removexattr: /* Just pathname, TODO add name */
+ case __NR_rename: /* Just oldname. TODO add newname */
+ case __NR_request_key: /* Just _type. TODO add _description */
+ case __NR_rmdir:
+ case __NR_setxattr: /* Just pathname, TODO add list */
+ case __NR_shmdt:
+ case __NR_statfs:
+ case __NR_swapon:
+ case __NR_swapoff:
+ case __NR_symlink: /* Just oldname. TODO add newname */
+ case __NR_truncate:
+ case __NR_unlink:
+ case __NR_umount2:
+ case __NR_utime:
+ case __NR_utimes:
+ sys_data->user_mask = BIT(0);
+ break;
+ /* user arg at position 1 */
+ case __NR_execveat:
+ case __NR_faccessat:
+ case __NR_faccessat2:
+ case __NR_finit_module:
+ case __NR_fchmodat:
+ case __NR_fchmodat2:
+ case __NR_fchownat:
+ case __NR_fgetxattr:
+ case __NR_flistxattr:
+ case __NR_fsetxattr:
+ case __NR_fspick:
+ case __NR_fremovexattr:
+ case __NR_futimesat:
+ case __NR_getxattrat: /* Just pathname, TODO add name */
+ case __NR_inotify_add_watch:
+ case __NR_linkat: /* Just oldname. TODO add newname */
+ case __NR_listxattrat: /* Just pathname, TODO add list */
+ case __NR_mkdirat:
+ case __NR_mknodat:
+ case __NR_mount_setattr:
+ case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
+ case __NR_name_to_handle_at:
+ case __NR_newfstatat:
+ case __NR_openat:
+ case __NR_openat2:
+ case __NR_open_tree:
+ case __NR_open_tree_attr:
+ case __NR_readlinkat:
+ case __NR_renameat: /* Just oldname. TODO add newname */
+ case __NR_renameat2: /* Just oldname. TODO add newname */
+ case __NR_removexattrat: /* Just pathname, TODO add name */
+ case __NR_quotactl:
+ case __NR_setxattrat: /* Just pathname, TODO add list */
+ case __NR_syslog:
+ case __NR_symlinkat: /* Just oldname. TODO add newname */
+ case __NR_statx:
+ case __NR_unlinkat:
+ case __NR_utimensat:
+ sys_data->user_mask = BIT(1);
+ break;
+ /* user arg at position 2 */
+ case __NR_init_module:
+ case __NR_fsconfig:
+ sys_data->user_mask = BIT(2);
+ break;
+ /* user arg at position 4 */
+ case __NR_fanotify_mark:
+ sys_data->user_mask = BIT(4);
+ break;
+ default:
+ sys_data->user_mask = 0;
+ }
+}
+
static int __init init_syscall_trace(struct trace_event_call *call)
{
int id;
@@ -471,6 +927,8 @@ static int __init init_syscall_trace(struct trace_event_call *call)
return -ENOSYS;
}
+ check_faultable_syscall(call, num);
+
if (set_syscall_print_fmt(call) < 0)
return -ENOMEM;
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 4/7] tracing: Have system call events record user array data
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (2 preceding siblings ...)
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
` (2 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
For system call events that have a length field, add a "user_arg_size"
parameter to the system call meta data that denotes the index of the args
array that holds the size of arg that the user_mask field has a bit set
for.
The "user_mask" has a bit set that denotes the arg that points to an array
in the user space address space and if a system call event has the
user_mask field set and the user_arg_size set, it will then record the
content of that address into the trace event, up to the size defined by
SYSCALL_FAULT_BUF_SZ - 1.
This allows the output to look like:
sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/trace/syscall.h | 4 +-
kernel/trace/trace_syscalls.c | 112 ++++++++++++++++++++++++++--------
2 files changed, 88 insertions(+), 28 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 85f21ca15a41..9413c139da66 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_arg_size: holds @arg that has size of the user space to read
* @user_mask: mask of @args that will read user space
* @types: list of types as strings
* @args: list of args as strings (args[i] matches types[i])
@@ -26,7 +27,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- short nb_args;
+ u8 nb_args;
+ s8 user_arg_size;
short user_mask;
const char **types;
const char **args;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 3233319ce266..b0a587f2e4b5 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -125,7 +125,7 @@ const char *get_syscall_name(int syscall)
return entry->name;
}
-/* Added to user strings when max limit is reached */
+/* Added to user strings or arrays when max limit is reached */
#define EXTRA "..."
static enum print_line_t
@@ -137,7 +137,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_entry *ent = iter->ent;
struct syscall_trace_enter *trace;
struct syscall_metadata *entry;
- int i, syscall, val;
+ int i, syscall, val, len;
unsigned char *ptr;
trace = (typeof(trace))ent;
@@ -183,7 +183,25 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
ptr = (void *)ent + (val & 0xffff);
- trace_seq_printf(s, " \"%s\"", ptr);
+ if (entry->user_arg_size < 0) {
+ trace_seq_printf(s, " \"%s\"", ptr);
+ continue;
+ }
+
+ len = val >> 16;
+
+ val = trace->args[entry->user_arg_size];
+
+ trace_seq_puts(s, " (");
+ for (int x = 0; x < len; x++, ptr++) {
+ if (x)
+ trace_seq_putc(s, ':');
+ trace_seq_printf(s, "%02x", *ptr);
+ }
+ if (len < val)
+ trace_seq_printf(s, ", %s", EXTRA);
+
+ trace_seq_putc(s, ')');
}
trace_seq_putc(s, ')');
@@ -248,8 +266,11 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
if (!(BIT(i) & entry->user_mask))
continue;
- /* Add the format for the user space string */
- pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+ /* Add the format for the user space string or array */
+ if (entry->user_arg_size < 0)
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+ else
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
}
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
@@ -258,9 +279,14 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
", ((unsigned long)(REC->%s))", entry->args[i]);
if (!(BIT(i) & entry->user_mask))
continue;
- /* The user space string for arg has name __<arg>_val */
- pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
- entry->args[i]);
+ /* The user space data for arg has name __<arg>_val */
+ if (entry->user_arg_size < 0) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+ entry->args[i]);
+ } else {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __print_dynamic_array(__%s_val, 1)",
+ entry->args[i]);
+ }
}
#undef LEN_OR_ZERO
@@ -331,9 +357,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
idx = ffs(mask) - 1;
/*
- * User space strings are faulted into a temporary buffer and then
- * added as a dynamic string to the end of the event.
- * The user space string name for the arg pointer is "__<arg>_val".
+ * User space data is faulted into a temporary buffer and then
+ * added as a dynamic string or array to the end of the event.
+ * The user space data name for the arg pointer is "__<arg>_val".
*/
len = strlen(meta->args[idx]) + sizeof("___val");
arg = kmalloc(len, GFP_KERNEL);
@@ -480,6 +506,7 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
unsigned long mask = sys_data->user_mask;
unsigned int cnt;
int idx = ffs(mask) - 1;
+ bool array = false;
char *ptr;
int trys = 0;
int ret;
@@ -511,6 +538,18 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
/* Read the current sched switch count */
cnt = this_cpu_read(sched_switch_cnt);
+ /*
+ * If this system call event has a size argument, use
+ * it to define how much of user space memory to read,
+ * and read it as an array and not a string.
+ */
+ if (sys_data->user_arg_size >= 0) {
+ array = true;
+ size = args[sys_data->user_arg_size];
+ if (size > SYSCALL_FAULT_BUF_SZ - 1)
+ size = SYSCALL_FAULT_BUF_SZ - 1;
+ }
+
/*
* Preemption is going to be enabled, but this task must
* remain on this CPU.
@@ -523,7 +562,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
*/
preempt_enable_notrace();
- ret = strncpy_from_user(buf, ptr, size);
+ if (array) {
+ ret = __copy_from_user(buf, ptr, size);
+ ret = ret ? -1 : size;
+ } else {
+ ret = strncpy_from_user(buf, ptr, size);
+ }
preempt_disable_notrace();
migrate_enable();
@@ -548,22 +592,24 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (this_cpu_read(sched_switch_cnt) > cnt + 1)
goto again;
- /* Replace any non-printable characters with '.' */
- for (int i = 0; i < ret; i++) {
- if (!isprint(buf[i]))
- buf[i] = '.';
- }
+ /* For strings, replace any non-printable characters with '.' */
+ if (!array) {
+ for (int i = 0; i < ret; i++) {
+ if (!isprint(buf[i]))
+ buf[i] = '.';
+ }
- /*
- * If the text was truncated due to our max limit, add "..." to
- * the string.
- */
- if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
- strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
- EXTRA, sizeof(EXTRA));
- ret = SYSCALL_FAULT_BUF_SZ;
- } else {
- buf[ret++] = '\0';
+ /*
+ * If the text was truncated due to our max limit, add "..." to
+ * the string.
+ */
+ if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+ strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+ EXTRA, sizeof(EXTRA));
+ ret = SYSCALL_FAULT_BUF_SZ;
+ } else {
+ buf[ret++] = '\0';
+ }
}
*data_size = ret;
@@ -660,6 +706,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
/* Store the offset and the size into the meta data */
*(int *)ptr = val | (user_size << 16);
+ if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
+ user_size = 0;
+
/* Nothing to do if the user space was empty or faulted */
if (user_size) {
/* Now store the user space data into the event */
@@ -813,7 +862,16 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
if (sys_data->enter_event != call)
return;
+ sys_data->user_arg_size = -1;
+
switch (nr) {
+ /* user arg 1 with size arg at 2 */
+ case __NR_write:
+ case __NR_mq_timedsend:
+ case __NR_pwrite64:
+ sys_data->user_mask = BIT(1);
+ sys_data->user_arg_size = 2;
+ break;
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 5/7] tracing: Display some syscall arrays as strings
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (3 preceding siblings ...)
2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 15:12 ` kernel test robot
2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Some of the system calls that read a fixed length of memory from the user
space address are not arrays but strings. Take a bit away from the nb_args
field in the syscall meta data to use as a flag to denote that the system
call's user_arg_size is being used as a string. The nb_args should never
be more than 6, so 7 bits is plenty to hold that number. When the
user_arg_is_str flag that, when set, will display the data array from the
user space address as a string and not an array.
This will allow the output to look like this:
sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/trace/syscall.h | 4 +++-
kernel/trace/trace_syscalls.c | 25 +++++++++++++++++++------
2 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 9413c139da66..0dd7f2b33431 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_arg_is_str: set if the arg for @user_arg_size is a string
* @user_arg_size: holds @arg that has size of the user space to read
* @user_mask: mask of @args that will read user space
* @types: list of types as strings
@@ -27,7 +28,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- u8 nb_args;
+ u8 nb_args:7;
+ u8 user_arg_is_str:1;
s8 user_arg_size;
short user_mask;
const char **types;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b0a587f2e4b5..8c0142eea898 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -182,14 +182,13 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
val = *(int *)ptr;
ptr = (void *)ent + (val & 0xffff);
+ len = val >> 16;
- if (entry->user_arg_size < 0) {
- trace_seq_printf(s, " \"%s\"", ptr);
+ if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
+ trace_seq_printf(s, " \"%.*s\"", len, ptr);
continue;
}
- len = val >> 16;
-
val = trace->args[entry->user_arg_size];
trace_seq_puts(s, " (");
@@ -250,6 +249,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
static int __init
__set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
{
+ bool is_string = entry->user_arg_is_str;
int i;
int pos = 0;
@@ -267,7 +267,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
continue;
/* Add the format for the user space string or array */
- if (entry->user_arg_size < 0)
+ if (entry->user_arg_size < 0 || is_string)
pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
else
pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
@@ -280,7 +280,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
if (!(BIT(i) & entry->user_mask))
continue;
/* The user space data for arg has name __<arg>_val */
- if (entry->user_arg_size < 0) {
+ if (entry->user_arg_size < 0 || is_string) {
pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
entry->args[i]);
} else {
@@ -872,6 +872,19 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
sys_data->user_mask = BIT(1);
sys_data->user_arg_size = 2;
break;
+ /* user arg 0 with size arg at 1 as string */
+ case __NR_setdomainname:
+ case __NR_sethostname:
+ sys_data->user_mask = BIT(0);
+ sys_data->user_arg_size = 1;
+ sys_data->user_arg_is_str = 1;
+ break;
+ /* user arg 4 with size arg at 3 as string */
+ case __NR_kexec_file_load:
+ sys_data->user_mask = BIT(4);
+ sys_data->user_arg_size = 3;
+ sys_data->user_arg_is_str = 1;
+ break;
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (4 preceding siblings ...)
2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 23:52 ` kernel test robot
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
6 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Allow more than one field of a syscall trace event to read user space.
Build on top of the user_mask by allowing more than one bit to be set that
corresponds to the @args array of the syscall metadata. For each argument
in the @args array that is to be read, it will have a dynamic array/string
field associated to it.
Note that multiple fields to be read from user space is not supported if
the user_arg_size field is set in the syscall metada. That field can only
be used if only one field is being read from user space as that field is a
number representing the size field of the syscall event that holds the
size of the data to read from user space. It becomes ambiguous if the
system call reads more than one field. Currently this is not an issue.
If a syscall event happens to enable two events to read user space and
sets the user_arg_size field, it will trigger a warning at boot and the
user_arg_size field will be cleared.
The per CPU buffer that is used to read the user space addresses is now
broken up into 3 sections, each of 168 bytes. The reason for 168 is that
it is the biggest portion of 512 bytes divided by 3 that is 8 byte aligned.
The max amount copied into the ring buffer from user space is now only 128
bytes, which is plenty. When reading user space, it still reads 167
(168-1) bytes and uses the remaining to know if it should append the extra
"..." to the end or not.
This will allow the event to look like this:
sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 312 ++++++++++++++++++++++------------
1 file changed, 207 insertions(+), 105 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 8c0142eea898..b39fa9dd1067 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -139,6 +139,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct syscall_metadata *entry;
int i, syscall, val, len;
unsigned char *ptr;
+ int offset = 0;
trace = (typeof(trace))ent;
syscall = trace->nr;
@@ -178,11 +179,12 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
continue;
/* This arg points to a user space string */
- ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+ ptr = (void *)trace->args + sizeof(long) * entry->nb_args + offset;
val = *(int *)ptr;
ptr = (void *)ent + (val & 0xffff);
len = val >> 16;
+ offset += 4;
if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
trace_seq_printf(s, " \"%.*s\"", len, ptr);
@@ -335,7 +337,6 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
unsigned long mask;
char *arg;
int offset = offsetof(typeof(trace), args);
- int idx;
int ret = 0;
int len;
int i;
@@ -354,27 +355,35 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
return ret;
mask = meta->user_mask;
- idx = ffs(mask) - 1;
- /*
- * User space data is faulted into a temporary buffer and then
- * added as a dynamic string or array to the end of the event.
- * The user space data name for the arg pointer is "__<arg>_val".
- */
- len = strlen(meta->args[idx]) + sizeof("___val");
- arg = kmalloc(len, GFP_KERNEL);
- if (WARN_ON_ONCE(!arg)) {
- meta->user_mask = 0;
- return -ENOMEM;
- }
+ while (mask) {
+ int idx = ffs(mask) - 1;
+ mask &= ~BIT(idx);
+
+ /*
+ * User space data is faulted into a temporary buffer and then
+ * added as a dynamic string or array to the end of the event.
+ * The user space data name for the arg pointer is
+ * "__<arg>_val".
+ */
+ len = strlen(meta->args[idx]) + sizeof("___val");
+ arg = kmalloc(len, GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg)) {
+ meta->user_mask = 0;
+ return -ENOMEM;
+ }
- snprintf(arg, len, "__%s_val", meta->args[idx]);
+ snprintf(arg, len, "__%s_val", meta->args[idx]);
- ret = trace_define_field(call, "__data_loc char[]",
- arg, offset, sizeof(int), 0,
- FILTER_OTHER);
- if (ret)
- kfree(arg);
+ ret = trace_define_field(call, "__data_loc char[]",
+ arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret) {
+ kfree(arg);
+ break;
+ }
+ offset += 4;
+ }
return ret;
}
@@ -387,8 +396,25 @@ struct syscall_buf_info {
struct syscall_buf __percpu *sbuf;
};
-/* Create a per CPU temporary buffer to copy user space pointers into */
+/*
+ * Create a per CPU temporary buffer to copy user space pointers into.
+ *
+ * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use
+ * to copy memory from user space addresses into.
+ *
+ * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space.
+ *
+ * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
+ * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
+ * needs to append the EXTRA or not.
+ *
+ * This only allows up to 3 args from system calls.
+ */
#define SYSCALL_FAULT_BUF_SZ 512
+#define SYSCALL_FAULT_ARG_SZ 168
+#define SYSCALL_FAULT_USER_MAX 128
+#define SYSCALL_FAULT_MAX_CNT 3
+
static struct syscall_buf_info *syscall_buffer;
static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
@@ -498,22 +524,57 @@ static void syscall_fault_buffer_disable(void)
call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
}
-static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
- unsigned long *args, unsigned int *data_size)
+static char *sys_fault_user(struct syscall_metadata *sys_data,
+ struct syscall_buf_info *sinfo,
+ unsigned long *args,
+ unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
{
- char *buf = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
- unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+ char *buffer = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
unsigned long mask = sys_data->user_mask;
+ unsigned long size = SYSCALL_FAULT_ARG_SZ - 1;
unsigned int cnt;
- int idx = ffs(mask) - 1;
bool array = false;
- char *ptr;
+ char *ptr_array[SYSCALL_FAULT_MAX_CNT];
+ char *buf;
+ int read[SYSCALL_FAULT_MAX_CNT];
int trys = 0;
+ int uargs;
int ret;
+ int i = 0;
+
+ /* The extra is appended to the user data in the buffer */
+ BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=
+ SYSCALL_FAULT_ARG_SZ);
+
+ /*
+ * If this system call event has a size argument, use
+ * it to define how much of user space memory to read,
+ * and read it as an array and not a string.
+ */
+ if (sys_data->user_arg_size >= 0) {
+ array = true;
+ size = args[sys_data->user_arg_size];
+ if (size > SYSCALL_FAULT_ARG_SZ - 1)
+ size = SYSCALL_FAULT_ARG_SZ - 1;
+ }
+
+ while (mask) {
+ int idx = ffs(mask) - 1;
+ mask &= ~BIT(idx);
+
+ if (WARN_ON_ONCE(i == SYSCALL_FAULT_MAX_CNT))
+ break;
+
+ /* Get the pointer to user space memory to read */
+ ptr_array[i++] = (char *)args[idx];
+ }
- /* Get the pointer to user space memory to read */
- ptr = (char *)args[idx];
- *data_size = 0;
+ uargs = i;
+
+ /* Clear the values that are not used */
+ for (; i < SYSCALL_FAULT_MAX_CNT; i++) {
+ data_size[i] = -1; /* Denotes no pointer */
+ }
again:
/*
@@ -532,24 +593,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (!once)
pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
once = true;
- return buf;
+ return buffer;
}
/* Read the current sched switch count */
cnt = this_cpu_read(sched_switch_cnt);
- /*
- * If this system call event has a size argument, use
- * it to define how much of user space memory to read,
- * and read it as an array and not a string.
- */
- if (sys_data->user_arg_size >= 0) {
- array = true;
- size = args[sys_data->user_arg_size];
- if (size > SYSCALL_FAULT_BUF_SZ - 1)
- size = SYSCALL_FAULT_BUF_SZ - 1;
- }
-
/*
* Preemption is going to be enabled, but this task must
* remain on this CPU.
@@ -562,20 +611,23 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
*/
preempt_enable_notrace();
- if (array) {
- ret = __copy_from_user(buf, ptr, size);
- ret = ret ? -1 : size;
- } else {
- ret = strncpy_from_user(buf, ptr, size);
+ buf = buffer;
+
+ for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+ char *ptr = ptr_array[i];
+
+ if (array) {
+ ret = __copy_from_user(buf, ptr, size);
+ ret = ret ? -1 : size;
+ } else {
+ ret = strncpy_from_user(buf, ptr, size);
+ }
+ read[i] = ret;
}
preempt_disable_notrace();
migrate_enable();
- /* If it faulted, no use to try again */
- if (ret < 0)
- return buf;
-
/*
* Preemption is disabled again, now check the sched_switch_cnt.
* If it increased by two or more, then another user space process
@@ -592,28 +644,39 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (this_cpu_read(sched_switch_cnt) > cnt + 1)
goto again;
- /* For strings, replace any non-printable characters with '.' */
- if (!array) {
- for (int i = 0; i < ret; i++) {
- if (!isprint(buf[i]))
- buf[i] = '.';
- }
+ buf = buffer;
+ for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
- /*
- * If the text was truncated due to our max limit, add "..." to
- * the string.
- */
- if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
- strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
- EXTRA, sizeof(EXTRA));
- ret = SYSCALL_FAULT_BUF_SZ;
+ ret = read[i];
+ if (ret < 0)
+ continue;
+ buf[ret] = '\0';
+
+ /* For strings, replace any non-printable characters with '.' */
+ if (!array) {
+ for (int x = 0; x < ret; x++) {
+ if (!isprint(buf[x]))
+ buf[x] = '.';
+ }
+
+ /*
+ * If the text was truncated due to our max limit,
+ * add "..." to the string.
+ */
+ if (ret > SYSCALL_FAULT_USER_MAX) {
+ strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
+ sizeof(EXTRA));
+ ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+ } else {
+ buf[ret++] = '\0';
+ }
} else {
- buf[ret++] = '\0';
+ ret = min(ret, SYSCALL_FAULT_USER_MAX);
}
+ data_size[i] = ret;
}
- *data_size = ret;
- return buf;
+ return buffer;
}
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
@@ -625,9 +688,10 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
struct trace_event_buffer fbuffer;
unsigned long args[6];
char *user_ptr;
- int user_size = 0;
+ int user_sizes[SYSCALL_FAULT_MAX_CNT] = {};
int syscall_nr;
int size = 0;
+ int uargs = 0;
bool mayfault;
/*
@@ -660,20 +724,27 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (mayfault) {
struct syscall_buf_info *sinfo;
+ int i;
/* If the syscall_buffer is NULL, tracing is being shutdown */
sinfo = READ_ONCE(syscall_buffer);
if (!sinfo)
return;
- user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+ user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
/*
* user_size is the amount of data to append.
* Need to add 4 for the meta field that points to
* the user memory at the end of the event and also
* stores its size.
*/
- size = 4 + user_size;
+ for (i = 0; i < SYSCALL_FAULT_MAX_CNT; i++) {
+ if (user_sizes[i] < 0)
+ break;
+ size += user_sizes[i] + 4;
+ }
+ /* Save the number of user read arguments of this syscall */
+ uargs = i;
}
size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
@@ -688,6 +759,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
if (mayfault) {
+ char *buf = user_ptr;
void *ptr;
int val;
@@ -699,21 +771,30 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
/*
* The meta data will store the offset of the user data from
- * the beginning of the event.
+ * the beginning of the event. That is after the static arguments
+ * and the meta data fields.
*/
- val = (ptr - (void *)entry) + 4;
+ val = (ptr - (void *)entry) + 4 * uargs;
+
+ for (int i = 0; i < uargs; i++) {
- /* Store the offset and the size into the meta data */
- *(int *)ptr = val | (user_size << 16);
+ if (i)
+ val += user_sizes[i - 1];
- if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
- user_size = 0;
+ /* Store the offset and the size into the meta data */
+ *(int *)ptr = val | (user_sizes[i] << 16);
- /* Nothing to do if the user space was empty or faulted */
- if (user_size) {
- /* Now store the user space data into the event */
+ /* Skip the meta data */
ptr += 4;
- memcpy(ptr, user_ptr, user_size);
+ }
+
+ for (int i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+ /* Nothing to do if the user space was empty or faulted */
+ if (!user_sizes[i])
+ continue;
+
+ memcpy(ptr, buf, user_sizes[i]);
+ ptr += user_sizes[i];
}
}
@@ -857,6 +938,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
static void check_faultable_syscall(struct trace_event_call *call, int nr)
{
struct syscall_metadata *sys_data = call->data;
+ unsigned long mask;
/* Only work on entry */
if (sys_data->enter_event != call)
@@ -888,7 +970,6 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
- case __NR_add_key: /* Just _type. TODO add _description */
case __NR_chdir:
case __NR_chown:
case __NR_chmod:
@@ -897,28 +978,15 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_delete_module:
case __NR_execve:
case __NR_fsopen:
- case __NR_getxattr: /* Just pathname, TODO add name */
case __NR_lchown:
- case __NR_lgetxattr: /* Just pathname, TODO add name */
- case __NR_lremovexattr: /* Just pathname, TODO add name */
- case __NR_link: /* Just oldname. TODO add newname */
- case __NR_listxattr: /* Just pathname, TODO add list */
- case __NR_llistxattr: /* Just pathname, TODO add list */
- case __NR_lsetxattr: /* Just pathname, TODO add list */
case __NR_open:
case __NR_memfd_create:
- case __NR_mount: /* Just dev_name, TODO add dir_name and type */
case __NR_mkdir:
case __NR_mknod:
case __NR_mq_open:
case __NR_mq_unlink:
- case __NR_pivot_root: /* Just new_root, TODO add old_root */
case __NR_readlink:
- case __NR_removexattr: /* Just pathname, TODO add name */
- case __NR_rename: /* Just oldname. TODO add newname */
- case __NR_request_key: /* Just _type. TODO add _description */
case __NR_rmdir:
- case __NR_setxattr: /* Just pathname, TODO add list */
case __NR_shmdt:
case __NR_statfs:
case __NR_swapon:
@@ -945,14 +1013,10 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_fspick:
case __NR_fremovexattr:
case __NR_futimesat:
- case __NR_getxattrat: /* Just pathname, TODO add name */
case __NR_inotify_add_watch:
- case __NR_linkat: /* Just oldname. TODO add newname */
- case __NR_listxattrat: /* Just pathname, TODO add list */
case __NR_mkdirat:
case __NR_mknodat:
case __NR_mount_setattr:
- case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
case __NR_name_to_handle_at:
case __NR_newfstatat:
case __NR_openat:
@@ -960,13 +1024,8 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_open_tree:
case __NR_open_tree_attr:
case __NR_readlinkat:
- case __NR_renameat: /* Just oldname. TODO add newname */
- case __NR_renameat2: /* Just oldname. TODO add newname */
- case __NR_removexattrat: /* Just pathname, TODO add name */
case __NR_quotactl:
- case __NR_setxattrat: /* Just pathname, TODO add list */
case __NR_syslog:
- case __NR_symlinkat: /* Just oldname. TODO add newname */
case __NR_statx:
case __NR_unlinkat:
case __NR_utimensat:
@@ -981,9 +1040,52 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_fanotify_mark:
sys_data->user_mask = BIT(4);
break;
+ /* 2 user args, 0 and 1 */
+ case __NR_add_key:
+ case __NR_getxattr:
+ case __NR_lgetxattr:
+ case __NR_lremovexattr:
+ case __NR_link:
+ case __NR_listxattr:
+ case __NR_llistxattr:
+ case __NR_lsetxattr:
+ case __NR_pivot_root:
+ case __NR_removexattr:
+ case __NR_rename:
+ case __NR_request_key:
+ case __NR_setxattr:
+ case __NR_symlinkat:
+ sys_data->user_mask = BIT(0) | BIT(1);
+ break;
+ /* 2 user args, 1 and 3 */
+ case __NR_getxattrat:
+ case __NR_linkat:
+ case __NR_listxattrat:
+ case __NR_move_mount:
+ case __NR_renameat:
+ case __NR_renameat2:
+ case __NR_removexattrat:
+ case __NR_setxattrat:
+ sys_data->user_mask = BIT(1) | BIT(3);
+ break;
+ case __NR_mount: /* Just dev_name and dir_name, TODO add type */
+ sys_data->user_mask = BIT(0) | BIT(1) | BIT(2);
+ break;
default:
sys_data->user_mask = 0;
+ return;
}
+
+ if (sys_data->user_arg_size < 0)
+ return;
+
+ /*
+ * The user_arg_size can only be used when the system call
+ * is reading only a single address from user space.
+ */
+ mask = sys_data->user_mask;
+ if (WARN_ON(mask & (mask - 1)))
+ sys_data->user_arg_size = -1;
}
static int __init init_syscall_trace(struct trace_event_call *call)
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (5 preceding siblings ...)
2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
@ 2025-08-05 19:26 ` Steven Rostedt
2025-08-06 10:50 ` Douglas Raillard
2025-08-06 14:50 ` kernel test robot
6 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-05 19:26 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
When a system call that reads user space addresses copy it to the ring
buffer, it can copy up to 511 bytes of data. This can waste precious ring
buffer space if the user isn't interested in the output. Add a new file
"syscall_user_buf_size" that gets initialized to a new config
CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Documentation/trace/ftrace.rst | 7 +++++
kernel/trace/Kconfig | 13 +++++++++
kernel/trace/trace.c | 52 ++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 3 ++
kernel/trace/trace_syscalls.c | 22 ++++++++------
5 files changed, 88 insertions(+), 9 deletions(-)
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index af66a05e18cc..4712bbfcfd08 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -366,6 +366,13 @@ of ftrace. Here is a list of some of the key files:
for each function. The displayed address is the patch-site address
and can differ from /proc/kallsyms address.
+ syscall_user_buf_size:
+
+ Some system call trace events will record the data from a user
+ space address that one of the parameters point to. The amount of
+ data per event is limited. This file holds the max number of bytes
+ that will be recorded into the ring buffer to hold this data.
+
dyn_ftrace_total_info:
This file is for debugging purposes. The number of functions that
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index f80298e6aa16..aa28d7ca3e31 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -574,6 +574,19 @@ config FTRACE_SYSCALLS
help
Basic tracer to catch the syscall entry and exit events.
+config TRACE_SYSCALL_BUF_SIZE_DEFAULT
+ int "System call user read max size"
+ range 0 128
+ default 128
+ depends on FTRACE_SYSCALLS
+ help
+ Some system call trace events will record the data from a user
+ space address that one of the parameters point to. The amount of
+ data per event is limited. It may be further limited by this
+ config and later changed by writing an ASCII number into:
+
+ /sys/kernel/tracing/syscall_user_buf_size
+
config TRACER_SNAPSHOT
bool "Create a snapshot trace buffer"
select TRACER_MAX_TRACE
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d0b1964648c1..1db708ed0625 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
goto out;
}
+static ssize_t
+tracing_syscall_buf_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct inode *inode = file_inode(filp);
+ struct trace_array *tr = inode->i_private;
+ char buf[64];
+ int r;
+
+ r = snprintf(buf, 64, "%d\n", tr->syscall_buf_sz);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+tracing_syscall_buf_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct inode *inode = file_inode(filp);
+ struct trace_array *tr = inode->i_private;
+ unsigned long val;
+ int ret;
+
+ ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+ if (ret)
+ return ret;
+
+ if (val > SYSCALL_FAULT_USER_MAX)
+ val = SYSCALL_FAULT_USER_MAX;
+
+ tr->syscall_buf_sz = val;
+
+ *ppos += cnt;
+
+ return cnt;
+}
+
static ssize_t
tracing_entries_read(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_fops = {
.release = tracing_release_generic_tr,
};
+static const struct file_operations tracing_syscall_buf_fops = {
+ .open = tracing_open_generic_tr,
+ .read = tracing_syscall_buf_read,
+ .write = tracing_syscall_buf_write,
+ .llseek = generic_file_llseek,
+ .release = tracing_release_generic_tr,
+};
+
static const struct file_operations tracing_buffer_meta_fops = {
.open = tracing_buffer_meta_open,
.read = seq_read,
@@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const char *systems,
raw_spin_lock_init(&tr->start_lock);
+ tr->syscall_buf_sz = global_trace.syscall_buf_sz;
+
tr->max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
#ifdef CONFIG_TRACER_MAX_TRACE
spin_lock_init(&tr->snapshot_trigger_lock);
@@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
tr, &buffer_subbuf_size_fops);
+ trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
+ tr, &tracing_syscall_buf_fops);
+
create_trace_options_dir(tr);
#ifdef CONFIG_TRACER_MAX_TRACE
@@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void)
global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
+ global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
+
INIT_LIST_HEAD(&global_trace.systems);
INIT_LIST_HEAD(&global_trace.events);
INIT_LIST_HEAD(&global_trace.hist_vars);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 1dbf1d3cf2f1..1b3e464619f0 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -131,6 +131,8 @@ enum trace_type {
#define HIST_STACKTRACE_SIZE (HIST_STACKTRACE_DEPTH * sizeof(unsigned long))
#define HIST_STACKTRACE_SKIP 5
+#define SYSCALL_FAULT_USER_MAX 128
+
/*
* syscalls are special, and need special handling, this is why
* they are not included in trace_entries.h
@@ -430,6 +432,7 @@ struct trace_array {
int function_enabled;
#endif
int no_filter_buffering_ref;
+ unsigned int syscall_buf_sz;
struct list_head hist_vars;
#ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot *cond_snapshot;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b39fa9dd1067..e9162165c4d2 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -407,17 +407,16 @@ struct syscall_buf_info {
* SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
* It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
* needs to append the EXTRA or not.
+ * (defined in kernel/trace/trace.h)
*
* This only allows up to 3 args from system calls.
*/
#define SYSCALL_FAULT_BUF_SZ 512
#define SYSCALL_FAULT_ARG_SZ 168
-#define SYSCALL_FAULT_USER_MAX 128
#define SYSCALL_FAULT_MAX_CNT 3
static struct syscall_buf_info *syscall_buffer;
static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
-
static int syscall_fault_buffer_cnt;
static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
@@ -524,7 +523,7 @@ static void syscall_fault_buffer_disable(void)
call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
}
-static char *sys_fault_user(struct syscall_metadata *sys_data,
+static char *sys_fault_user(struct trace_array *tr, struct syscall_metadata *sys_data,
struct syscall_buf_info *sinfo,
unsigned long *args,
unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
@@ -576,6 +575,10 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
data_size[i] = -1; /* Denotes no pointer */
}
+ /* A zero size means do not even try */
+ if (!tr->syscall_buf_sz)
+ return buffer;
+
again:
/*
* If this task is preempted by another user space task, it
@@ -659,19 +662,20 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
buf[x] = '.';
}
+ size = min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX);
+
/*
* If the text was truncated due to our max limit,
* add "..." to the string.
*/
- if (ret > SYSCALL_FAULT_USER_MAX) {
- strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
- sizeof(EXTRA));
- ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+ if (ret > size) {
+ strscpy(buf + size, EXTRA, sizeof(EXTRA));
+ ret = size + sizeof(EXTRA);
} else {
buf[ret++] = '\0';
}
} else {
- ret = min(ret, SYSCALL_FAULT_USER_MAX);
+ ret = min((unsigned int)ret, tr->syscall_buf_sz);
}
data_size[i] = ret;
}
@@ -731,7 +735,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (!sinfo)
return;
- user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
+ user_ptr = sys_fault_user(tr, sys_data, sinfo, args, user_sizes);
/*
* user_size is the amount of data to append.
* Need to add 4 for the meta field that points to
--
2.47.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
@ 2025-08-06 10:24 ` Douglas Raillard
2025-08-06 12:39 ` Steven Rostedt
0 siblings, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 10:24 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo
On 05-08-2025 20:26, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Currently the syscall trace events show each value as hexadecimal, but
> without adding "0x" it can be confusing:
>
> sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
>
> Looks like the above write wrote 44 bytes, when in reality it wrote 68
> bytes.
>
> Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
> For values less than 10, leave off the "0x" as that just adds noise to the
> output.
I'm on the fence for the value-dependent format. This looks like
it could easily make life harder for quick&dirty scripts, but then both
awk's strtonum() and Python's int(x, base=16) seem to handle
the presence/absence of the 0x prefix so maybe it's a non-issue.
OTH, a hand-crafted regex designed after a small set of input may start
to randomly fail if one field unexpectedly goes beyond 10 ...
Just using explicit hex may be the best here, as the actual proper fix
(type-level display hints) is harder. It could probably be implemented
using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.
>
> Also change the iterator to check if "i" is nonzero and print the ", "
> delimiter at the start, then adding the logic to the trace_seq_printf() at
> the end.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/trace/trace_syscalls.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 3a0b65f89130..0f932b22f9ec 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
> if (trace_seq_has_overflowed(s))
> goto end;
>
> + if (i)
> + trace_seq_puts(s, ", ");
> +
> /* parameter types */
> if (tr && tr->trace_flags & TRACE_ITER_VERBOSE)
> trace_seq_printf(s, "%s ", entry->types[i]);
>
> /* parameter values */
> - trace_seq_printf(s, "%s: %lx%s", entry->args[i],
> - trace->args[i],
> - i == entry->nb_args - 1 ? "" : ", ");
> + if (trace->args[i] < 10)
> + trace_seq_printf(s, "%s: %lu", entry->args[i],
> + trace->args[i]);
> + else
> + trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
> + trace->args[i]);
> }
>
> trace_seq_putc(s, ')');
--
Douglas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
@ 2025-08-06 10:50 ` Douglas Raillard
2025-08-06 12:43 ` Steven Rostedt
2025-08-06 14:50 ` kernel test robot
1 sibling, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 10:50 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo
On 05-08-2025 20:26, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> When a system call that reads user space addresses copy it to the ring
> buffer, it can copy up to 511 bytes of data. This can waste precious ring
> buffer space if the user isn't interested in the output. Add a new file
> "syscall_user_buf_size" that gets initialized to a new config
> CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.
Have you considered dynamically removing some event fields ? We routinely hit
the same problem with some of our events that have rarely-used large fields.
If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
that allowed selecting what field is needed that would be amazing. I had plans
to build something like that in our kernel module based on the synthetic events API,
but did not proceed as that API is not exported in a useful way.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Documentation/trace/ftrace.rst | 7 +++++
> kernel/trace/Kconfig | 13 +++++++++
> kernel/trace/trace.c | 52 ++++++++++++++++++++++++++++++++++
> kernel/trace/trace.h | 3 ++
> kernel/trace/trace_syscalls.c | 22 ++++++++------
> 5 files changed, 88 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
> index af66a05e18cc..4712bbfcfd08 100644
> --- a/Documentation/trace/ftrace.rst
> +++ b/Documentation/trace/ftrace.rst
> @@ -366,6 +366,13 @@ of ftrace. Here is a list of some of the key files:
> for each function. The displayed address is the patch-site address
> and can differ from /proc/kallsyms address.
>
> + syscall_user_buf_size:
> +
> + Some system call trace events will record the data from a user
> + space address that one of the parameters point to. The amount of
> + data per event is limited. This file holds the max number of bytes
> + that will be recorded into the ring buffer to hold this data.
> +
> dyn_ftrace_total_info:
>
> This file is for debugging purposes. The number of functions that
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index f80298e6aa16..aa28d7ca3e31 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -574,6 +574,19 @@ config FTRACE_SYSCALLS
> help
> Basic tracer to catch the syscall entry and exit events.
>
> +config TRACE_SYSCALL_BUF_SIZE_DEFAULT
> + int "System call user read max size"
> + range 0 128
> + default 128
> + depends on FTRACE_SYSCALLS
> + help
> + Some system call trace events will record the data from a user
> + space address that one of the parameters point to. The amount of
> + data per event is limited. It may be further limited by this
> + config and later changed by writing an ASCII number into:
> +
> + /sys/kernel/tracing/syscall_user_buf_size
> +
> config TRACER_SNAPSHOT
> bool "Create a snapshot trace buffer"
> select TRACER_MAX_TRACE
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d0b1964648c1..1db708ed0625 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> goto out;
> }
>
> +static ssize_t
> +tracing_syscall_buf_read(struct file *filp, char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(filp);
> + struct trace_array *tr = inode->i_private;
> + char buf[64];
> + int r;
> +
> + r = snprintf(buf, 64, "%d\n", tr->syscall_buf_sz);
> +
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> +}
> +
> +static ssize_t
> +tracing_syscall_buf_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(filp);
> + struct trace_array *tr = inode->i_private;
> + unsigned long val;
> + int ret;
> +
> + ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
> + if (ret)
> + return ret;
> +
> + if (val > SYSCALL_FAULT_USER_MAX)
> + val = SYSCALL_FAULT_USER_MAX;
> +
> + tr->syscall_buf_sz = val;
> +
> + *ppos += cnt;
> +
> + return cnt;
> +}
> +
> static ssize_t
> tracing_entries_read(struct file *filp, char __user *ubuf,
> size_t cnt, loff_t *ppos)
> @@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_fops = {
> .release = tracing_release_generic_tr,
> };
>
> +static const struct file_operations tracing_syscall_buf_fops = {
> + .open = tracing_open_generic_tr,
> + .read = tracing_syscall_buf_read,
> + .write = tracing_syscall_buf_write,
> + .llseek = generic_file_llseek,
> + .release = tracing_release_generic_tr,
> +};
> +
> static const struct file_operations tracing_buffer_meta_fops = {
> .open = tracing_buffer_meta_open,
> .read = seq_read,
> @@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const char *systems,
>
> raw_spin_lock_init(&tr->start_lock);
>
> + tr->syscall_buf_sz = global_trace.syscall_buf_sz;
> +
> tr->max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
> #ifdef CONFIG_TRACER_MAX_TRACE
> spin_lock_init(&tr->snapshot_trigger_lock);
> @@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
> trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
> tr, &buffer_subbuf_size_fops);
>
> + trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
> + tr, &tracing_syscall_buf_fops);
> +
> create_trace_options_dir(tr);
>
> #ifdef CONFIG_TRACER_MAX_TRACE
> @@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void)
>
> global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
>
> + global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
> +
> INIT_LIST_HEAD(&global_trace.systems);
> INIT_LIST_HEAD(&global_trace.events);
> INIT_LIST_HEAD(&global_trace.hist_vars);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 1dbf1d3cf2f1..1b3e464619f0 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -131,6 +131,8 @@ enum trace_type {
> #define HIST_STACKTRACE_SIZE (HIST_STACKTRACE_DEPTH * sizeof(unsigned long))
> #define HIST_STACKTRACE_SKIP 5
>
> +#define SYSCALL_FAULT_USER_MAX 128
> +
> /*
> * syscalls are special, and need special handling, this is why
> * they are not included in trace_entries.h
> @@ -430,6 +432,7 @@ struct trace_array {
> int function_enabled;
> #endif
> int no_filter_buffering_ref;
> + unsigned int syscall_buf_sz;
> struct list_head hist_vars;
> #ifdef CONFIG_TRACER_SNAPSHOT
> struct cond_snapshot *cond_snapshot;
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index b39fa9dd1067..e9162165c4d2 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -407,17 +407,16 @@ struct syscall_buf_info {
> * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
> * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
> * needs to append the EXTRA or not.
> + * (defined in kernel/trace/trace.h)
> *
> * This only allows up to 3 args from system calls.
> */
> #define SYSCALL_FAULT_BUF_SZ 512
> #define SYSCALL_FAULT_ARG_SZ 168
> -#define SYSCALL_FAULT_USER_MAX 128
> #define SYSCALL_FAULT_MAX_CNT 3
>
> static struct syscall_buf_info *syscall_buffer;
> static DEFINE_PER_CPU(unsigned long, sched_switch_cnt);
> -
> static int syscall_fault_buffer_cnt;
>
> static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
> @@ -524,7 +523,7 @@ static void syscall_fault_buffer_disable(void)
> call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
> }
>
> -static char *sys_fault_user(struct syscall_metadata *sys_data,
> +static char *sys_fault_user(struct trace_array *tr, struct syscall_metadata *sys_data,
> struct syscall_buf_info *sinfo,
> unsigned long *args,
> unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
> @@ -576,6 +575,10 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
> data_size[i] = -1; /* Denotes no pointer */
> }
>
> + /* A zero size means do not even try */
> + if (!tr->syscall_buf_sz)
> + return buffer;
> +
> again:
> /*
> * If this task is preempted by another user space task, it
> @@ -659,19 +662,20 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
> buf[x] = '.';
> }
>
> + size = min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX);
> +
> /*
> * If the text was truncated due to our max limit,
> * add "..." to the string.
> */
> - if (ret > SYSCALL_FAULT_USER_MAX) {
> - strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
> - sizeof(EXTRA));
> - ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
> + if (ret > size) {
> + strscpy(buf + size, EXTRA, sizeof(EXTRA));
> + ret = size + sizeof(EXTRA);
> } else {
> buf[ret++] = '\0';
> }
> } else {
> - ret = min(ret, SYSCALL_FAULT_USER_MAX);
> + ret = min((unsigned int)ret, tr->syscall_buf_sz);
> }
> data_size[i] = ret;
> }
> @@ -731,7 +735,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> if (!sinfo)
> return;
>
> - user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
> + user_ptr = sys_fault_user(tr, sys_data, sinfo, args, user_sizes);
> /*
> * user_size is the amount of data to append.
> * Need to add 4 for the meta field that points to
--
Douglas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
2025-08-06 10:24 ` Douglas Raillard
@ 2025-08-06 12:39 ` Steven Rostedt
2025-08-06 16:42 ` Douglas Raillard
0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-06 12:39 UTC (permalink / raw)
To: Douglas Raillard
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo
On Wed, 6 Aug 2025 11:24:33 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:
> On 05-08-2025 20:26, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> >
> > Currently the syscall trace events show each value as hexadecimal, but
> > without adding "0x" it can be confusing:
> >
> > sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
> >
> > Looks like the above write wrote 44 bytes, when in reality it wrote 68
> > bytes.
> >
> > Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
> > For values less than 10, leave off the "0x" as that just adds noise to the
> > output.
>
> I'm on the fence for the value-dependent format. This looks like
> it could easily make life harder for quick&dirty scripts, but then both
> awk's strtonum() and Python's int(x, base=16) seem to handle
> the presence/absence of the 0x prefix so maybe it's a non-issue.
Yes, but the trace file is more for humans than scripts, and the 0x1 is
just noise.
>
> OTH, a hand-crafted regex designed after a small set of input may start
> to randomly fail if one field unexpectedly goes beyond 10 ...
If you are doing hand crafted scripts, I'd suggest to use tracing
filters or other tooling that can handle this easily ;-)
>
> Just using explicit hex may be the best here, as the actual proper fix
> (type-level display hints) is harder. It could probably be implemented
> using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.
That would add a dependency on BTF, which I would like to avoid.
Thanks!
-- Steve
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
2025-08-06 10:50 ` Douglas Raillard
@ 2025-08-06 12:43 ` Steven Rostedt
2025-08-06 16:21 ` Douglas Raillard
0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-08-06 12:43 UTC (permalink / raw)
To: Douglas Raillard
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo
On Wed, 6 Aug 2025 11:50:06 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:
> On 05-08-2025 20:26, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> >
> > When a system call that reads user space addresses copy it to the ring
> > buffer, it can copy up to 511 bytes of data. This can waste precious ring
> > buffer space if the user isn't interested in the output. Add a new file
> > "syscall_user_buf_size" that gets initialized to a new config
> > CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.
>
> Have you considered dynamically removing some event fields ? We routinely hit
> the same problem with some of our events that have rarely-used large fields.
We do that already with eprobes. Note, syscall events are pseudo events
hooked on the raw_syscall events. Thus modifying what is displayed is
trivial as it's done manually anyway. For normal events, it's all in
the TRACE_EVENT() macro which defines the fields at boot. Trying to
modify it later is very difficult.
>
> If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
> that allowed selecting what field is needed that would be amazing. I had plans
> to build something like that in our kernel module based on the synthetic events API,
> but did not proceed as that API is not exported in a useful way.
Take a look at eprobes. You can make a new event based from an existing
event (including other dynamic events and syscalls).
I finally got around to adding documentation about it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst
-- Steve
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/7] tracing: Have syscall trace events read user space string
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-08-06 14:39 ` kernel test robot
2025-08-11 5:19 ` kernel test robot
1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 14:39 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linux Memory Management List, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
Ian Rogers, aahringo, Douglas Raillard
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250805193235.080757106%40kernel.org
patch subject: [PATCH 3/7] tracing: Have syscall trace events read user space string
config: parisc-randconfig-r071-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062230.puMRaDdE-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062230.puMRaDdE-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062230.puMRaDdE-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/trace/trace_syscalls.c: In function 'check_faultable_syscall':
>> kernel/trace/trace_syscalls.c:886:14: error: '__NR_newfstatat' undeclared (first use in this function); did you mean 'sys_newfstatat'?
886 | case __NR_newfstatat:
| ^~~~~~~~~~~~~~~
| sys_newfstatat
kernel/trace/trace_syscalls.c:886:14: note: each undeclared identifier is reported only once for each function it appears in
vim +886 kernel/trace/trace_syscalls.c
802
803 /*
804 * For system calls that reference user space memory that can
805 * be recorded into the event, set the system call meta data's user_mask
806 * to the "args" index that points to the user space memory to retrieve.
807 */
808 static void check_faultable_syscall(struct trace_event_call *call, int nr)
809 {
810 struct syscall_metadata *sys_data = call->data;
811
812 /* Only work on entry */
813 if (sys_data->enter_event != call)
814 return;
815
816 switch (nr) {
817 /* user arg at position 0 */
818 case __NR_access:
819 case __NR_acct:
820 case __NR_add_key: /* Just _type. TODO add _description */
821 case __NR_chdir:
822 case __NR_chown:
823 case __NR_chmod:
824 case __NR_chroot:
825 case __NR_creat:
826 case __NR_delete_module:
827 case __NR_execve:
828 case __NR_fsopen:
829 case __NR_getxattr: /* Just pathname, TODO add name */
830 case __NR_lchown:
831 case __NR_lgetxattr: /* Just pathname, TODO add name */
832 case __NR_lremovexattr: /* Just pathname, TODO add name */
833 case __NR_link: /* Just oldname. TODO add newname */
834 case __NR_listxattr: /* Just pathname, TODO add list */
835 case __NR_llistxattr: /* Just pathname, TODO add list */
836 case __NR_lsetxattr: /* Just pathname, TODO add list */
837 case __NR_open:
838 case __NR_memfd_create:
839 case __NR_mount: /* Just dev_name, TODO add dir_name and type */
840 case __NR_mkdir:
841 case __NR_mknod:
842 case __NR_mq_open:
843 case __NR_mq_unlink:
844 case __NR_pivot_root: /* Just new_root, TODO add old_root */
845 case __NR_readlink:
846 case __NR_removexattr: /* Just pathname, TODO add name */
847 case __NR_rename: /* Just oldname. TODO add newname */
848 case __NR_request_key: /* Just _type. TODO add _description */
849 case __NR_rmdir:
850 case __NR_setxattr: /* Just pathname, TODO add list */
851 case __NR_shmdt:
852 case __NR_statfs:
853 case __NR_swapon:
854 case __NR_swapoff:
855 case __NR_symlink: /* Just oldname. TODO add newname */
856 case __NR_truncate:
857 case __NR_unlink:
858 case __NR_umount2:
859 case __NR_utime:
860 case __NR_utimes:
861 sys_data->user_mask = BIT(0);
862 break;
863 /* user arg at position 1 */
864 case __NR_execveat:
865 case __NR_faccessat:
866 case __NR_faccessat2:
867 case __NR_finit_module:
868 case __NR_fchmodat:
869 case __NR_fchmodat2:
870 case __NR_fchownat:
871 case __NR_fgetxattr:
872 case __NR_flistxattr:
873 case __NR_fsetxattr:
874 case __NR_fspick:
875 case __NR_fremovexattr:
876 case __NR_futimesat:
877 case __NR_getxattrat: /* Just pathname, TODO add name */
878 case __NR_inotify_add_watch:
879 case __NR_linkat: /* Just oldname. TODO add newname */
880 case __NR_listxattrat: /* Just pathname, TODO add list */
881 case __NR_mkdirat:
882 case __NR_mknodat:
883 case __NR_mount_setattr:
884 case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
885 case __NR_name_to_handle_at:
> 886 case __NR_newfstatat:
887 case __NR_openat:
888 case __NR_openat2:
889 case __NR_open_tree:
890 case __NR_open_tree_attr:
891 case __NR_readlinkat:
892 case __NR_renameat: /* Just oldname. TODO add newname */
893 case __NR_renameat2: /* Just oldname. TODO add newname */
894 case __NR_removexattrat: /* Just pathname, TODO add name */
895 case __NR_quotactl:
896 case __NR_setxattrat: /* Just pathname, TODO add list */
897 case __NR_syslog:
898 case __NR_symlinkat: /* Just oldname. TODO add newname */
899 case __NR_statx:
900 case __NR_unlinkat:
901 case __NR_utimensat:
902 sys_data->user_mask = BIT(1);
903 break;
904 /* user arg at position 2 */
905 case __NR_init_module:
906 case __NR_fsconfig:
907 sys_data->user_mask = BIT(2);
908 break;
909 /* user arg at position 4 */
910 case __NR_fanotify_mark:
911 sys_data->user_mask = BIT(4);
912 break;
913 default:
914 sys_data->user_mask = 0;
915 }
916 }
917
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
2025-08-06 10:50 ` Douglas Raillard
@ 2025-08-06 14:50 ` kernel test robot
1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 14:50 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: llvm, oe-kbuild-all, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linux Memory Management List,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo, Douglas Raillard
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250805193235.747004484%40kernel.org
patch subject: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
config: hexagon-randconfig-002-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062211.cwYqtLu0-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 7b8dea265e72c3037b6b1e54d5ab51b7e14f328b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062211.cwYqtLu0-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062211.cwYqtLu0-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/trace/trace.c:11128:32: error: use of undeclared identifier 'CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT'
11128 | global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
vim +/CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT +11128 kernel/trace/trace.c
11110
11111 init_trace_flags_index(&global_trace);
11112
11113 register_tracer(&nop_trace);
11114
11115 /* Function tracing may start here (via kernel command line) */
11116 init_function_trace();
11117
11118 /* All seems OK, enable tracing */
11119 tracing_disabled = 0;
11120
11121 atomic_notifier_chain_register(&panic_notifier_list,
11122 &trace_panic_notifier);
11123
11124 register_die_notifier(&trace_die_notifier);
11125
11126 global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
11127
11128 global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
11129
11130 INIT_LIST_HEAD(&global_trace.systems);
11131 INIT_LIST_HEAD(&global_trace.events);
11132 INIT_LIST_HEAD(&global_trace.hist_vars);
11133 INIT_LIST_HEAD(&global_trace.err_log);
11134 list_add(&global_trace.marker_list, &marker_copies);
11135 list_add(&global_trace.list, &ftrace_trace_arrays);
11136
11137 apply_trace_boot_options();
11138
11139 register_snapshot_cmd();
11140
11141 return 0;
11142
11143 out_free_pipe_cpumask:
11144 free_cpumask_var(global_trace.pipe_cpumask);
11145 out_free_savedcmd:
11146 trace_free_saved_cmdlines_buffer();
11147 out_free_temp_buffer:
11148 ring_buffer_free(temp_buffer);
11149 out_rm_hp_state:
11150 cpuhp_remove_multi_state(CPUHP_TRACE_RB_PREPARE);
11151 out_free_cpumask:
11152 free_cpumask_var(global_trace.tracing_cpumask);
11153 out_free_buffer_mask:
11154 free_cpumask_var(tracing_buffer_mask);
11155 return ret;
11156 }
11157
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 5/7] tracing: Display some syscall arrays as strings
2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
@ 2025-08-06 15:12 ` kernel test robot
0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 15:12 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linux Memory Management List, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
Ian Rogers, aahringo, Douglas Raillard
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250805193235.416382557%40kernel.org
patch subject: [PATCH 5/7] tracing: Display some syscall arrays as strings
config: i386-randconfig-141-20250806 (https://download.01.org/0day-ci/archive/20250806/202508062215.aRvDXrBA-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250806/202508062215.aRvDXrBA-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508062215.aRvDXrBA-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/trace/trace_syscalls.c: In function 'check_faultable_syscall':
>> kernel/trace/trace_syscalls.c:883:14: error: '__NR_kexec_file_load' undeclared (first use in this function)
883 | case __NR_kexec_file_load:
| ^~~~~~~~~~~~~~~~~~~~
kernel/trace/trace_syscalls.c:883:14: note: each undeclared identifier is reported only once for each function it appears in
kernel/trace/trace_syscalls.c:957:14: error: '__NR_newfstatat' undeclared (first use in this function)
957 | case __NR_newfstatat:
| ^~~~~~~~~~~~~~~
vim +/__NR_kexec_file_load +883 kernel/trace/trace_syscalls.c
851
852 /*
853 * For system calls that reference user space memory that can
854 * be recorded into the event, set the system call meta data's user_mask
855 * to the "args" index that points to the user space memory to retrieve.
856 */
857 static void check_faultable_syscall(struct trace_event_call *call, int nr)
858 {
859 struct syscall_metadata *sys_data = call->data;
860
861 /* Only work on entry */
862 if (sys_data->enter_event != call)
863 return;
864
865 sys_data->user_arg_size = -1;
866
867 switch (nr) {
868 /* user arg 1 with size arg at 2 */
869 case __NR_write:
870 case __NR_mq_timedsend:
871 case __NR_pwrite64:
872 sys_data->user_mask = BIT(1);
873 sys_data->user_arg_size = 2;
874 break;
875 /* user arg 0 with size arg at 1 as string */
876 case __NR_setdomainname:
877 case __NR_sethostname:
878 sys_data->user_mask = BIT(0);
879 sys_data->user_arg_size = 1;
880 sys_data->user_arg_is_str = 1;
881 break;
882 /* user arg 4 with size arg at 3 as string */
> 883 case __NR_kexec_file_load:
884 sys_data->user_mask = BIT(4);
885 sys_data->user_arg_size = 3;
886 sys_data->user_arg_is_str = 1;
887 break;
888 /* user arg at position 0 */
889 case __NR_access:
890 case __NR_acct:
891 case __NR_add_key: /* Just _type. TODO add _description */
892 case __NR_chdir:
893 case __NR_chown:
894 case __NR_chmod:
895 case __NR_chroot:
896 case __NR_creat:
897 case __NR_delete_module:
898 case __NR_execve:
899 case __NR_fsopen:
900 case __NR_getxattr: /* Just pathname, TODO add name */
901 case __NR_lchown:
902 case __NR_lgetxattr: /* Just pathname, TODO add name */
903 case __NR_lremovexattr: /* Just pathname, TODO add name */
904 case __NR_link: /* Just oldname. TODO add newname */
905 case __NR_listxattr: /* Just pathname, TODO add list */
906 case __NR_llistxattr: /* Just pathname, TODO add list */
907 case __NR_lsetxattr: /* Just pathname, TODO add list */
908 case __NR_open:
909 case __NR_memfd_create:
910 case __NR_mount: /* Just dev_name, TODO add dir_name and type */
911 case __NR_mkdir:
912 case __NR_mknod:
913 case __NR_mq_open:
914 case __NR_mq_unlink:
915 case __NR_pivot_root: /* Just new_root, TODO add old_root */
916 case __NR_readlink:
917 case __NR_removexattr: /* Just pathname, TODO add name */
918 case __NR_rename: /* Just oldname. TODO add newname */
919 case __NR_request_key: /* Just _type. TODO add _description */
920 case __NR_rmdir:
921 case __NR_setxattr: /* Just pathname, TODO add list */
922 case __NR_shmdt:
923 case __NR_statfs:
924 case __NR_swapon:
925 case __NR_swapoff:
926 case __NR_symlink: /* Just oldname. TODO add newname */
927 case __NR_truncate:
928 case __NR_unlink:
929 case __NR_umount2:
930 case __NR_utime:
931 case __NR_utimes:
932 sys_data->user_mask = BIT(0);
933 break;
934 /* user arg at position 1 */
935 case __NR_execveat:
936 case __NR_faccessat:
937 case __NR_faccessat2:
938 case __NR_finit_module:
939 case __NR_fchmodat:
940 case __NR_fchmodat2:
941 case __NR_fchownat:
942 case __NR_fgetxattr:
943 case __NR_flistxattr:
944 case __NR_fsetxattr:
945 case __NR_fspick:
946 case __NR_fremovexattr:
947 case __NR_futimesat:
948 case __NR_getxattrat: /* Just pathname, TODO add name */
949 case __NR_inotify_add_watch:
950 case __NR_linkat: /* Just oldname. TODO add newname */
951 case __NR_listxattrat: /* Just pathname, TODO add list */
952 case __NR_mkdirat:
953 case __NR_mknodat:
954 case __NR_mount_setattr:
955 case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
956 case __NR_name_to_handle_at:
957 case __NR_newfstatat:
958 case __NR_openat:
959 case __NR_openat2:
960 case __NR_open_tree:
961 case __NR_open_tree_attr:
962 case __NR_readlinkat:
963 case __NR_renameat: /* Just oldname. TODO add newname */
964 case __NR_renameat2: /* Just oldname. TODO add newname */
965 case __NR_removexattrat: /* Just pathname, TODO add name */
966 case __NR_quotactl:
967 case __NR_setxattrat: /* Just pathname, TODO add list */
968 case __NR_syslog:
969 case __NR_symlinkat: /* Just oldname. TODO add newname */
970 case __NR_statx:
971 case __NR_unlinkat:
972 case __NR_utimensat:
973 sys_data->user_mask = BIT(1);
974 break;
975 /* user arg at position 2 */
976 case __NR_init_module:
977 case __NR_fsconfig:
978 sys_data->user_mask = BIT(2);
979 break;
980 /* user arg at position 4 */
981 case __NR_fanotify_mark:
982 sys_data->user_mask = BIT(4);
983 break;
984 default:
985 sys_data->user_mask = 0;
986 }
987 }
988
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written
2025-08-06 12:43 ` Steven Rostedt
@ 2025-08-06 16:21 ` Douglas Raillard
0 siblings, 0 replies; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 16:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo
On 06-08-2025 13:43, Steven Rostedt wrote:
> On Wed, 6 Aug 2025 11:50:06 +0100
> Douglas Raillard <douglas.raillard@arm.com> wrote:
>
>> On 05-08-2025 20:26, Steven Rostedt wrote:
>>> From: Steven Rostedt <rostedt@goodmis.org>
>>>
>>> When a system call that reads user space addresses copy it to the ring
>>> buffer, it can copy up to 511 bytes of data. This can waste precious ring
>>> buffer space if the user isn't interested in the output. Add a new file
>>> "syscall_user_buf_size" that gets initialized to a new config
>>> CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.
>>
>> Have you considered dynamically removing some event fields ? We routinely hit
>> the same problem with some of our events that have rarely-used large fields.
>
> We do that already with eprobes. Note, syscall events are pseudo events
> hooked on the raw_syscall events. Thus modifying what is displayed is
> trivial as it's done manually anyway. For normal events, it's all in
> the TRACE_EVENT() macro which defines the fields at boot. Trying to
> modify it later is very difficult.
I was thinking at a filtering step between assigning to an event struct
with TP_fast_assign and actually writing it to the buffer. An array of (offset, size)
would allow selecting which field is to be copied to the buffer, the rest would
be left out (a bit like in some parts of the synthetic event API). The format
file would be impacted to remove some fields, but hopefully not too many other
corners of ftrace.
The advantage of that over eprobe would be:
1. full support of all field types
2. probably lower overhead than the fetch_op interpreter, but maybe not by much.
3. less moving pieces for the user (e.g. no need to have BTF for by-name field access,
no new event name to come up with etc.)
>
>>
>> If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
>> that allowed selecting what field is needed that would be amazing. I had plans
>> to build something like that in our kernel module based on the synthetic events API,
>> but did not proceed as that API is not exported in a useful way.
>
> Take a look at eprobes. You can make a new event based from an existing
> event (including other dynamic events and syscalls).
> I finally got around to adding documentation about it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst
>
That's very interesting, I did not realize that you could access the actual event fields
and not just the tracepoint args. With your recent BTF patch, there is now little limits
on how deep you can drill down in the structs which is great (and actually more powerful
than the original event itself).
Before userspace tooling could make use of that as a field filtering system, a few friction
points would need to be addressed:
1. Getting the field list programmatically is currently not really possible as dealing with
the format file is very tricky. We could just pass on the user-requested field
to the kernel but that would prevent userspace validation with usable error reporting
(the 6.15 kernel I tried it on gave me EINVAL and not even a dmesg error when trying to use
a field that does not exist)
2. The type of the field is not inferred, e.g. an explicit ":string" is needed here:
e:my/sched_switch sched.sched_switch prev_comm=$prev_comm:string
The only place a tool can get this info from is the format file, which means you have to
parse it and apply some conversions (e.g. "__data_loc char[]" becomes "string").
3. Only a restricted subset of field types is supported, e.g. no cpumask, buffers other
than strings etc. In practice, this means the userspace tooling will have to either:
* pass on the restriction to the users (can easily lead to a terrible UX by misleading
the user to think filtering is generally available when in fact it's not).
* or only treat that as a hint and use the unfiltered original event if the user asks
for a field with an unsupported type.
On the bright side, creating a new event like "e:my/sched_switch" gives the event name "sched_switch" but
trace-cmd start -e my/sched_switch will only enable the new event which is exactly what we need.
This way, the trace can look like a normal one except less fields, so downstream data processing
is not impacted and only the data-gathering step needs to know about it.
Depending on whether we want/can deal with those friction point, it could either become a high-level
layer usable like the base event system with extra low-level abilities, or stay as a tool only suitable for
hand-crafted use cases where the user has deeper knowledge of layout on all involved kernels.
On a related note, if we wanted to make something that allowed reducing the amount of stored data and
that could deeply integrate with the userspace tooling in charge of collecting the data to run a user-defined query,
the best bet is to target SQL-like systems. That family is very established and virtually all trace-processing system
will use it as first stage (e.g. Perfetto with sqlite, or LISA with Polars dataframes).
In those systems, some important information can typically be extracted from the user query [1]:
1. Projection: which tables and columns the query needs. In ftrace, that's the list of events and what fields
are needed. Other events/fields can be discarded as they won't be read by the query.
2. Row limit: how many rows the query will read (not always available obviously). In ftrace, that would allow
automatically stopping the tracing when the event count reaches a limit, or set the buffer size based on
the event size for a flight-recorder approach. Additional event occurrences would be discarded by the query
anyway.
3. Predicate filtering: If the query contains a filter to only select rows with a column equal to a specific
value. Other rows don't need to be collected as the query will discard them anyway.
Currently:
1. is partially implemented as you can select specific events, but not what field you want.
2. is partially implemented (buffer size, but AFAIK there is no way of telling ftrace to stop tracing after N events).
3. is fully implemented with /sys/kernel/debug/tracing/events/*/*/filter
If all those are implemented, ftrace would be able to make use of the most important implicit info available
in the user query to limit the collected data size, without the user having to tune anything manually
and without turning the kernel into a full-blown SQL interpreter.
[1] In the Polars dataframe library, data sources such as a parquet file served over HTTP are called "scans".
When Polars executes an expression, it will get the data from the scans the expression refers to,
and will pass the 3 pieces of info to the scan implementation so that processed data size can be minimized
as early as possible in the pipeline. This is referred to as "projection pushdown", "slice pushdown" and "predicate pushdown":
https://docs.pola.rs/user-guide/lazy/optimizations/
If some filtering condition is too complex to express in the limited scan predicate language, filtering will happen
later in the pipeline. If the scan does not have a smart way to apply the filter (e.g. projection pushdown for a row-oriented file format
will probably not bring massive speed improvements) then more data than necessary will be fetched and filtering will happen
later in the pipeline.
> -- Steve
--
Douglas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
2025-08-06 12:39 ` Steven Rostedt
@ 2025-08-06 16:42 ` Douglas Raillard
2025-08-12 19:22 ` Steven Rostedt
0 siblings, 1 reply; 22+ messages in thread
From: Douglas Raillard @ 2025-08-06 16:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo
On 06-08-2025 13:39, Steven Rostedt wrote:
> On Wed, 6 Aug 2025 11:24:33 +0100
> Douglas Raillard <douglas.raillard@arm.com> wrote:
>
>> On 05-08-2025 20:26, Steven Rostedt wrote:
>>> From: Steven Rostedt <rostedt@goodmis.org>
>>>
>>> Currently the syscall trace events show each value as hexadecimal, but
>>> without adding "0x" it can be confusing:
>>>
>>> sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
>>>
>>> Looks like the above write wrote 44 bytes, when in reality it wrote 68
>>> bytes.
>>>
>>> Add a "0x" for all values greater or equal to 10 to remove the ambiguity.
>>> For values less than 10, leave off the "0x" as that just adds noise to the
>>> output.
>>
>> I'm on the fence for the value-dependent format. This looks like
>> it could easily make life harder for quick&dirty scripts, but then both
>> awk's strtonum() and Python's int(x, base=16) seem to handle
>> the presence/absence of the 0x prefix so maybe it's a non-issue.
>
> Yes, but the trace file is more for humans than scripts, and the 0x1 is
> just noise.
>
>>
>> OTH, a hand-crafted regex designed after a small set of input may start
>> to randomly fail if one field unexpectedly goes beyond 10 ...
>
> If you are doing hand crafted scripts, I'd suggest to use tracing
> filters or other tooling that can handle this easily ;-)
Well, we both know I don't do this sort of things (or do I ? :) ) but for
the first 5+ years of its existence, LISA was actually parsing the text
output of trace-cmd, and I'd be surprised if it was the only place this was done.
AFAIK to this day, there is no tool providing a simple script/SQL interface
to a trace.dat file beyond basic filtering like trace-cmd report. I've had this
PR [1] opened for a while in LISA that does exactly that but got distracted with
other things:
# Show all the CPUs that emitted a cpu_frequency event
lisa-trace sql trace.dat --query 'SELECT DISTINCT cpu_id FROM cpu_frequency'
shape: (8, 1)
┌────────┐
│ cpu_id │
│ --- │
│ u32 │
╞════════╡
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 7 │
│ 4 │
│ 5 │
│ 6 │
└────────┘
[1] https://gitlab.arm.com/tooling/lisa/-/merge_requests/2444
>
>>
>> Just using explicit hex may be the best here, as the actual proper fix
>> (type-level display hints) is harder. It could probably be implemented
>> using btf_decl_tag() and __builtin_btf_type_id() to retrieve the BTF info.
>
> That would add a dependency on BTF, which I would like to avoid.
>
> Thanks!
>
> -- Steve
>
--
Douglas
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
@ 2025-08-06 18:39 ` Paul E. McKenney
2025-08-06 22:05 ` kernel test robot
1 sibling, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2025-08-06 18:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Namhyung Kim,
Takaya Saeki, Tom Zanussi, Thomas Gleixner, Ian Rogers, aahringo,
Douglas Raillard
On Tue, Aug 05, 2025 at 03:26:47PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> The syscall events are pseudo events that hook to the raw syscalls. The
> ftrace_syscall_enter/exit() callback is called by the raw_syscall
> enter/exit tracepoints respectively whenever any of the syscall events are
> enabled.
>
> The trace_array has an array of syscall "files" that correspond to the
> system calls based on their __NR_SYSCALL number. The array is read and if
> there's a pointer to a trace_event_file then it is considered enabled and
> if it is NULL that syscall event is considered disabled.
>
> Currently it uses an rcu_dereference_sched() to get this pointer and a
> rcu_assign_ptr() or RCU_INIT_POINTER() to write to it. This is unnecessary
> as the file pointer will not go away outside the synchronization of the
> tracepoint logic itself. And this code adds no extra RCU synchronization
> that uses this.
>
> Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
> is all they need. This will also allow this code to not depend on
> preemption being disabled as system call tracepoints are now allowed to
> fault.
>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
From an RCU-removal viewpoint:
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
But is it possible to give some sort of warning just in case some creative
future developer figures out how to make the file pointer go away outside
of the synchronization of the tracepoint logic itself?
Thanx, Paul
> ---
> kernel/trace/trace_syscalls.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 46aab0ab9350..3a0b65f89130 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> return;
>
> - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> - trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> + trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
> if (!trace_file)
> return;
>
> @@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> return;
>
> - /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
> - trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
> + trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
> if (!trace_file)
> return;
>
> @@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
> if (!tr->sys_refcount_enter)
> ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
> if (!ret) {
> - rcu_assign_pointer(tr->enter_syscall_files[num], file);
> + WRITE_ONCE(tr->enter_syscall_files[num], file);
> tr->sys_refcount_enter++;
> }
> mutex_unlock(&syscall_trace_lock);
> @@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
> return;
> mutex_lock(&syscall_trace_lock);
> tr->sys_refcount_enter--;
> - RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
> + WRITE_ONCE(tr->enter_syscall_files[num], NULL);
> if (!tr->sys_refcount_enter)
> unregister_trace_sys_enter(ftrace_syscall_enter, tr);
> mutex_unlock(&syscall_trace_lock);
> @@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
> if (!tr->sys_refcount_exit)
> ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
> if (!ret) {
> - rcu_assign_pointer(tr->exit_syscall_files[num], file);
> + WRITE_ONCE(tr->exit_syscall_files[num], file);
> tr->sys_refcount_exit++;
> }
> mutex_unlock(&syscall_trace_lock);
> @@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
> return;
> mutex_lock(&syscall_trace_lock);
> tr->sys_refcount_exit--;
> - RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
> + WRITE_ONCE(tr->exit_syscall_files[num], NULL);
> if (!tr->sys_refcount_exit)
> unregister_trace_sys_exit(ftrace_syscall_exit, tr);
> mutex_unlock(&syscall_trace_lock);
> --
> 2.47.2
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-08-06 18:39 ` Paul E. McKenney
@ 2025-08-06 22:05 ` kernel test robot
1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 22:05 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linux Memory Management List, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
Ian Rogers, aahringo, Douglas Raillard, Paul E. McKenney
Hi Steven,
kernel test robot noticed the following build warnings:
[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250805193234.745705874%40kernel.org
patch subject: [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070546.KmEoxWkg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070546.KmEoxWkg-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/trace/trace_syscalls.c:313:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@
kernel/trace/trace_syscalls.c:313:20: sparse: expected struct trace_event_file *trace_file
kernel/trace/trace_syscalls.c:313:20: sparse: got struct trace_event_file [noderef] __rcu *
kernel/trace/trace_syscalls.c:358:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@
kernel/trace/trace_syscalls.c:358:20: sparse: expected struct trace_event_file *trace_file
kernel/trace/trace_syscalls.c:358:20: sparse: got struct trace_event_file [noderef] __rcu *
>> kernel/trace/trace_syscalls.c:394:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@
kernel/trace/trace_syscalls.c:394:17: sparse: expected struct trace_event_file [noderef] __rcu *volatile
kernel/trace/trace_syscalls.c:394:17: sparse: got struct trace_event_file *file
kernel/trace/trace_syscalls.c:432:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@
kernel/trace/trace_syscalls.c:432:17: sparse: expected struct trace_event_file [noderef] __rcu *volatile
kernel/trace/trace_syscalls.c:432:17: sparse: got struct trace_event_file *file
vim +313 kernel/trace/trace_syscalls.c
290
291 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
292 {
293 struct trace_array *tr = data;
294 struct trace_event_file *trace_file;
295 struct syscall_trace_enter *entry;
296 struct syscall_metadata *sys_data;
297 struct trace_event_buffer fbuffer;
298 unsigned long args[6];
299 int syscall_nr;
300 int size;
301
302 /*
303 * Syscall probe called with preemption enabled, but the ring
304 * buffer and per-cpu data require preemption to be disabled.
305 */
306 might_fault();
307 guard(preempt_notrace)();
308
309 syscall_nr = trace_get_syscall_nr(current, regs);
310 if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
311 return;
312
> 313 trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
314 if (!trace_file)
315 return;
316
317 if (trace_trigger_soft_disabled(trace_file))
318 return;
319
320 sys_data = syscall_nr_to_meta(syscall_nr);
321 if (!sys_data)
322 return;
323
324 size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
325
326 entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
327 if (!entry)
328 return;
329
330 entry = ring_buffer_event_data(fbuffer.event);
331 entry->nr = syscall_nr;
332 syscall_get_arguments(current, regs, args);
333 memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
334
335 trace_event_buffer_commit(&fbuffer);
336 }
337
338 static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
339 {
340 struct trace_array *tr = data;
341 struct trace_event_file *trace_file;
342 struct syscall_trace_exit *entry;
343 struct syscall_metadata *sys_data;
344 struct trace_event_buffer fbuffer;
345 int syscall_nr;
346
347 /*
348 * Syscall probe called with preemption enabled, but the ring
349 * buffer and per-cpu data require preemption to be disabled.
350 */
351 might_fault();
352 guard(preempt_notrace)();
353
354 syscall_nr = trace_get_syscall_nr(current, regs);
355 if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
356 return;
357
358 trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
359 if (!trace_file)
360 return;
361
362 if (trace_trigger_soft_disabled(trace_file))
363 return;
364
365 sys_data = syscall_nr_to_meta(syscall_nr);
366 if (!sys_data)
367 return;
368
369 entry = trace_event_buffer_reserve(&fbuffer, trace_file, sizeof(*entry));
370 if (!entry)
371 return;
372
373 entry = ring_buffer_event_data(fbuffer.event);
374 entry->nr = syscall_nr;
375 entry->ret = syscall_get_return_value(current, regs);
376
377 trace_event_buffer_commit(&fbuffer);
378 }
379
380 static int reg_event_syscall_enter(struct trace_event_file *file,
381 struct trace_event_call *call)
382 {
383 struct trace_array *tr = file->tr;
384 int ret = 0;
385 int num;
386
387 num = ((struct syscall_metadata *)call->data)->syscall_nr;
388 if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
389 return -ENOSYS;
390 mutex_lock(&syscall_trace_lock);
391 if (!tr->sys_refcount_enter)
392 ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
393 if (!ret) {
> 394 WRITE_ONCE(tr->enter_syscall_files[num], file);
395 tr->sys_refcount_enter++;
396 }
397 mutex_unlock(&syscall_trace_lock);
398 return ret;
399 }
400
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
@ 2025-08-06 23:52 ` kernel test robot
0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-06 23:52 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: oe-kbuild-all, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Linux Memory Management List, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
Ian Rogers, aahringo, Douglas Raillard
Hi Steven,
kernel test robot noticed the following build warnings:
[auto build test WARNING on trace/for-next]
[also build test WARNING on linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250805193235.582013098%40kernel.org
patch subject: [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter
config: x86_64-randconfig-123-20250806 (https://download.01.org/0day-ci/archive/20250807/202508070706.TiTQY0Ne-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070706.TiTQY0Ne-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070706.TiTQY0Ne-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/trace/trace_syscalls.c:620:53: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected void const [noderef] __user *from @@ got char *ptr @@
kernel/trace/trace_syscalls.c:620:53: sparse: expected void const [noderef] __user *from
kernel/trace/trace_syscalls.c:620:53: sparse: got char *ptr
>> kernel/trace/trace_syscalls.c:623:54: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected char const [noderef] __user *src @@ got char *ptr @@
kernel/trace/trace_syscalls.c:623:54: sparse: expected char const [noderef] __user *src
kernel/trace/trace_syscalls.c:623:54: sparse: got char *ptr
kernel/trace/trace_syscalls.c:707:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@
kernel/trace/trace_syscalls.c:707:20: sparse: expected struct trace_event_file *trace_file
kernel/trace/trace_syscalls.c:707:20: sparse: got struct trace_event_file [noderef] __rcu *
kernel/trace/trace_syscalls.c:824:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file *trace_file @@ got struct trace_event_file [noderef] __rcu * @@
kernel/trace/trace_syscalls.c:824:20: sparse: expected struct trace_event_file *trace_file
kernel/trace/trace_syscalls.c:824:20: sparse: got struct trace_event_file [noderef] __rcu *
kernel/trace/trace_syscalls.c:871:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@
kernel/trace/trace_syscalls.c:871:9: sparse: expected struct trace_event_file [noderef] __rcu *volatile
kernel/trace/trace_syscalls.c:871:9: sparse: got struct trace_event_file *file
kernel/trace/trace_syscalls.c:909:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct trace_event_file [noderef] __rcu *volatile @@ got struct trace_event_file *file @@
kernel/trace/trace_syscalls.c:909:17: sparse: expected struct trace_event_file [noderef] __rcu *volatile
kernel/trace/trace_syscalls.c:909:17: sparse: got struct trace_event_file *file
vim +620 kernel/trace/trace_syscalls.c
6bc850d6f8f730 Steven Rostedt 2025-08-05 526
623bd9e046f95c Steven Rostedt 2025-08-05 527 static char *sys_fault_user(struct syscall_metadata *sys_data,
623bd9e046f95c Steven Rostedt 2025-08-05 528 struct syscall_buf_info *sinfo,
623bd9e046f95c Steven Rostedt 2025-08-05 529 unsigned long *args,
623bd9e046f95c Steven Rostedt 2025-08-05 530 unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
6bc850d6f8f730 Steven Rostedt 2025-08-05 531 {
623bd9e046f95c Steven Rostedt 2025-08-05 532 char *buffer = per_cpu_ptr(sinfo->sbuf, smp_processor_id())->buf;
6bc850d6f8f730 Steven Rostedt 2025-08-05 533 unsigned long mask = sys_data->user_mask;
623bd9e046f95c Steven Rostedt 2025-08-05 534 unsigned long size = SYSCALL_FAULT_ARG_SZ - 1;
6bc850d6f8f730 Steven Rostedt 2025-08-05 535 unsigned int cnt;
b979d33ec48bbd Steven Rostedt 2025-08-05 536 bool array = false;
623bd9e046f95c Steven Rostedt 2025-08-05 537 char *ptr_array[SYSCALL_FAULT_MAX_CNT];
623bd9e046f95c Steven Rostedt 2025-08-05 538 char *buf;
623bd9e046f95c Steven Rostedt 2025-08-05 539 int read[SYSCALL_FAULT_MAX_CNT];
6bc850d6f8f730 Steven Rostedt 2025-08-05 540 int trys = 0;
623bd9e046f95c Steven Rostedt 2025-08-05 541 int uargs;
6bc850d6f8f730 Steven Rostedt 2025-08-05 542 int ret;
623bd9e046f95c Steven Rostedt 2025-08-05 543 int i = 0;
623bd9e046f95c Steven Rostedt 2025-08-05 544
623bd9e046f95c Steven Rostedt 2025-08-05 545 /* The extra is appended to the user data in the buffer */
623bd9e046f95c Steven Rostedt 2025-08-05 546 BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=
623bd9e046f95c Steven Rostedt 2025-08-05 547 SYSCALL_FAULT_ARG_SZ);
623bd9e046f95c Steven Rostedt 2025-08-05 548
623bd9e046f95c Steven Rostedt 2025-08-05 549 /*
623bd9e046f95c Steven Rostedt 2025-08-05 550 * If this system call event has a size argument, use
623bd9e046f95c Steven Rostedt 2025-08-05 551 * it to define how much of user space memory to read,
623bd9e046f95c Steven Rostedt 2025-08-05 552 * and read it as an array and not a string.
623bd9e046f95c Steven Rostedt 2025-08-05 553 */
623bd9e046f95c Steven Rostedt 2025-08-05 554 if (sys_data->user_arg_size >= 0) {
623bd9e046f95c Steven Rostedt 2025-08-05 555 array = true;
623bd9e046f95c Steven Rostedt 2025-08-05 556 size = args[sys_data->user_arg_size];
623bd9e046f95c Steven Rostedt 2025-08-05 557 if (size > SYSCALL_FAULT_ARG_SZ - 1)
623bd9e046f95c Steven Rostedt 2025-08-05 558 size = SYSCALL_FAULT_ARG_SZ - 1;
623bd9e046f95c Steven Rostedt 2025-08-05 559 }
623bd9e046f95c Steven Rostedt 2025-08-05 560
623bd9e046f95c Steven Rostedt 2025-08-05 561 while (mask) {
623bd9e046f95c Steven Rostedt 2025-08-05 562 int idx = ffs(mask) - 1;
623bd9e046f95c Steven Rostedt 2025-08-05 563 mask &= ~BIT(idx);
623bd9e046f95c Steven Rostedt 2025-08-05 564
623bd9e046f95c Steven Rostedt 2025-08-05 565 if (WARN_ON_ONCE(i == SYSCALL_FAULT_MAX_CNT))
623bd9e046f95c Steven Rostedt 2025-08-05 566 break;
6bc850d6f8f730 Steven Rostedt 2025-08-05 567
6bc850d6f8f730 Steven Rostedt 2025-08-05 568 /* Get the pointer to user space memory to read */
623bd9e046f95c Steven Rostedt 2025-08-05 569 ptr_array[i++] = (char *)args[idx];
623bd9e046f95c Steven Rostedt 2025-08-05 570 }
623bd9e046f95c Steven Rostedt 2025-08-05 571
623bd9e046f95c Steven Rostedt 2025-08-05 572 uargs = i;
623bd9e046f95c Steven Rostedt 2025-08-05 573
623bd9e046f95c Steven Rostedt 2025-08-05 574 /* Clear the values that are not used */
623bd9e046f95c Steven Rostedt 2025-08-05 575 for (; i < SYSCALL_FAULT_MAX_CNT; i++) {
623bd9e046f95c Steven Rostedt 2025-08-05 576 data_size[i] = -1; /* Denotes no pointer */
623bd9e046f95c Steven Rostedt 2025-08-05 577 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 578
6bc850d6f8f730 Steven Rostedt 2025-08-05 579 again:
6bc850d6f8f730 Steven Rostedt 2025-08-05 580 /*
6bc850d6f8f730 Steven Rostedt 2025-08-05 581 * If this task is preempted by another user space task, it
6bc850d6f8f730 Steven Rostedt 2025-08-05 582 * will cause this task to try again. But just in case something
6bc850d6f8f730 Steven Rostedt 2025-08-05 583 * changes where the copying from user space causes another task
6bc850d6f8f730 Steven Rostedt 2025-08-05 584 * to run, prevent this from going into an infinite loop.
6bc850d6f8f730 Steven Rostedt 2025-08-05 585 * 10 tries should be plenty.
6bc850d6f8f730 Steven Rostedt 2025-08-05 586 */
6bc850d6f8f730 Steven Rostedt 2025-08-05 587 if (trys++ > 10) {
6bc850d6f8f730 Steven Rostedt 2025-08-05 588 static bool once;
6bc850d6f8f730 Steven Rostedt 2025-08-05 589 /*
6bc850d6f8f730 Steven Rostedt 2025-08-05 590 * Only print a message instead of a WARN_ON() as this could
6bc850d6f8f730 Steven Rostedt 2025-08-05 591 * theoretically trigger under real load.
6bc850d6f8f730 Steven Rostedt 2025-08-05 592 */
6bc850d6f8f730 Steven Rostedt 2025-08-05 593 if (!once)
6bc850d6f8f730 Steven Rostedt 2025-08-05 594 pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
6bc850d6f8f730 Steven Rostedt 2025-08-05 595 once = true;
623bd9e046f95c Steven Rostedt 2025-08-05 596 return buffer;
6bc850d6f8f730 Steven Rostedt 2025-08-05 597 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 598
6bc850d6f8f730 Steven Rostedt 2025-08-05 599 /* Read the current sched switch count */
6bc850d6f8f730 Steven Rostedt 2025-08-05 600 cnt = this_cpu_read(sched_switch_cnt);
6bc850d6f8f730 Steven Rostedt 2025-08-05 601
6bc850d6f8f730 Steven Rostedt 2025-08-05 602 /*
6bc850d6f8f730 Steven Rostedt 2025-08-05 603 * Preemption is going to be enabled, but this task must
6bc850d6f8f730 Steven Rostedt 2025-08-05 604 * remain on this CPU.
6bc850d6f8f730 Steven Rostedt 2025-08-05 605 */
6bc850d6f8f730 Steven Rostedt 2025-08-05 606 migrate_disable();
6bc850d6f8f730 Steven Rostedt 2025-08-05 607
6bc850d6f8f730 Steven Rostedt 2025-08-05 608 /*
6bc850d6f8f730 Steven Rostedt 2025-08-05 609 * Now preemption is being enabed and another task can come in
6bc850d6f8f730 Steven Rostedt 2025-08-05 610 * and use the same buffer and corrupt our data.
6bc850d6f8f730 Steven Rostedt 2025-08-05 611 */
6bc850d6f8f730 Steven Rostedt 2025-08-05 612 preempt_enable_notrace();
6bc850d6f8f730 Steven Rostedt 2025-08-05 613
623bd9e046f95c Steven Rostedt 2025-08-05 614 buf = buffer;
623bd9e046f95c Steven Rostedt 2025-08-05 615
623bd9e046f95c Steven Rostedt 2025-08-05 616 for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
623bd9e046f95c Steven Rostedt 2025-08-05 617 char *ptr = ptr_array[i];
623bd9e046f95c Steven Rostedt 2025-08-05 618
b979d33ec48bbd Steven Rostedt 2025-08-05 619 if (array) {
b979d33ec48bbd Steven Rostedt 2025-08-05 @620 ret = __copy_from_user(buf, ptr, size);
b979d33ec48bbd Steven Rostedt 2025-08-05 621 ret = ret ? -1 : size;
b979d33ec48bbd Steven Rostedt 2025-08-05 622 } else {
6bc850d6f8f730 Steven Rostedt 2025-08-05 @623 ret = strncpy_from_user(buf, ptr, size);
b979d33ec48bbd Steven Rostedt 2025-08-05 624 }
623bd9e046f95c Steven Rostedt 2025-08-05 625 read[i] = ret;
623bd9e046f95c Steven Rostedt 2025-08-05 626 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 627
6bc850d6f8f730 Steven Rostedt 2025-08-05 628 preempt_disable_notrace();
6bc850d6f8f730 Steven Rostedt 2025-08-05 629 migrate_enable();
6bc850d6f8f730 Steven Rostedt 2025-08-05 630
6bc850d6f8f730 Steven Rostedt 2025-08-05 631 /*
6bc850d6f8f730 Steven Rostedt 2025-08-05 632 * Preemption is disabled again, now check the sched_switch_cnt.
6bc850d6f8f730 Steven Rostedt 2025-08-05 633 * If it increased by two or more, then another user space process
6bc850d6f8f730 Steven Rostedt 2025-08-05 634 * may have schedule in and corrupted our buffer. In that case
6bc850d6f8f730 Steven Rostedt 2025-08-05 635 * the copying must be retried.
6bc850d6f8f730 Steven Rostedt 2025-08-05 636 *
6bc850d6f8f730 Steven Rostedt 2025-08-05 637 * Note, if this task was scheduled out and only kernel threads
6bc850d6f8f730 Steven Rostedt 2025-08-05 638 * were scheduled in (maybe to process the fault), then the
6bc850d6f8f730 Steven Rostedt 2025-08-05 639 * counter would increment again when this task scheduled in.
6bc850d6f8f730 Steven Rostedt 2025-08-05 640 * If this task scheduled out and another user task scheduled
6bc850d6f8f730 Steven Rostedt 2025-08-05 641 * in, this task would still need to be scheduled back in and
6bc850d6f8f730 Steven Rostedt 2025-08-05 642 * the counter would increment by at least two.
6bc850d6f8f730 Steven Rostedt 2025-08-05 643 */
6bc850d6f8f730 Steven Rostedt 2025-08-05 644 if (this_cpu_read(sched_switch_cnt) > cnt + 1)
6bc850d6f8f730 Steven Rostedt 2025-08-05 645 goto again;
6bc850d6f8f730 Steven Rostedt 2025-08-05 646
623bd9e046f95c Steven Rostedt 2025-08-05 647 buf = buffer;
623bd9e046f95c Steven Rostedt 2025-08-05 648 for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
623bd9e046f95c Steven Rostedt 2025-08-05 649
623bd9e046f95c Steven Rostedt 2025-08-05 650 ret = read[i];
623bd9e046f95c Steven Rostedt 2025-08-05 651 if (ret < 0)
623bd9e046f95c Steven Rostedt 2025-08-05 652 continue;
623bd9e046f95c Steven Rostedt 2025-08-05 653 buf[ret] = '\0';
623bd9e046f95c Steven Rostedt 2025-08-05 654
b979d33ec48bbd Steven Rostedt 2025-08-05 655 /* For strings, replace any non-printable characters with '.' */
b979d33ec48bbd Steven Rostedt 2025-08-05 656 if (!array) {
623bd9e046f95c Steven Rostedt 2025-08-05 657 for (int x = 0; x < ret; x++) {
623bd9e046f95c Steven Rostedt 2025-08-05 658 if (!isprint(buf[x]))
623bd9e046f95c Steven Rostedt 2025-08-05 659 buf[x] = '.';
6bc850d6f8f730 Steven Rostedt 2025-08-05 660 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 661
6bc850d6f8f730 Steven Rostedt 2025-08-05 662 /*
623bd9e046f95c Steven Rostedt 2025-08-05 663 * If the text was truncated due to our max limit,
623bd9e046f95c Steven Rostedt 2025-08-05 664 * add "..." to the string.
6bc850d6f8f730 Steven Rostedt 2025-08-05 665 */
623bd9e046f95c Steven Rostedt 2025-08-05 666 if (ret > SYSCALL_FAULT_USER_MAX) {
623bd9e046f95c Steven Rostedt 2025-08-05 667 strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
623bd9e046f95c Steven Rostedt 2025-08-05 668 sizeof(EXTRA));
623bd9e046f95c Steven Rostedt 2025-08-05 669 ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
6bc850d6f8f730 Steven Rostedt 2025-08-05 670 } else {
6bc850d6f8f730 Steven Rostedt 2025-08-05 671 buf[ret++] = '\0';
6bc850d6f8f730 Steven Rostedt 2025-08-05 672 }
623bd9e046f95c Steven Rostedt 2025-08-05 673 } else {
623bd9e046f95c Steven Rostedt 2025-08-05 674 ret = min(ret, SYSCALL_FAULT_USER_MAX);
623bd9e046f95c Steven Rostedt 2025-08-05 675 }
623bd9e046f95c Steven Rostedt 2025-08-05 676 data_size[i] = ret;
b979d33ec48bbd Steven Rostedt 2025-08-05 677 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 678
623bd9e046f95c Steven Rostedt 2025-08-05 679 return buffer;
6bc850d6f8f730 Steven Rostedt 2025-08-05 680 }
6bc850d6f8f730 Steven Rostedt 2025-08-05 681
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/7] tracing: Have syscall trace events read user space string
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
2025-08-06 14:39 ` kernel test robot
@ 2025-08-11 5:19 ` kernel test robot
1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-08-11 5:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: oe-lkp, lkp, linux-kernel, linux-trace-kernel, Masami Hiramatsu,
Mark Rutland, Mathieu Desnoyers, Andrew Morton, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Tom Zanussi, Thomas Gleixner,
Ian Rogers, aahringo, Douglas Raillard, oliver.sang
Hello,
kernel test robot noticed "BUG:KASAN:slab-out-of-bounds_in_syscall_fault_buffer_enable" on:
commit: 6bc850d6f8f7308a184edfd60ee1acdd89ced128 ("[PATCH 3/7] tracing: Have syscall trace events read user space string")
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250806-122312
base: https://git.kernel.org/cgit/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/all/20250805193235.080757106@kernel.org/
patch subject: [PATCH 3/7] tracing: Have syscall trace events read user space string
in testcase: boot
config: x86_64-rhel-9.4-kunit
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+---------------------------------------------------------------+------------+------------+
| | 63f89ba6a0 | 6bc850d6f8 |
+---------------------------------------------------------------+------------+------------+
| BUG:KASAN:slab-out-of-bounds_in_syscall_fault_buffer_enable | 0 | 24 |
+---------------------------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508110616.33657b6c-lkp@intel.com
[ 47.226292][ T1] BUG: KASAN: slab-out-of-bounds in syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.227603][ T1] Write of size 8 at addr ffff8881baea5f10 by task swapper/0/1
[ 47.228735][ T1]
[ 47.229107][ T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc7-00138-g6bc850d6f8f7 #1 PREEMPT(voluntary)
[ 47.229114][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 47.229117][ T1] Call Trace:
[ 47.229121][ T1] <TASK>
[ 47.229124][ T1] dump_stack_lvl (lib/dump_stack.c:123 (discriminator 1))
[ 47.229134][ T1] print_address_description+0x2c/0x380
[ 47.229142][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.229146][ T1] print_report (mm/kasan/report.c:481)
[ 47.229150][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.229153][ T1] ? kasan_addr_to_slab (mm/kasan/common.c:37)
[ 47.229156][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.229159][ T1] kasan_report (mm/kasan/report.c:595)
[ 47.229164][ T1] ? syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.229167][ T1] syscall_fault_buffer_enable (kernel/trace/trace_syscalls.c:430)
[ 47.229171][ T1] ? mutex_unlock (arch/x86/include/asm/atomic64_64.h:101 include/linux/atomic/atomic-arch-fallback.h:4329 include/linux/atomic/atomic-long.h:1506 include/linux/atomic/atomic-instrumented.h:4481 kernel/locking/mutex.c:167 kernel/locking/mutex.c:537)
[ 47.229177][ T1] syscall_enter_register (kernel/trace/trace_syscalls.c:729 kernel/trace/trace_syscalls.c:1259)
[ 47.229181][ T1] __ftrace_event_enable_disable (kernel/trace/trace_events.c:860)
[ 47.229186][ T1] ? __pfx__printk (kernel/printk/printk.c:2470)
[ 47.229192][ T1] __ftrace_set_clr_event_nolock (kernel/trace/trace_events.c:890 kernel/trace/trace_events.c:1353)
[ 47.229197][ T1] event_trace_self_tests (kernel/trace/trace_events.c:1384 (discriminator 1) kernel/trace/trace_events.c:4779 (discriminator 1))
[ 47.229203][ T1] ? __pfx_event_trace_self_tests_init (kernel/trace/trace_events.c:4892)
[ 47.229208][ T1] event_trace_self_tests_init (include/linux/list.h:373 kernel/trace/trace.h:487 kernel/trace/trace_events.c:4871 kernel/trace/trace_events.c:4894)
[ 47.229212][ T1] do_one_initcall (init/main.c:1274)
[ 47.229216][ T1] ? __pfx_do_one_initcall (init/main.c:1265)
[ 47.229219][ T1] ? __pfx_parse_args (kernel/params.c:168)
[ 47.229223][ T1] ? __kasan_kmalloc (include/linux/kfence.h:58 mm/kasan/common.c:390)
[ 47.229227][ T1] ? do_initcalls (include/linux/slab.h:909 include/linux/slab.h:1039 init/main.c:1345)
[ 47.229232][ T1] do_initcalls (init/main.c:1335 init/main.c:1352)
[ 47.229236][ T1] kernel_init_freeable (init/main.c:1586)
[ 47.229241][ T1] ? __pfx_kernel_init (init/main.c:1466)
[ 47.229247][ T1] kernel_init (init/main.c:1476)
[ 47.229251][ T1] ? calculate_sigpending (kernel/signal.c:194)
[ 47.229256][ T1] ? __pfx_kernel_init (init/main.c:1466)
[ 47.229259][ T1] ret_from_fork (arch/x86/kernel/process.c:154)
[ 47.229265][ T1] ? __pfx_kernel_init (init/main.c:1466)
[ 47.229269][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 47.229275][ T1] </TASK>
[ 47.229276][ T1]
[ 47.262431][ T1] Allocated by task 1:
[ 47.263075][ T1] kasan_save_stack (mm/kasan/common.c:48)
[ 47.263810][ T1] kasan_save_track (arch/x86/include/asm/current.h:25 mm/kasan/common.c:60 mm/kasan/common.c:69)
[ 47.264530][ T1] __kasan_kmalloc (mm/kasan/common.c:377 mm/kasan/common.c:394)
[ 47.265210][ T1] syscall_fault_buffer_enable (include/linux/slab.h:905 kernel/trace/trace_syscalls.c:426)
[ 47.266034][ T1] syscall_enter_register (kernel/trace/trace_syscalls.c:729 kernel/trace/trace_syscalls.c:1259)
[ 47.266825][ T1] __ftrace_event_enable_disable (kernel/trace/trace_events.c:860)
[ 47.267743][ T1] __ftrace_set_clr_event_nolock (kernel/trace/trace_events.c:890 kernel/trace/trace_events.c:1353)
[ 47.268609][ T1] event_trace_self_tests (kernel/trace/trace_events.c:1384 (discriminator 1) kernel/trace/trace_events.c:4779 (discriminator 1))
[ 47.269392][ T1] event_trace_self_tests_init (include/linux/list.h:373 kernel/trace/trace.h:487 kernel/trace/trace_events.c:4871 kernel/trace/trace_events.c:4894)
[ 47.270229][ T1] do_one_initcall (init/main.c:1274)
[ 47.270971][ T1] do_initcalls (init/main.c:1335 init/main.c:1352)
[ 47.271669][ T1] kernel_init_freeable (init/main.c:1586)
[ 47.272429][ T1] kernel_init (init/main.c:1476)
[ 47.273127][ T1] ret_from_fork (arch/x86/kernel/process.c:154)
[ 47.273817][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 47.274547][ T1]
[ 47.274977][ T1] The buggy address belongs to the object at ffff8881baea5f00
[ 47.274977][ T1] which belongs to the cache kmalloc-8 of size 8
[ 47.277002][ T1] The buggy address is located 8 bytes to the right of
[ 47.277002][ T1] allocated 8-byte region [ffff8881baea5f00, ffff8881baea5f08)
[ 47.279110][ T1]
[ 47.279551][ T1] The buggy address belongs to the physical page:
[ 47.280447][ T1] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1baea5
[ 47.281773][ T1] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 47.282819][ T1] page_type: f5(slab)
[ 47.283449][ T1] raw: 0017ffffc0000000 ffff888100041500 dead000000000122 0000000000000000
[ 47.284709][ T1] raw: 0000000000000000 0000000080800080 00000000f5000000 0000000000000000
[ 47.286003][ T1] page dumped because: kasan: bad access detected
[ 47.286854][ T1] page_owner tracks the page as allocated
[ 47.287710][ T1] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 1, tgid 1 (swapper/0), ts 35802055637, free_ts 0
[ 47.290101][ T1] post_alloc_hook (include/linux/page_owner.h:32 mm/page_alloc.c:1704)
[ 47.290799][ T1] get_page_from_freelist (mm/page_alloc.c:1714 mm/page_alloc.c:3669)
[ 47.291594][ T1] __alloc_frozen_pages_noprof (mm/page_alloc.c:4959)
[ 47.292391][ T1] alloc_pages_mpol (mm/mempolicy.c:2421)
[ 47.293117][ T1] allocate_slab (mm/slub.c:2451 mm/slub.c:2619)
[ 47.293768][ T1] ___slab_alloc (mm/slub.c:3859 (discriminator 3))
[ 47.294447][ T1] __kmalloc_node_track_caller_noprof (mm/slub.c:3949 mm/slub.c:4024 mm/slub.c:4185 mm/slub.c:4327 mm/slub.c:4347)
[ 47.295322][ T1] kstrdup (mm/util.c:63 mm/util.c:83)
[ 47.295898][ T1] __kernfs_new_node (fs/kernfs/dir.c:634)
[ 47.296626][ T1] kernfs_new_node (fs/kernfs/dir.c:713)
[ 47.297380][ T1] kernfs_create_dir_ns (fs/kernfs/dir.c:1085)
[ 47.298159][ T1] sysfs_create_dir_ns (fs/sysfs/dir.c:61)
[ 47.298933][ T1] kobject_add_internal (lib/kobject.c:73 lib/kobject.c:240)
[ 47.299712][ T1] kobject_init_and_add (lib/kobject.c:374 lib/kobject.c:457)
[ 47.300448][ T1] net_rx_queue_update_kobjects (net/core/net-sysfs.c:1239 net/core/net-sysfs.c:1301)
[ 47.301292][ T1] netdev_register_kobject (net/core/net-sysfs.c:2093 net/core/net-sysfs.c:2340)
[ 47.302015][ T1] page_owner free stack trace missing
[ 47.302748][ T1]
[ 47.303126][ T1] Memory state around the buggy address:
[ 47.303914][ T1] ffff8881baea5e00: 06 fc fc fc 06 fc fc fc 04 fc fc fc 06 fc fc fc
[ 47.305085][ T1] ffff8881baea5e80: 05 fc fc fc 05 fc fc fc 06 fc fc fc fc fc fc fc
[ 47.306239][ T1] >ffff8881baea5f00: 00 fc fc fc 07 fc fc fc 00 fc fc fc fa fc fc fc
[ 47.307378][ T1] ^
[ 47.308014][ T1] ffff8881baea5f80: fa fc fc fc fc fc fc fc 06 fc fc fc 06 fc fc fc
[ 47.309220][ T1] ffff8881baea6000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 47.310929][ T1] ==================================================================
[ 47.312186][ T1] Disabling lock debugging due to kernel taint
[ 47.318329][ T1] OK
[ 47.318896][ T1] Testing event system hyperv: OK
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250811/202508110616.33657b6c-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10
2025-08-06 16:42 ` Douglas Raillard
@ 2025-08-12 19:22 ` Steven Rostedt
0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-08-12 19:22 UTC (permalink / raw)
To: Douglas Raillard
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, aahringo
On Wed, 6 Aug 2025 17:42:38 +0100
Douglas Raillard <douglas.raillard@arm.com> wrote:
>
> AFAIK to this day, there is no tool providing a simple script/SQL interface
> to a trace.dat file beyond basic filtering like trace-cmd report. I've had this
Well, I did have this patch that I never applied:
https://lore.kernel.org/all/20200116104804.5d2f71e2@gandalf.local.home/
-- Steve
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-08-12 19:22 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-05 19:26 [PATCH 0/7] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-08-05 19:26 ` [PATCH 1/7] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-08-06 18:39 ` Paul E. McKenney
2025-08-06 22:05 ` kernel test robot
2025-08-05 19:26 ` [PATCH 2/7] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
2025-08-06 10:24 ` Douglas Raillard
2025-08-06 12:39 ` Steven Rostedt
2025-08-06 16:42 ` Douglas Raillard
2025-08-12 19:22 ` Steven Rostedt
2025-08-05 19:26 ` [PATCH 3/7] tracing: Have syscall trace events read user space string Steven Rostedt
2025-08-06 14:39 ` kernel test robot
2025-08-11 5:19 ` kernel test robot
2025-08-05 19:26 ` [PATCH 4/7] tracing: Have system call events record user array data Steven Rostedt
2025-08-05 19:26 ` [PATCH 5/7] tracing: Display some syscall arrays as strings Steven Rostedt
2025-08-06 15:12 ` kernel test robot
2025-08-05 19:26 ` [PATCH 6/7] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
2025-08-06 23:52 ` kernel test robot
2025-08-05 19:26 ` [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
2025-08-06 10:50 ` Douglas Raillard
2025-08-06 12:43 ` Steven Rostedt
2025-08-06 16:21 ` Douglas Raillard
2025-08-06 14:50 ` kernel test robot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).