* [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields
@ 2025-09-23 13:04 Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
` (7 more replies)
0 siblings, 8 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:04 UTC
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.
Introduce a way to read from user space addresses from the syscall trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.
When a syscall trace event needs to read user memory, it reads the per CPU
sched switch counter. It then disables migration, enables preemption,
copies the user space memory into this buffer, then disables preemption again.
If the counter is the same as the original value the buffer is valid.
Otherwise it needs to try again. This is similar to how seqcount works, but
uses the per CPU sched switch counter as its sequence counter. If the counter
is not the same, it means another task scheduled in, and that task could have
used the same buffer and overwritten the data.
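
The retry scheme described above can be sketched in user space as follows. This is only an illustration under stated assumptions: switch_count, cpu_buf, simulate_preemption() and read_with_retry() are hypothetical stand-ins for one CPU's context switch counter, its temporary buffer, and the copy from user space. Unlike the kernel code, the sketch returns NULL when it gives up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SZ    128
#define MAX_TRIES 10

/* Hypothetical stand-ins for the state of a single CPU */
static unsigned int switch_count;	/* bumped on every simulated context switch */
static char cpu_buf[BUF_SZ];		/* temporary buffer shared by all tasks on the CPU */

/* Simulate another task being scheduled in and reusing the buffer */
static void simulate_preemption(void)
{
	switch_count++;
	strcpy(cpu_buf, "other task data");
}

/*
 * Copy src into the shared buffer. The first "preemptions" attempts are
 * each disturbed by another task. Returns the buffer once a copy completes
 * with an unchanged counter, or NULL if MAX_TRIES attempts all failed.
 */
static const char *read_with_retry(const char *src, int preemptions)
{
	assert(src != NULL);

	for (int tries = 0; tries < MAX_TRIES; tries++) {
		/* Sample the sequence counter before the copy */
		unsigned int cnt = switch_count;

		/* In the kernel, preemption would be enabled around this copy */
		strncpy(cpu_buf, src, BUF_SZ - 1);
		cpu_buf[BUF_SZ - 1] = '\0';
		if (preemptions > 0) {
			preemptions--;
			simulate_preemption();
		}

		/* Counter unchanged: nobody else could have touched the buffer */
		if (cnt == switch_count)
			return cpu_buf;
	}
	return NULL;
}
```

Calling read_with_retry("/tmp/x", 2) fails its first two attempts and succeeds on the third, which is the behaviour the counter comparison is meant to provide.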
A new file is created in the tracefs directory (and also per instance) that
allows the user to shorten the amount copied from user space. It can be
completely disabled if set to zero (it will only display "" or (, ...)
but no copying from user space will be performed). The max size to copy is
hard coded to 128, which should be enough for this purpose.
This allows the output to look like this:
sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
sys_write(fd: 1, buf: 0x56430f353be0 (2f:72:6f:6f:74:0a) "/root.", count: 6)
sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)
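
The sys_write() line above shows a buffer both as hex bytes and as text with non-printable characters replaced by '.'. A minimal user-space sketch of that formatting, assuming a hypothetical helper format_user_bytes() that is not part of this series, could look like this:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format len bytes as (hh:hh:...) "text" into out,
 * replacing non-printable bytes with '.' in the text part.
 * Returns the number of characters written, or -1 if out is too small.
 */
static int format_user_bytes(char *out, size_t outsz,
			     const unsigned char *data, size_t len)
{
	size_t pos = 0;

	/* Each byte needs up to 3 hex chars plus 1 text char, plus framing */
	if (outsz < 4 * len + 6)
		return -1;

	out[pos++] = '(';
	for (size_t i = 0; i < len; i++)
		pos += sprintf(out + pos, "%s%02x", i ? ":" : "", data[i]);
	pos += sprintf(out + pos, ") \"");
	for (size_t i = 0; i < len; i++)
		out[pos++] = isprint(data[i]) ? (char)data[i] : '.';
	pos += sprintf(out + pos, "\"");

	return (int)pos;
}
```

Passing the six bytes of "/root\n" produces the same text as the buf field in the sys_write() example.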
Changes since v1: https://lore.kernel.org/linux-trace-kernel/20250805192646.328291790@kernel.org/
- Removed __rcu annotation to the fields that do not need RCU to protect
them.
- Hide newfstatat around
#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
as parisc failed to build without it. (kernel test robot)
- Fixed allocation of sinfo which used sizeof(sinfo) and not
sizeof(*sinfo) (kernel test robot)
- Instead of incrementing a counter via the sched_switch tracepoint, use
the nr_context_switches() API. (Mathieu Desnoyers).
- Use the length saved in the meta data of the event to limit the size of
the string printed "%.*s", len, str.
- Add comment describing that the method to read the memory from user
space is similar to how seqcount works.
- Hide kexec_file_load around
#if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG != 32
to not break the i386 build.
- Added __user annotation to variable copying from user (kernel test robot)
- Change default to 63 (127 seemed too much)
- Change the max to 165 to fill in the extra data.
- Use the size macros of the max size and max args to calculate the size
of the buffer to save the values in.
- Added new patch to show printable characters of binary arrays that are
displayed.
Steven Rostedt (8):
tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
tracing: Have syscall trace events show "0x" for values greater than 10
tracing: Have syscall trace events read user space string
tracing: Have system call events record user array data
tracing: Display some syscall arrays as strings
tracing: Allow syscall trace events to read more than one user parameter
tracing: Add syscall_user_buf_size to limit amount written
tracing: Show printable characters in syscall arrays
----
Documentation/trace/ftrace.rst | 8 +
include/trace/syscall.h | 8 +-
kernel/trace/Kconfig | 13 +
kernel/trace/trace.c | 52 +++
kernel/trace/trace.h | 7 +-
kernel/trace/trace_syscalls.c | 700 +++++++++++++++++++++++++++++++++++++++--
6 files changed, 756 insertions(+), 32 deletions(-)
* [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
@ 2025-09-23 13:04 ` Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 2/8] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
` (6 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:04 UTC
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard, Paul E. McKenney
From: Steven Rostedt <rostedt@goodmis.org>
The syscall events are pseudo events that hook to the raw syscalls. The
ftrace_syscall_enter/exit() callback is called by the raw_syscall
enter/exit tracepoints respectively whenever any of the syscall events are
enabled.
The trace_array has an array of syscall "files" that correspond to the
system calls based on their __NR_SYSCALL number. The array is read and if
there's a pointer to a trace_event_file then it is considered enabled and
if it is NULL that syscall event is considered disabled.
Currently it uses rcu_dereference_sched() to get this pointer and
rcu_assign_pointer() or RCU_INIT_POINTER() to write to it. This is
unnecessary as the file pointer will not go away outside the
synchronization of the tracepoint logic itself. And this code adds no
extra RCU synchronization that uses this.
Replace these functions with a simple READ_ONCE() and WRITE_ONCE() which
is all they need. This will also allow this code to not depend on
preemption being disabled as system call tracepoints are now allowed to
fault.
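
As a rough user-space analogy of the lookup described above, the table of per-syscall file pointers can be modelled with relaxed C11 atomics, which here stand in for READ_ONCE() and WRITE_ONCE(). The names NR_SYSCALLS, struct event_file, event_enable(), event_disable() and event_lookup() are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SYSCALLS 8	/* hypothetical, tiny table for illustration */

struct event_file {
	const char *name;
};

/* One slot per syscall number; NULL means the event is disabled */
static struct event_file *_Atomic enter_files[NR_SYSCALLS];

/* Writer side: publish the file pointer (analogous to WRITE_ONCE()) */
static void event_enable(int nr, struct event_file *file)
{
	assert(nr >= 0 && nr < NR_SYSCALLS);
	atomic_store_explicit(&enter_files[nr], file, memory_order_relaxed);
}

/* Writer side: clear the slot so the event reads as disabled */
static void event_disable(int nr)
{
	assert(nr >= 0 && nr < NR_SYSCALLS);
	atomic_store_explicit(&enter_files[nr], NULL, memory_order_relaxed);
}

/* Reader side: a single untorn load (analogous to READ_ONCE()) */
static struct event_file *event_lookup(int nr)
{
	if (nr < 0 || nr >= NR_SYSCALLS)
		return NULL;
	return atomic_load_explicit(&enter_files[nr], memory_order_relaxed);
}
```

The sketch says nothing about the lifetime of the pointed-to file. As the text above states, that is guaranteed by the tracepoint synchronization, not by this load and store.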
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250805193234.745705874@kernel.org
- Removed __rcu annotation to the fields that do not need RCU to protect
them.
kernel/trace/trace.h | 4 ++--
kernel/trace/trace_syscalls.c | 14 ++++++--------
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5f4bed5842f9..85eabb454bee 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -380,8 +380,8 @@ struct trace_array {
#ifdef CONFIG_FTRACE_SYSCALLS
int sys_refcount_enter;
int sys_refcount_exit;
- struct trace_event_file __rcu *enter_syscall_files[NR_syscalls];
- struct trace_event_file __rcu *exit_syscall_files[NR_syscalls];
+ struct trace_event_file *enter_syscall_files[NR_syscalls];
+ struct trace_event_file *exit_syscall_files[NR_syscalls];
#endif
int stop_count;
int clock_id;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 46aab0ab9350..3a0b65f89130 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -310,8 +310,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
- trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->enter_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -356,8 +355,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
- /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
- trace_file = rcu_dereference_sched(tr->exit_syscall_files[syscall_nr]);
+ trace_file = READ_ONCE(tr->exit_syscall_files[syscall_nr]);
if (!trace_file)
return;
@@ -393,7 +391,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
if (!tr->sys_refcount_enter)
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
- rcu_assign_pointer(tr->enter_syscall_files[num], file);
+ WRITE_ONCE(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
}
mutex_unlock(&syscall_trace_lock);
@@ -411,7 +409,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_enter--;
- RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
+ WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
mutex_unlock(&syscall_trace_lock);
@@ -431,7 +429,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
if (!tr->sys_refcount_exit)
ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
- rcu_assign_pointer(tr->exit_syscall_files[num], file);
+ WRITE_ONCE(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
}
mutex_unlock(&syscall_trace_lock);
@@ -449,7 +447,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
return;
mutex_lock(&syscall_trace_lock);
tr->sys_refcount_exit--;
- RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
+ WRITE_ONCE(tr->exit_syscall_files[num], NULL);
if (!tr->sys_refcount_exit)
unregister_trace_sys_exit(ftrace_syscall_exit, tr);
mutex_unlock(&syscall_trace_lock);
--
2.50.1
* [PATCH v2 2/8] tracing: Have syscall trace events show "0x" for values greater than 10
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
@ 2025-09-23 13:04 ` Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 3/8] tracing: Have syscall trace events read user space string Steven Rostedt
` (5 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:04 UTC
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Currently the syscall trace events show each value as hexadecimal, but
without adding "0x" it can be confusing:
sys_write(fd: 4, buf: 0x55c4a1fa9270, count: 44)
Looks like the above write wrote 44 bytes, when in reality it wrote 68
bytes.
Add a "0x" for all values greater than or equal to 10 to remove the ambiguity.
For values less than 10, leave off the "0x" as that just adds noise to the
output.
Also change the loop to print the ", " delimiter at the start of each
iteration when "i" is nonzero, instead of folding that logic into the
trace_seq_printf() at the end.
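
The formatting rule can be sketched in user space as below. format_args() is a hypothetical helper that mirrors the loop in print_syscall_enter() using snprintf() in place of trace_seq_printf(); it is not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "name: value" pairs separated by ", ".
 * Values below 10 are printed in decimal, everything else as 0x-prefixed
 * hexadecimal. Returns the length written, or -1 if out is too small.
 */
static int format_args(char *out, size_t outsz, const char *const *names,
		       const unsigned long *vals, int nb_args)
{
	size_t pos = 0;

	if (outsz == 0)
		return -1;
	out[0] = '\0';

	for (int i = 0; i < nb_args; i++) {
		/* The delimiter goes in front of every argument but the first */
		const char *sep = i ? ", " : "";
		int n;

		if (vals[i] < 10)
			n = snprintf(out + pos, outsz - pos, "%s%s: %lu",
				     sep, names[i], vals[i]);
		else
			n = snprintf(out + pos, outsz - pos, "%s%s: 0x%lx",
				     sep, names[i], vals[i]);

		if (n < 0 || (size_t)n >= outsz - pos)
			return -1;
		pos += n;
	}
	return (int)pos;
}
```

With a count of 68 the output reads "count: 0x44", so the value can no longer be mistaken for decimal 44.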
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 3a0b65f89130..0f932b22f9ec 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -153,14 +153,20 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
if (trace_seq_has_overflowed(s))
goto end;
+ if (i)
+ trace_seq_puts(s, ", ");
+
/* parameter types */
if (tr && tr->trace_flags & TRACE_ITER_VERBOSE)
trace_seq_printf(s, "%s ", entry->types[i]);
/* parameter values */
- trace_seq_printf(s, "%s: %lx%s", entry->args[i],
- trace->args[i],
- i == entry->nb_args - 1 ? "" : ", ");
+ if (trace->args[i] < 10)
+ trace_seq_printf(s, "%s: %lu", entry->args[i],
+ trace->args[i]);
+ else
+ trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
+ trace->args[i]);
}
trace_seq_putc(s, ')');
--
2.50.1
* [PATCH v2 3/8] tracing: Have syscall trace events read user space string
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 2/8] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
2025-09-25 7:26 ` Peter Zijlstra
2025-09-23 13:05 ` [PATCH v2 4/8] tracing: Have system call events record user array data Steven Rostedt
` (4 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
system call trace events allow faulting in user space memory. Have some of
the system call trace events take advantage of this.
Introduce a way to read strings that are nul terminated into the trace
event. The way this is accomplished is by creating a per CPU temporary
buffer that is used to read unsafe user memory.
When a syscall trace event needs to read user memory, it reads the per CPU
schedule switch counter. It then disables migration and enables
preemption, copies the user space memory into this buffer, then disables
preemption again. It reads the per CPU schedule switch counter again and
if it matches, the buffer is considered valid. Otherwise it needs to try
again. This is similar to how seqcount works, but uses the per CPU context
switch counter as the sequence counter.
It uses the sched switch counter and not just a per CPU counter that each
reader increments, because the latter wouldn't catch the case of:
[task 1]
cnt = this_cpu_inc(counter);
preempt_enable()
<sched switch to task 2>
[task 2]
cnt = this_cpu_inc(counter);
preempt_enable();
buffer = task 2 data
<sched switch to task 1>
[task 1]
buffer = task 1 data
<sched switch to task 2>
[task 2]
preempt_disable();
if (cnt == this_cpu_read(counter))
Will return true even though the buffer was corrupted.
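
The interleaving above can be replayed deterministically in user space. In this sketch plain_counter, switch_counter, sched_switch() and replay() are hypothetical names; the code only models the ordering of events and involves no real scheduling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static unsigned int plain_counter;	/* incremented by each reader on entry */
static unsigned int switch_counter;	/* incremented on every context switch */
static char buf[32];			/* the shared per CPU buffer */

static void sched_switch(void)
{
	switch_counter++;
}

/*
 * Replay the interleaving from the commit message and report, from
 * task 2's point of view, what each check concludes and whether the
 * buffer still holds task 2's data.
 */
static void replay(bool *plain_ok, bool *switch_ok, bool *intact)
{
	unsigned int t2_cnt, t2_sw;

	assert(plain_ok && switch_ok && intact);

	/* [task 1] takes its count, then gets preempted */
	++plain_counter;
	sched_switch();

	/* [task 2] takes both counts and fills the buffer */
	t2_cnt = ++plain_counter;
	t2_sw = switch_counter;
	strcpy(buf, "task 2 data");
	sched_switch();

	/* [task 1] resumes and overwrites the buffer */
	strcpy(buf, "task 1 data");
	sched_switch();

	/* [task 2] resumes and validates its read */
	*plain_ok = (t2_cnt == plain_counter);
	*switch_ok = (t2_sw == switch_counter);
	*intact = (strcmp(buf, "task 2 data") == 0);
}
```

After replay(), the plain counter check passes although the buffer has been overwritten, while the context switch counter check fails and would force a retry.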
The syscall event has its nb_args shortened from an int to a short (where
even u8 is plenty big enough) and the freed two bytes are used for
"user_mask". The new "user_mask" field is a bit mask where bit N is set if
the "args" entry at index N holds an address to read from user space. This
value is set to 0 if the system call event does not need to read user
space for a field. This mask can be used to know if the event may fault or
not. Only one bit set in user_mask is supported at this time.
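
A small sketch of how such a mask can be queried follows. struct meta is a hypothetical cut-down version of struct syscall_metadata, and first_set_bit() is a portable stand-in for the ffs(mask) - 1 expression used in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(n) (1U << (n))

/* Hypothetical cut-down version of struct syscall_metadata */
struct meta {
	short nb_args;
	short user_mask;
};

/* Index of the lowest set bit, or -1 if no bit is set */
static int first_set_bit(unsigned int mask)
{
	for (int i = 0; mask; i++, mask >>= 1) {
		if (mask & 1)
			return i;
	}
	return -1;
}

/* Index of the argument that points to user space, or -1 if none */
static int user_arg_index(const struct meta *m)
{
	assert(m != NULL);
	return first_set_bit((unsigned short)m->user_mask);
}

/* True if args[i] holds an address that will be read from user space */
static int arg_reads_user(const struct meta *m, int i)
{
	assert(m != NULL);
	return (BIT(i) & (unsigned short)m->user_mask) != 0;
}
```

For an openat-like event with user_mask = BIT(1), user_arg_index() returns 1. A mask of 0 yields -1, meaning the event never faults.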
This allows the output to look like this:
sys_access(filename: 0x7f8c55368470 "/etc/ld.so.preload", mode: 4)
sys_execve(filename: 0x564ebcf5a6b8 "/usr/bin/emacs", argv: 0x7fff357c0300, envp: 0x564ebc4a4820)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250805193235.080757106@kernel.org
- Hide newfstatat around
#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
as parisc failed to build without it. (kernel test robot)
- Fixed allocation of sinfo which used sizeof(sinfo) and not
sizeof(*sinfo) (kernel test robot)
- Instead of incrementing a counter via the sched_switch tracepoint, use
the nr_context_switches() API. (Mathieu Desnoyers).
- Use the length saved in the meta data of the event to limit the size of
the string printed "%.*s", len, str.
- Add comment describing that the method to read the memory from user
space is similar to how seqcount works.
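
The length-limited printing mentioned in the list above relies on the meta data word stored after the arguments, which per the comments in this patch holds (len << 16 | offset). A user-space sketch of packing, unpacking and printing with "%.*s" follows; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pack the string length and its offset from the start of the event */
static uint32_t data_loc_pack(unsigned int offset, unsigned int len)
{
	assert(offset <= 0xffff && len <= 0xffff);
	return (uint32_t)len << 16 | offset;
}

static unsigned int data_loc_offset(uint32_t val)
{
	return val & 0xffff;
}

static unsigned int data_loc_len(uint32_t val)
{
	return val >> 16;
}

/*
 * Print the string that val refers to inside event. "%.*s" stops after
 * len bytes even if the data is not nul terminated.
 */
static int print_user_str(char *out, size_t outsz, const char *event,
			  uint32_t val)
{
	return snprintf(out, outsz, "\"%.*s\"", (int)data_loc_len(val),
			event + data_loc_offset(val));
}
```

Bounding the print by the recorded length means trailing bytes in the event are never shown, even if the stored data lacks a terminating nul.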
include/trace/syscall.h | 4 +-
kernel/trace/trace_syscalls.c | 480 ++++++++++++++++++++++++++++++++--
2 files changed, 464 insertions(+), 20 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 8e193f3a33b3..85f21ca15a41 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_mask: mask of @args that will read user space
* @types: list of types as strings
* @args: list of args as strings (args[i] matches types[i])
* @enter_fields: list of fields for syscall_enter trace event
@@ -25,7 +26,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- int nb_args;
+ short nb_args;
+ short user_mask;
const char **types;
const char **args;
struct list_head enter_fields;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 0f932b22f9ec..7ea763c07bb7 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <trace/syscall.h>
#include <trace/events/syscalls.h>
+#include <linux/kernel_stat.h>
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -123,6 +124,9 @@ const char *get_syscall_name(int syscall)
return entry->name;
}
+/* Added to user strings when max limit is reached */
+#define EXTRA "..."
+
static enum print_line_t
print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event)
@@ -132,7 +136,9 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_entry *ent = iter->ent;
struct syscall_trace_enter *trace;
struct syscall_metadata *entry;
- int i, syscall;
+ int i, syscall, val;
+ unsigned char *ptr;
+ int len;
trace = (typeof(trace))ent;
syscall = trace->nr;
@@ -167,6 +173,19 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
else
trace_seq_printf(s, "%s: 0x%lx", entry->args[i],
trace->args[i]);
+
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+
+ /* This arg points to a user space string */
+ ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+ val = *(int *)ptr;
+
+ /* The value is a dynamic string (len << 16 | offset) */
+ ptr = (void *)ent + (val & 0xffff);
+ len = val >> 16;
+
+ trace_seq_printf(s, " \"%.*s\"", len, ptr);
}
trace_seq_putc(s, ')');
@@ -223,15 +242,27 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
- pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx%s",
- entry->args[i], sizeof(unsigned long),
- i == entry->nb_args - 1 ? "" : ", ");
+ if (i)
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", ");
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "%s: 0x%%0%zulx",
+ entry->args[i], sizeof(unsigned long));
+
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+
+ /* Add the format for the user space string */
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
}
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
for (i = 0; i < entry->nb_args; i++) {
pos += snprintf(buf + pos, LEN_OR_ZERO,
", ((unsigned long)(REC->%s))", entry->args[i]);
+ if (!(BIT(i) & entry->user_mask))
+ continue;
+ /* The user space string for arg has name __<arg>_val */
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+ entry->args[i]);
}
#undef LEN_OR_ZERO
@@ -277,8 +308,12 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
{
struct syscall_trace_enter trace;
struct syscall_metadata *meta = call->data;
+ unsigned long mask;
+ char *arg;
int offset = offsetof(typeof(trace), args);
+ int idx;
int ret = 0;
+ int len;
int i;
for (i = 0; i < meta->nb_args; i++) {
@@ -291,9 +326,232 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
offset += sizeof(unsigned long);
}
+ if (ret || !meta->user_mask)
+ return ret;
+
+ mask = meta->user_mask;
+ idx = ffs(mask) - 1;
+
+ /*
+ * User space strings are faulted into a temporary buffer and then
+ * added as a dynamic string to the end of the event.
+ * The user space string name for the arg pointer is "__<arg>_val".
+ */
+ len = strlen(meta->args[idx]) + sizeof("___val");
+ arg = kmalloc(len, GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg)) {
+ meta->user_mask = 0;
+ return -ENOMEM;
+ }
+
+ snprintf(arg, len, "__%s_val", meta->args[idx]);
+
+ ret = trace_define_field(call, "__data_loc char[]",
+ arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret)
+ kfree(arg);
return ret;
}
+struct syscall_buf {
+ char *buf;
+};
+
+struct syscall_buf_info {
+ struct rcu_head rcu;
+ struct syscall_buf __percpu *sbuf;
+};
+
+/* Create a per CPU temporary buffer to copy user space pointers into */
+#define SYSCALL_FAULT_BUF_SZ 512
+static struct syscall_buf_info *syscall_buffer;
+
+static int syscall_fault_buffer_cnt;
+
+static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
+{
+ char *buf;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ buf = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
+ kfree(buf);
+ }
+ kfree(sinfo);
+}
+
+static void rcu_free_syscall_buffer(struct rcu_head *rcu)
+{
+ struct syscall_buf_info *sinfo = container_of(rcu, struct syscall_buf_info, rcu);
+
+ syscall_fault_buffer_free(sinfo);
+}
+
+/*
+ * The per CPU buffer syscall_fault_buffer is written to optimistically.
+ * The per CPU context switch count is taken, preemption is enabled,
+ * the user space memory is copied into the syscall_fault_buffer,
+ * preemption is disabled again and the count is read again. If the count does
+ * not match its previous reading, it could mean that another user space
+ * task scheduled in and the buffer is unreliable for use.
+ */
+static int syscall_fault_buffer_enable(void)
+{
+ struct syscall_buf_info *sinfo;
+ char *buf;
+ int cpu;
+
+ lockdep_assert_held(&syscall_trace_lock);
+
+ if (syscall_fault_buffer_cnt++)
+ return 0;
+
+ sinfo = kmalloc(sizeof(*sinfo), GFP_KERNEL);
+ if (!sinfo)
+ return -ENOMEM;
+
+ sinfo->sbuf = alloc_percpu(struct syscall_buf);
+ if (!sinfo->sbuf) {
+ kfree(sinfo);
+ return -ENOMEM;
+ }
+
+ /* Clear each buffer in case of error */
+ for_each_possible_cpu(cpu) {
+ per_cpu_ptr(sinfo->sbuf, cpu)->buf = NULL;
+ }
+
+ for_each_possible_cpu(cpu) {
+ buf = kmalloc_node(SYSCALL_FAULT_BUF_SZ, GFP_KERNEL,
+ cpu_to_node(cpu));
+ if (!buf) {
+ syscall_fault_buffer_free(sinfo);
+ return -ENOMEM;
+ }
+ per_cpu_ptr(sinfo->sbuf, cpu)->buf = buf;
+ }
+
+ WRITE_ONCE(syscall_buffer, sinfo);
+ return 0;
+}
+
+static void syscall_fault_buffer_disable(void)
+{
+ struct syscall_buf_info *sinfo = syscall_buffer;
+
+ lockdep_assert_held(&syscall_trace_lock);
+
+ if (--syscall_fault_buffer_cnt)
+ return;
+
+ WRITE_ONCE(syscall_buffer, NULL);
+ call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
+}
+
+static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
+ unsigned long *args, unsigned int *data_size)
+{
+ int cpu = smp_processor_id();
+ char *buf = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
+ unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+ unsigned long mask = sys_data->user_mask;
+ unsigned int cnt;
+ int idx = ffs(mask) - 1;
+ char *ptr;
+ int trys = 0;
+ int ret;
+
+ /* Get the pointer to user space memory to read */
+ ptr = (char *)args[idx];
+ *data_size = 0;
+
+ /*
+ * This acts similar to a seqcount. The per CPU context switches are
+ * recorded, migration is disabled and preemption is enabled. The
+ * read of the user space memory is copied into the per CPU buffer.
+ * Preemption is disabled again, and if the per CPU context switches count
+ * is still the same, it means the buffer has not been corrupted.
+ * If the count is different, it is assumed the buffer is corrupted
+ * and reading must be tried again.
+ */
+ again:
+ /*
+ * If for some reason, copy_from_user() always causes a context
+ * switch, this would then cause an infinite loop.
+ * If this task is preempted by another user space task, it
+ * will cause this task to try again. But just in case something
+ * changes where the copying from user space causes another task
+ * to run, prevent this from going into an infinite loop.
+ * 10 tries should be plenty.
+ */
+ if (trys++ > 10) {
+ static bool once;
+ /*
+ * Only print a message instead of a WARN_ON() as this could
+ * theoretically trigger under real load.
+ */
+ if (!once)
+ pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
+ once = true;
+ return buf;
+ }
+
+ /* Read the current CPU context switch counter */
+ cnt = nr_context_switches_cpu(cpu);
+
+ /*
+ * Preemption is going to be enabled, but this task must
+ * remain on this CPU.
+ */
+ migrate_disable();
+
+ /*
+ * Now preemption is being enabled and another task can come in
+ * and use the same buffer and corrupt our data.
+ */
+ preempt_enable_notrace();
+
+ ret = strncpy_from_user(buf, ptr, size);
+
+ preempt_disable_notrace();
+ migrate_enable();
+
+ /* If it faulted, no use to try again */
+ if (ret < 0)
+ return buf;
+
+ /*
+ * Preemption is disabled again, now check the per CPU context
+ * switch counter. If it doesn't match, then another user space
+ * process may have scheduled in and corrupted our buffer. In that
+ * case the copying must be retried.
+ */
+ if (nr_context_switches_cpu(cpu) != cnt)
+ goto again;
+
+ /* Replace any non-printable characters with '.' */
+ for (int i = 0; i < ret; i++) {
+ if (!isprint(buf[i]))
+ buf[i] = '.';
+ }
+
+ /*
+ * If the text was truncated due to our max limit, add "..." to
+ * the string.
+ */
+ if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+ strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+ EXTRA, sizeof(EXTRA));
+ ret = SYSCALL_FAULT_BUF_SZ;
+ } else {
+ buf[ret++] = '\0';
+ }
+
+ *data_size = ret;
+ return buf;
+}
+
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
struct trace_array *tr = data;
@@ -302,15 +560,17 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
struct syscall_metadata *sys_data;
struct trace_event_buffer fbuffer;
unsigned long args[6];
+ char *user_ptr;
+ int user_size = 0;
int syscall_nr;
- int size;
+ int size = 0;
+ bool mayfault;
/*
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
might_fault();
- guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
@@ -327,7 +587,32 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (!sys_data)
return;
- size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+ /* Check if this syscall event faults in user space memory */
+ mayfault = sys_data->user_mask != 0;
+
+ guard(preempt_notrace)();
+
+ syscall_get_arguments(current, regs, args);
+
+ if (mayfault) {
+ struct syscall_buf_info *sinfo;
+
+ /* If the syscall_buffer is NULL, tracing is being shut down */
+ sinfo = READ_ONCE(syscall_buffer);
+ if (!sinfo)
+ return;
+
+ user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+ /*
+ * user_size is the amount of data to append.
+ * Need to add 4 for the meta field that points to
+ * the user memory at the end of the event and also
+ * stores its size.
+ */
+ size = 4 + user_size;
+ }
+
+ size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
@@ -335,9 +620,36 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
entry = ring_buffer_event_data(fbuffer.event);
entry->nr = syscall_nr;
- syscall_get_arguments(current, regs, args);
+
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
+ if (mayfault) {
+ void *ptr;
+ int val;
+
+ /*
+ * Set the pointer to point to the meta data of the event
+ * that has information about the stored user space memory.
+ */
+ ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
+
+ /*
+ * The meta data will store the offset of the user data from
+ * the beginning of the event.
+ */
+ val = (ptr - (void *)entry) + 4;
+
+ /* Store the offset and the size into the meta data */
+ *(int *)ptr = val | (user_size << 16);
+
+ /* Nothing to do if the user space was empty or faulted */
+ if (user_size) {
+ /* Now store the user space data into the event */
+ ptr += 4;
+ memcpy(ptr, user_ptr, user_size);
+ }
+ }
+
trace_event_buffer_commit(&fbuffer);
}
@@ -386,39 +698,50 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
static int reg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
+ struct syscall_metadata *sys_data = call->data;
struct trace_array *tr = file->tr;
int ret = 0;
int num;
- num = ((struct syscall_metadata *)call->data)->syscall_nr;
+ num = sys_data->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return -ENOSYS;
- mutex_lock(&syscall_trace_lock);
- if (!tr->sys_refcount_enter)
+ guard(mutex)(&syscall_trace_lock);
+ if (sys_data->user_mask) {
+ ret = syscall_fault_buffer_enable();
+ if (ret)
+ return ret;
+ }
+ if (!tr->sys_refcount_enter) {
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
- if (!ret) {
- WRITE_ONCE(tr->enter_syscall_files[num], file);
- tr->sys_refcount_enter++;
+ if (ret < 0) {
+ if (sys_data->user_mask)
+ syscall_fault_buffer_disable();
+ return ret;
+ }
}
- mutex_unlock(&syscall_trace_lock);
- return ret;
+ WRITE_ONCE(tr->enter_syscall_files[num], file);
+ tr->sys_refcount_enter++;
+ return 0;
}
static void unreg_event_syscall_enter(struct trace_event_file *file,
struct trace_event_call *call)
{
+ struct syscall_metadata *sys_data = call->data;
struct trace_array *tr = file->tr;
int num;
- num = ((struct syscall_metadata *)call->data)->syscall_nr;
+ num = sys_data->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return;
- mutex_lock(&syscall_trace_lock);
+ guard(mutex)(&syscall_trace_lock);
tr->sys_refcount_enter--;
WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
- mutex_unlock(&syscall_trace_lock);
+ if (sys_data->user_mask)
+ syscall_fault_buffer_disable();
}
static int reg_event_syscall_exit(struct trace_event_file *file,
@@ -459,6 +782,123 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
mutex_unlock(&syscall_trace_lock);
}
+/*
+ * For system calls that reference user space memory that can
+ * be recorded into the event, set the system call meta data's user_mask
+ * to the "args" index that points to the user space memory to retrieve.
+ */
+static void check_faultable_syscall(struct trace_event_call *call, int nr)
+{
+ struct syscall_metadata *sys_data = call->data;
+
+ /* Only work on entry */
+ if (sys_data->enter_event != call)
+ return;
+
+ switch (nr) {
+ /* user arg at position 0 */
+ case __NR_access:
+ case __NR_acct:
+ case __NR_add_key: /* Just _type. TODO add _description */
+ case __NR_chdir:
+ case __NR_chown:
+ case __NR_chmod:
+ case __NR_chroot:
+ case __NR_creat:
+ case __NR_delete_module:
+ case __NR_execve:
+ case __NR_fsopen:
+ case __NR_getxattr: /* Just pathname, TODO add name */
+ case __NR_lchown:
+ case __NR_lgetxattr: /* Just pathname, TODO add name */
+ case __NR_lremovexattr: /* Just pathname, TODO add name */
+ case __NR_link: /* Just oldname. TODO add newname */
+ case __NR_listxattr: /* Just pathname, TODO add list */
+ case __NR_llistxattr: /* Just pathname, TODO add list */
+ case __NR_lsetxattr: /* Just pathname, TODO add list */
+ case __NR_open:
+ case __NR_memfd_create:
+ case __NR_mount: /* Just dev_name, TODO add dir_name and type */
+ case __NR_mkdir:
+ case __NR_mknod:
+ case __NR_mq_open:
+ case __NR_mq_unlink:
+ case __NR_pivot_root: /* Just new_root, TODO add old_root */
+ case __NR_readlink:
+ case __NR_removexattr: /* Just pathname, TODO add name */
+ case __NR_rename: /* Just oldname. TODO add newname */
+ case __NR_request_key: /* Just _type. TODO add _description */
+ case __NR_rmdir:
+ case __NR_setxattr: /* Just pathname, TODO add list */
+ case __NR_shmdt:
+ case __NR_statfs:
+ case __NR_swapon:
+ case __NR_swapoff:
+ case __NR_symlink: /* Just oldname. TODO add newname */
+ case __NR_truncate:
+ case __NR_unlink:
+ case __NR_umount2:
+ case __NR_utime:
+ case __NR_utimes:
+ sys_data->user_mask = BIT(0);
+ break;
+ /* user arg at position 1 */
+ case __NR_execveat:
+ case __NR_faccessat:
+ case __NR_faccessat2:
+ case __NR_finit_module:
+ case __NR_fchmodat:
+ case __NR_fchmodat2:
+ case __NR_fchownat:
+ case __NR_fgetxattr:
+ case __NR_flistxattr:
+ case __NR_fsetxattr:
+ case __NR_fspick:
+ case __NR_fremovexattr:
+ case __NR_futimesat:
+ case __NR_getxattrat: /* Just pathname, TODO add name */
+ case __NR_inotify_add_watch:
+ case __NR_linkat: /* Just oldname. TODO add newname */
+ case __NR_listxattrat: /* Just pathname, TODO add list */
+ case __NR_mkdirat:
+ case __NR_mknodat:
+ case __NR_mount_setattr:
+ case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
+ case __NR_name_to_handle_at:
+#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
+ case __NR_newfstatat:
+#endif
+ case __NR_openat:
+ case __NR_openat2:
+ case __NR_open_tree:
+ case __NR_open_tree_attr:
+ case __NR_readlinkat:
+ case __NR_renameat: /* Just oldname. TODO add newname */
+ case __NR_renameat2: /* Just oldname. TODO add newname */
+ case __NR_removexattrat: /* Just pathname, TODO add name */
+ case __NR_quotactl:
+ case __NR_setxattrat: /* Just pathname, TODO add list */
+ case __NR_syslog:
+ case __NR_symlinkat: /* Just oldname. TODO add newname */
+ case __NR_statx:
+ case __NR_unlinkat:
+ case __NR_utimensat:
+ sys_data->user_mask = BIT(1);
+ break;
+ /* user arg at position 2 */
+ case __NR_init_module:
+ case __NR_fsconfig:
+ sys_data->user_mask = BIT(2);
+ break;
+ /* user arg at position 4 */
+ case __NR_fanotify_mark:
+ sys_data->user_mask = BIT(4);
+ break;
+ default:
+ sys_data->user_mask = 0;
+ }
+}
+
static int __init init_syscall_trace(struct trace_event_call *call)
{
int id;
@@ -471,6 +911,8 @@ static int __init init_syscall_trace(struct trace_event_call *call)
return -ENOSYS;
}
+ check_faultable_syscall(call, num);
+
if (set_syscall_print_fmt(call) < 0)
return -ENOMEM;
--
2.50.1
* [PATCH v2 4/8] tracing: Have system call events record user array data
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (2 preceding siblings ...)
2025-09-23 13:05 ` [PATCH v2 3/8] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 5/8] tracing: Display some syscall arrays as strings Steven Rostedt
` (3 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
For system call events that have a length field, add a "user_arg_size"
parameter to the system call meta data. It denotes the index in the args
array that holds the size of the arg that the user_mask field has a bit
set for.
The "user_mask" has a bit set that denotes the arg that points to an array
in the user space address space. If a system call event has both the
user_mask field and the user_arg_size field set, it will record the
content of that address into the trace event, up to the size defined by
SYSCALL_FAULT_BUF_SZ - 1.
This allows the output to look like:
sys_write(fd: 0xa, buf: 0x5646978d13c0 (01:00:05:00:00:00:00:00:01:87:55:89:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00), count: 0x20)
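
As a rough illustration of the output format above, the following user
space sketch renders recorded bytes the same way print_syscall_enter()
does in this patch: colon separated hex, with ", ..." appended when the
size argument is larger than what was recorded. The helper name
format_user_array() and its parameters are made up for this example and
do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): render "len" recorded
 * bytes as colon separated hex inside parentheses. If the system call
 * asked for more bytes ("count") than were recorded, append ", ...".
 * Returns the number of characters written, or -1 if "out" is too small.
 */
static int format_user_array(char *out, size_t out_sz,
			     const unsigned char *data, size_t len,
			     size_t count)
{
	/* Worst case: "(" + 3 chars per byte + ", ..." + ")" + NUL */
	size_t need = 1 + 3 * len + 5 + 1 + 1;
	size_t pos = 0;

	assert(out != NULL);
	if (out_sz < need)
		return -1;

	out[pos++] = '(';
	for (size_t x = 0; x < len; x++) {
		if (x)
			out[pos++] = ':';
		pos += snprintf(out + pos, out_sz - pos, "%02x",
				(unsigned int)data[x]);
	}
	/* Mirrors the "len < val" check against the size argument */
	if (len < count)
		pos += snprintf(out + pos, out_sz - pos, ", ...");
	out[pos++] = ')';
	out[pos] = '\0';
	return (int)pos;
}
```

Passing the recorded length and the original size argument separately
keeps the truncation decision at print time, which is how the patch
compares the dynamic array length against args[user_arg_size].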
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/trace/syscall.h | 4 +-
kernel/trace/trace_syscalls.c | 111 +++++++++++++++++++++++++---------
2 files changed, 86 insertions(+), 29 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 85f21ca15a41..9413c139da66 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_arg_size: holds @arg that has size of the user space to read
* @user_mask: mask of @args that will read user space
* @types: list of types as strings
* @args: list of args as strings (args[i] matches types[i])
@@ -26,7 +27,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- short nb_args;
+ u8 nb_args;
+ s8 user_arg_size;
short user_mask;
const char **types;
const char **args;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 7ea763c07bb7..7658b592c55f 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -124,7 +124,7 @@ const char *get_syscall_name(int syscall)
return entry->name;
}
-/* Added to user strings when max limit is reached */
+/* Added to user strings or arrays when max limit is reached */
#define EXTRA "..."
static enum print_line_t
@@ -136,9 +136,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_entry *ent = iter->ent;
struct syscall_trace_enter *trace;
struct syscall_metadata *entry;
- int i, syscall, val;
+ int i, syscall, val, len;
unsigned char *ptr;
- int len;
trace = (typeof(trace))ent;
syscall = trace->nr;
@@ -185,7 +184,23 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
ptr = (void *)ent + (val & 0xffff);
len = val >> 16;
- trace_seq_printf(s, " \"%.*s\"", len, ptr);
+ if (entry->user_arg_size < 0) {
+ trace_seq_printf(s, " \"%.*s\"", len, ptr);
+ continue;
+ }
+
+ val = trace->args[entry->user_arg_size];
+
+ trace_seq_puts(s, " (");
+ for (int x = 0; x < len; x++, ptr++) {
+ if (x)
+ trace_seq_putc(s, ':');
+ trace_seq_printf(s, "%02x", *ptr);
+ }
+ if (len < val)
+ trace_seq_printf(s, ", %s", EXTRA);
+
+ trace_seq_putc(s, ')');
}
trace_seq_putc(s, ')');
@@ -250,8 +265,11 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
if (!(BIT(i) & entry->user_mask))
continue;
- /* Add the format for the user space string */
- pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+ /* Add the format for the user space string or array */
+ if (entry->user_arg_size < 0)
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
+ else
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
}
pos += snprintf(buf + pos, LEN_OR_ZERO, "\"");
@@ -260,9 +278,14 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
", ((unsigned long)(REC->%s))", entry->args[i]);
if (!(BIT(i) & entry->user_mask))
continue;
- /* The user space string for arg has name __<arg>_val */
- pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
- entry->args[i]);
+ /* The user space data for arg has name __<arg>_val */
+ if (entry->user_arg_size < 0) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
+ entry->args[i]);
+ } else {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", __print_dynamic_array(__%s_val, 1)",
+ entry->args[i]);
+ }
}
#undef LEN_OR_ZERO
@@ -333,9 +356,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
idx = ffs(mask) - 1;
/*
- * User space strings are faulted into a temporary buffer and then
- * added as a dynamic string to the end of the event.
- * The user space string name for the arg pointer is "__<arg>_val".
+ * User space data is faulted into a temporary buffer and then
+ * added as a dynamic string or array to the end of the event.
+ * The user space data name for the arg pointer is "__<arg>_val".
*/
len = strlen(meta->args[idx]) + sizeof("___val");
arg = kmalloc(len, GFP_KERNEL);
@@ -458,6 +481,7 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
unsigned long mask = sys_data->user_mask;
unsigned int cnt;
int idx = ffs(mask) - 1;
+ bool array = false;
char *ptr;
int trys = 0;
int ret;
@@ -500,6 +524,18 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
/* Read the current CPU context switch counter */
cnt = nr_context_switches_cpu(cpu);
+ /*
+ * If this system call event has a size argument, use
+ * it to define how much of user space memory to read,
+ * and read it as an array and not a string.
+ */
+ if (sys_data->user_arg_size >= 0) {
+ array = true;
+ size = args[sys_data->user_arg_size];
+ if (size > SYSCALL_FAULT_BUF_SZ - 1)
+ size = SYSCALL_FAULT_BUF_SZ - 1;
+ }
+
/*
* Preemption is going to be enabled, but this task must
* remain on this CPU.
@@ -512,7 +548,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
*/
preempt_enable_notrace();
- ret = strncpy_from_user(buf, ptr, size);
+ if (array) {
+ ret = __copy_from_user(buf, ptr, size);
+ ret = ret ? -1 : size;
+ } else {
+ ret = strncpy_from_user(buf, ptr, size);
+ }
preempt_disable_notrace();
migrate_enable();
@@ -530,22 +571,24 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (nr_context_switches_cpu(cpu) != cnt)
goto again;
- /* Replace any non-printable characters with '.' */
- for (int i = 0; i < ret; i++) {
- if (!isprint(buf[i]))
- buf[i] = '.';
- }
+ /* For strings, replace any non-printable characters with '.' */
+ if (!array) {
+ for (int i = 0; i < ret; i++) {
+ if (!isprint(buf[i]))
+ buf[i] = '.';
+ }
- /*
- * If the text was truncated due to our max limit, add "..." to
- * the string.
- */
- if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
- strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
- EXTRA, sizeof(EXTRA));
- ret = SYSCALL_FAULT_BUF_SZ;
- } else {
- buf[ret++] = '\0';
+ /*
+ * If the text was truncated due to our max limit, add "..." to
+ * the string.
+ */
+ if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
+ strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
+ EXTRA, sizeof(EXTRA));
+ ret = SYSCALL_FAULT_BUF_SZ;
+ } else {
+ buf[ret++] = '\0';
+ }
}
*data_size = ret;
@@ -642,6 +685,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
/* Store the offset and the size into the meta data */
*(int *)ptr = val | (user_size << 16);
+ if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
+ user_size = 0;
+
/* Nothing to do if the user space was empty or faulted */
if (user_size) {
/* Now store the user space data into the event */
@@ -795,7 +841,16 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
if (sys_data->enter_event != call)
return;
+ sys_data->user_arg_size = -1;
+
switch (nr) {
+ /* user arg 1 with size arg at 2 */
+ case __NR_write:
+ case __NR_mq_timedsend:
+ case __NR_pwrite64:
+ sys_data->user_mask = BIT(1);
+ sys_data->user_arg_size = 2;
+ break;
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
--
2.50.1
* [PATCH v2 5/8] tracing: Display some syscall arrays as strings
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (3 preceding siblings ...)
2025-09-23 13:05 ` [PATCH v2 4/8] tracing: Have system call events record user array data Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 6/8] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
` (2 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Some of the system calls that read a fixed length of memory from the user
space address are not arrays but strings. Take a bit away from the nb_args
field in the syscall meta data to use as a flag to denote that the system
call's user_arg_size is being used as a string. The nb_args should never
be more than 6, so 7 bits is plenty to hold that number. Add the
user_arg_is_str flag that, when set, will display the data array from the
user space address as a string and not an array.
This will allow the output to look like this:
sys_sethostname(name: 0x5584310eb2a0 "debian", len: 6)
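
The string versus array decision described above can be sketched in user
space as follows. The struct meta_sketch type and the make_meta() and
shows_as_string() helpers are illustrative stand-ins, not kernel code;
only the field names and the test itself are taken from this patch. Note
that bit-fields on unsigned char are a compiler extension that GCC and
Clang accept.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the syscall_metadata fields touched here */
struct meta_sketch {
	unsigned char	nb_args:7;		/* never more than 6 */
	unsigned char	user_arg_is_str:1;	/* sized data is a string */
	signed char	user_arg_size;		/* -1 when there is no size arg */
	short		user_mask;
};

/* Hypothetical constructor that checks nb_args fits the assumption */
static struct meta_sketch make_meta(unsigned int nb_args, int size_idx,
				    bool is_str, short mask)
{
	struct meta_sketch m;

	assert(nb_args <= 6);
	m.nb_args = nb_args;
	m.user_arg_is_str = is_str;
	m.user_arg_size = (signed char)size_idx;
	m.user_mask = mask;
	return m;
}

/*
 * Same test the patch adds in three places: no size argument means a
 * NUL terminated string, and a size argument with the flag set means a
 * fixed length string. Everything else is printed as an array.
 */
static bool shows_as_string(const struct meta_sketch *m)
{
	return m->user_arg_size < 0 || m->user_arg_is_str;
}
```

With these helpers, a sethostname-like entry (size at index 1, flag set)
is shown as a string, while a write-like entry (size at index 2, flag
clear) is shown as an array.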
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250805193235.416382557@kernel.org
- Hide kexec_file_load around
#if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG != 32
to not break the i386 build.
include/trace/syscall.h | 4 +++-
kernel/trace/trace_syscalls.c | 22 +++++++++++++++++++---
2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 9413c139da66..0dd7f2b33431 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -16,6 +16,7 @@
* @name: name of the syscall
* @syscall_nr: number of the syscall
* @nb_args: number of parameters it takes
+ * @user_arg_is_str: set if the arg for @user_arg_size is a string
* @user_arg_size: holds @arg that has size of the user space to read
* @user_mask: mask of @args that will read user space
* @types: list of types as strings
@@ -27,7 +28,8 @@
struct syscall_metadata {
const char *name;
int syscall_nr;
- u8 nb_args;
+ u8 nb_args:7;
+ u8 user_arg_is_str:1;
s8 user_arg_size;
short user_mask;
const char **types;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 7658b592c55f..64be38cf790d 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -184,7 +184,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
ptr = (void *)ent + (val & 0xffff);
len = val >> 16;
- if (entry->user_arg_size < 0) {
+ if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
trace_seq_printf(s, " \"%.*s\"", len, ptr);
continue;
}
@@ -249,6 +249,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
static int __init
__set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
{
+ bool is_string = entry->user_arg_is_str;
int i;
int pos = 0;
@@ -266,7 +267,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
continue;
/* Add the format for the user space string or array */
- if (entry->user_arg_size < 0)
+ if (entry->user_arg_size < 0 || is_string)
pos += snprintf(buf + pos, LEN_OR_ZERO, " \\\"%%s\\\"");
else
pos += snprintf(buf + pos, LEN_OR_ZERO, " (%%s)");
@@ -279,7 +280,7 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
if (!(BIT(i) & entry->user_mask))
continue;
/* The user space data for arg has name __<arg>_val */
- if (entry->user_arg_size < 0) {
+ if (entry->user_arg_size < 0 || is_string) {
pos += snprintf(buf + pos, LEN_OR_ZERO, ", __get_str(__%s_val)",
entry->args[i]);
} else {
@@ -851,6 +852,21 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
sys_data->user_mask = BIT(1);
sys_data->user_arg_size = 2;
break;
+ /* user arg 0 with size arg at 1 as string */
+ case __NR_setdomainname:
+ case __NR_sethostname:
+ sys_data->user_mask = BIT(0);
+ sys_data->user_arg_size = 1;
+ sys_data->user_arg_is_str = 1;
+ break;
+#if defined(__ARCH_WANT_TIME32_SYSCALLS) || __BITS_PER_LONG != 32
+ /* user arg 4 with size arg at 3 as string */
+ case __NR_kexec_file_load:
+ sys_data->user_mask = BIT(4);
+ sys_data->user_arg_size = 3;
+ sys_data->user_arg_is_str = 1;
+ break;
+#endif
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
--
2.50.1
* [PATCH v2 6/8] tracing: Allow syscall trace events to read more than one user parameter
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (4 preceding siblings ...)
2025-09-23 13:05 ` [PATCH v2 5/8] tracing: Display some syscall arrays as strings Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 8/8] tracing: Show printable characters in syscall arrays Steven Rostedt
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
Allow more than one field of a syscall trace event to read user space.
Build on top of the user_mask by allowing more than one bit to be set that
corresponds to the @args array of the syscall metadata. For each argument
in the @args array that is to be read, it will have a dynamic array/string
field associated to it.
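
A minimal user space sketch of walking such a mask is shown below. The
patch itself uses ffs() and clears each bit as it goes; this example
uses a plain shift loop instead so that it builds without kernel
headers. MAX_USER_ARGS and user_arg_indexes() are names invented for the
example, and unlike the patch (which warns when the limit is hit) the
sketch silently stops at the limit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_USER_ARGS	3	/* mirrors SYSCALL_FAULT_MAX_CNT in the patch */

/*
 * Collect the indexes of the set bits in "mask", lowest bit first.
 * At most MAX_USER_ARGS indexes are stored. Returns how many were found.
 */
static int user_arg_indexes(unsigned long mask, int idx[MAX_USER_ARGS])
{
	int n = 0;

	assert(idx != NULL);
	for (int bit = 0; mask && n < MAX_USER_ARGS; bit++, mask >>= 1) {
		if (mask & 1UL)
			idx[n++] = bit;
	}
	return n;
}
```

For a mask of BIT(1) | BIT(3), as used for renameat2, this yields the
indexes 1 and 3, which are the oldname and newname arguments.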
Note that reading multiple fields from user space is not supported if
the user_arg_size field is set in the syscall metadata. That field can only
be used if only one field is being read from user space as that field is a
number representing the size field of the syscall event that holds the
size of the data to read from user space. It becomes ambiguous if the
system call reads more than one field. Currently this is not an issue.
If a syscall event happens to enable two fields to read user space and
sets the user_arg_size field, it will trigger a warning at boot and the
user_arg_size field will be cleared.
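
The check behind that warning relies on the usual "more than one bit
set" test, mask & (mask - 1). A small user space sketch follows; the
helper names are hypothetical, and the kernel version wraps the test in
WARN_ON() rather than returning a value.

```c
#include <assert.h>
#include <stdbool.h>

/* True when more than one bit is set in the mask */
static bool reads_multiple_args(unsigned long mask)
{
	return (mask & (mask - 1)) != 0;
}

/*
 * Hypothetical helper: return the user_arg_size to keep. A size index
 * is only meaningful when exactly one user space pointer is read, so
 * it is reset to -1 when the mask has more than one bit set.
 */
static int sanitize_user_arg_size(unsigned long mask, int user_arg_size)
{
	if (user_arg_size >= 0 && reads_multiple_args(mask))
		return -1;
	return user_arg_size;
}
```

Subtracting one flips the lowest set bit and everything below it, so the
AND is non-zero only if a higher bit was also set.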
The per CPU buffer that is used to read the user space addresses is now
broken up into 3 sections, each of 168 bytes. The reason for 168 is that
it is the biggest portion of 512 bytes divided by 3 that is 8 byte aligned.
The max amount copied into the ring buffer from user space is now only 128
bytes, which is plenty. When reading user space, it still reads 167
(168-1) bytes and uses the remaining to know if it should append the extra
"..." to the end or not.
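
The arithmetic and the truncation rule above can be checked with a short
user space sketch. The macro names below are shortened stand-ins for the
SYSCALL_FAULT_* defines in this patch, finish_string() is a hypothetical
helper, and _Static_assert assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SZ		512
#define MAX_CNT		3
/* 512 / 3 is 170; rounding down to a multiple of 8 gives 168 */
#define ARG_SZ		((BUF_SZ / MAX_CNT) & ~7)
#define USER_MAX	128
#define EXTRA		"..."

_Static_assert(ARG_SZ == 168, "each section is 168 bytes");
_Static_assert(USER_MAX + sizeof(EXTRA) < ARG_SZ,
	       "the truncation marker must fit in a section");

/*
 * "buf" is one section holding "len" bytes read from user space (at
 * most ARG_SZ - 1). If more than USER_MAX bytes were read, cut the
 * string at USER_MAX and append "...". Returns the number of bytes to
 * record in the ring buffer, including the terminating NUL.
 */
static size_t finish_string(char buf[ARG_SZ], size_t len)
{
	assert(len < ARG_SZ);
	if (len > USER_MAX) {
		memcpy(buf + USER_MAX, EXTRA, sizeof(EXTRA));
		return USER_MAX + sizeof(EXTRA);
	}
	buf[len] = '\0';
	return len + 1;
}
```

Reading up to 167 bytes while recording at most 128 is what lets the
code tell a string of exactly 128 bytes apart from a longer one that
needs the marker.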
This will allow the event to look like this:
sys_renameat2(olddfd: 0xffffff9c, oldname: 0x7ffe02facdff "/tmp/x", newdfd: 0xffffff9c, newname: 0x7ffe02face06 "/tmp/y", flags: 1)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250805193235.582013098@kernel.org
- Added __user annotation to variable copying from user (kernel test robot)
kernel/trace/trace_syscalls.c | 312 ++++++++++++++++++++++------------
1 file changed, 207 insertions(+), 105 deletions(-)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 64be38cf790d..b602c9a7dbd8 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -138,6 +138,7 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
struct syscall_metadata *entry;
int i, syscall, val, len;
unsigned char *ptr;
+ int offset = 0;
trace = (typeof(trace))ent;
syscall = trace->nr;
@@ -177,12 +178,13 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
continue;
/* This arg points to a user space string */
- ptr = (void *)trace->args + sizeof(long) * entry->nb_args;
+ ptr = (void *)trace->args + sizeof(long) * entry->nb_args + offset;
val = *(int *)ptr;
/* The value is a dynamic string (len << 16 | offset) */
ptr = (void *)ent + (val & 0xffff);
len = val >> 16;
+ offset += 4;
if (entry->user_arg_size < 0 || entry->user_arg_is_str) {
trace_seq_printf(s, " \"%.*s\"", len, ptr);
@@ -335,7 +337,6 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
unsigned long mask;
char *arg;
int offset = offsetof(typeof(trace), args);
- int idx;
int ret = 0;
int len;
int i;
@@ -354,27 +355,35 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
return ret;
mask = meta->user_mask;
- idx = ffs(mask) - 1;
- /*
- * User space data is faulted into a temporary buffer and then
- * added as a dynamic string or array to the end of the event.
- * The user space data name for the arg pointer is "__<arg>_val".
- */
- len = strlen(meta->args[idx]) + sizeof("___val");
- arg = kmalloc(len, GFP_KERNEL);
- if (WARN_ON_ONCE(!arg)) {
- meta->user_mask = 0;
- return -ENOMEM;
- }
+ while (mask) {
+ int idx = ffs(mask) - 1;
+ mask &= ~BIT(idx);
+
+ /*
+ * User space data is faulted into a temporary buffer and then
+ * added as a dynamic string or array to the end of the event.
+ * The user space data name for the arg pointer is
+ * "__<arg>_val".
+ */
+ len = strlen(meta->args[idx]) + sizeof("___val");
+ arg = kmalloc(len, GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg)) {
+ meta->user_mask = 0;
+ return -ENOMEM;
+ }
- snprintf(arg, len, "__%s_val", meta->args[idx]);
+ snprintf(arg, len, "__%s_val", meta->args[idx]);
- ret = trace_define_field(call, "__data_loc char[]",
- arg, offset, sizeof(int), 0,
- FILTER_OTHER);
- if (ret)
- kfree(arg);
+ ret = trace_define_field(call, "__data_loc char[]",
+ arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret) {
+ kfree(arg);
+ break;
+ }
+ offset += 4;
+ }
return ret;
}
@@ -387,8 +396,25 @@ struct syscall_buf_info {
struct syscall_buf __percpu *sbuf;
};
-/* Create a per CPU temporary buffer to copy user space pointers into */
+/*
+ * Create a per CPU temporary buffer to copy user space pointers into.
+ *
+ * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use
+ * to copy memory from user space addresses into.
+ *
+ * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space.
+ *
+ * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
+ * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
+ * needs to append the EXTRA or not.
+ *
+ * This only allows up to 3 args from system calls.
+ */
#define SYSCALL_FAULT_BUF_SZ 512
+#define SYSCALL_FAULT_ARG_SZ 168
+#define SYSCALL_FAULT_USER_MAX 128
+#define SYSCALL_FAULT_MAX_CNT 3
+
static struct syscall_buf_info *syscall_buffer;
static int syscall_fault_buffer_cnt;
@@ -473,23 +499,58 @@ static void syscall_fault_buffer_disable(void)
call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
}
-static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_buf_info *sinfo,
- unsigned long *args, unsigned int *data_size)
+static char *sys_fault_user(struct syscall_metadata *sys_data,
+ struct syscall_buf_info *sinfo,
+ unsigned long *args,
+ unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
{
int cpu = smp_processor_id();
- char *buf = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
- unsigned long size = SYSCALL_FAULT_BUF_SZ - 1;
+ char *buffer = per_cpu_ptr(sinfo->sbuf, cpu)->buf;
unsigned long mask = sys_data->user_mask;
+ unsigned long size = SYSCALL_FAULT_ARG_SZ - 1;
unsigned int cnt;
- int idx = ffs(mask) - 1;
bool array = false;
- char *ptr;
+ char *ptr_array[SYSCALL_FAULT_MAX_CNT];
+ char *buf;
+ int read[SYSCALL_FAULT_MAX_CNT];
int trys = 0;
+ int uargs;
int ret;
+ int i = 0;
+
+ /* The extra is appended to the user data in the buffer */
+ BUILD_BUG_ON(SYSCALL_FAULT_USER_MAX + sizeof(EXTRA) >=
+ SYSCALL_FAULT_ARG_SZ);
+
+ /*
+ * If this system call event has a size argument, use
+ * it to define how much of user space memory to read,
+ * and read it as an array and not a string.
+ */
+ if (sys_data->user_arg_size >= 0) {
+ array = true;
+ size = args[sys_data->user_arg_size];
+ if (size > SYSCALL_FAULT_ARG_SZ - 1)
+ size = SYSCALL_FAULT_ARG_SZ - 1;
+ }
+
+ while (mask) {
+ int idx = ffs(mask) - 1;
+ mask &= ~BIT(idx);
+
+ if (WARN_ON_ONCE(i == SYSCALL_FAULT_MAX_CNT))
+ break;
+
+ /* Get the pointer to user space memory to read */
+ ptr_array[i++] = (char *)args[idx];
+ }
- /* Get the pointer to user space memory to read */
- ptr = (char *)args[idx];
- *data_size = 0;
+ uargs = i;
+
+ /* Clear the values that are not used */
+ for (; i < SYSCALL_FAULT_MAX_CNT; i++) {
+ data_size[i] = -1; /* Denotes no pointer */
+ }
/*
* This acts similar to a seqcount. The per CPU context switches are
@@ -519,24 +580,12 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (!once)
pr_warn("Error: Too many tries to read syscall %s\n", sys_data->name);
once = true;
- return buf;
+ return buffer;
}
/* Read the current CPU context switch counter */
cnt = nr_context_switches_cpu(cpu);
- /*
- * If this system call event has a size argument, use
- * it to define how much of user space memory to read,
- * and read it as an array and not a string.
- */
- if (sys_data->user_arg_size >= 0) {
- array = true;
- size = args[sys_data->user_arg_size];
- if (size > SYSCALL_FAULT_BUF_SZ - 1)
- size = SYSCALL_FAULT_BUF_SZ - 1;
- }
-
/*
* Preemption is going to be enabled, but this task must
* remain on this CPU.
@@ -549,20 +598,23 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
*/
preempt_enable_notrace();
- if (array) {
- ret = __copy_from_user(buf, ptr, size);
- ret = ret ? -1 : size;
- } else {
- ret = strncpy_from_user(buf, ptr, size);
+ buf = buffer;
+
+ for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+ char __user *ptr = (char __user *)ptr_array[i];
+
+ if (array) {
+ ret = __copy_from_user(buf, ptr, size);
+ ret = ret ? -1 : size;
+ } else {
+ ret = strncpy_from_user(buf, ptr, size);
+ }
+ read[i] = ret;
}
preempt_disable_notrace();
migrate_enable();
- /* If it faulted, no use to try again */
- if (ret < 0)
- return buf;
-
/*
* Preemption is disabled again, now check the per CPU context
* switch counter. If it doesn't match, then another user space
@@ -572,28 +624,39 @@ static char *sys_fault_user(struct syscall_metadata *sys_data, struct syscall_bu
if (nr_context_switches_cpu(cpu) != cnt)
goto again;
- /* For strings, replace any non-printable characters with '.' */
- if (!array) {
- for (int i = 0; i < ret; i++) {
- if (!isprint(buf[i]))
- buf[i] = '.';
- }
+ buf = buffer;
+ for (i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
- /*
- * If the text was truncated due to our max limit, add "..." to
- * the string.
- */
- if (ret > SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA)) {
- strscpy(buf + SYSCALL_FAULT_BUF_SZ - sizeof(EXTRA),
- EXTRA, sizeof(EXTRA));
- ret = SYSCALL_FAULT_BUF_SZ;
+ ret = read[i];
+ if (ret < 0)
+ continue;
+ buf[ret] = '\0';
+
+ /* For strings, replace any non-printable characters with '.' */
+ if (!array) {
+ for (int x = 0; x < ret; x++) {
+ if (!isprint(buf[x]))
+ buf[x] = '.';
+ }
+
+ /*
+ * If the text was truncated due to our max limit,
+ * add "..." to the string.
+ */
+ if (ret > SYSCALL_FAULT_USER_MAX) {
+ strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
+ sizeof(EXTRA));
+ ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+ } else {
+ buf[ret++] = '\0';
+ }
} else {
- buf[ret++] = '\0';
+ ret = min(ret, SYSCALL_FAULT_USER_MAX);
}
+ data_size[i] = ret;
}
- *data_size = ret;
- return buf;
+ return buffer;
}
static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
@@ -605,9 +668,10 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
struct trace_event_buffer fbuffer;
unsigned long args[6];
char *user_ptr;
- int user_size = 0;
+ int user_sizes[SYSCALL_FAULT_MAX_CNT] = {};
int syscall_nr;
int size = 0;
+ int uargs = 0;
bool mayfault;
/*
@@ -640,20 +704,27 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (mayfault) {
struct syscall_buf_info *sinfo;
+ int i;
/* If the syscall_buffer is NULL, tracing is being shutdown */
sinfo = READ_ONCE(syscall_buffer);
if (!sinfo)
return;
- user_ptr = sys_fault_user(sys_data, sinfo, args, &user_size);
+ user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
/*
* user_size is the amount of data to append.
* Need to add 4 for the meta field that points to
* the user memory at the end of the event and also
* stores its size.
*/
- size = 4 + user_size;
+ for (i = 0; i < SYSCALL_FAULT_MAX_CNT; i++) {
+ if (user_sizes[i] < 0)
+ break;
+ size += user_sizes[i] + 4;
+ }
+ /* Save the number of user read arguments of this syscall */
+ uargs = i;
}
size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
@@ -668,6 +739,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
memcpy(entry->args, args, sizeof(unsigned long) * sys_data->nb_args);
if (mayfault) {
+ char *buf = user_ptr;
void *ptr;
int val;
@@ -679,21 +751,30 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
/*
* The meta data will store the offset of the user data from
- * the beginning of the event.
+ * the beginning of the event. That is after the static arguments
+ * and the meta data fields.
*/
- val = (ptr - (void *)entry) + 4;
+ val = (ptr - (void *)entry) + 4 * uargs;
+
+ for (int i = 0; i < uargs; i++) {
- /* Store the offset and the size into the meta data */
- *(int *)ptr = val | (user_size << 16);
+ if (i)
+ val += user_sizes[i - 1];
- if (WARN_ON_ONCE((ptr - (void *)entry + user_size) > size))
- user_size = 0;
+ /* Store the offset and the size into the meta data */
+ *(int *)ptr = val | (user_sizes[i] << 16);
- /* Nothing to do if the user space was empty or faulted */
- if (user_size) {
- /* Now store the user space data into the event */
+ /* Skip the meta data */
ptr += 4;
- memcpy(ptr, user_ptr, user_size);
+ }
+
+ for (int i = 0; i < uargs; i++, buf += SYSCALL_FAULT_ARG_SZ) {
+ /* Nothing to do if the user space was empty or faulted */
+ if (!user_sizes[i])
+ continue;
+
+ memcpy(ptr, buf, user_sizes[i]);
+ ptr += user_sizes[i];
}
}
@@ -837,6 +918,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
static void check_faultable_syscall(struct trace_event_call *call, int nr)
{
struct syscall_metadata *sys_data = call->data;
+ unsigned long mask;
/* Only work on entry */
if (sys_data->enter_event != call)
@@ -870,7 +952,6 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
/* user arg at position 0 */
case __NR_access:
case __NR_acct:
- case __NR_add_key: /* Just _type. TODO add _description */
case __NR_chdir:
case __NR_chown:
case __NR_chmod:
@@ -879,28 +960,15 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_delete_module:
case __NR_execve:
case __NR_fsopen:
- case __NR_getxattr: /* Just pathname, TODO add name */
case __NR_lchown:
- case __NR_lgetxattr: /* Just pathname, TODO add name */
- case __NR_lremovexattr: /* Just pathname, TODO add name */
- case __NR_link: /* Just oldname. TODO add newname */
- case __NR_listxattr: /* Just pathname, TODO add list */
- case __NR_llistxattr: /* Just pathname, TODO add list */
- case __NR_lsetxattr: /* Just pathname, TODO add list */
case __NR_open:
case __NR_memfd_create:
- case __NR_mount: /* Just dev_name, TODO add dir_name and type */
case __NR_mkdir:
case __NR_mknod:
case __NR_mq_open:
case __NR_mq_unlink:
- case __NR_pivot_root: /* Just new_root, TODO add old_root */
case __NR_readlink:
- case __NR_removexattr: /* Just pathname, TODO add name */
- case __NR_rename: /* Just oldname. TODO add newname */
- case __NR_request_key: /* Just _type. TODO add _description */
case __NR_rmdir:
- case __NR_setxattr: /* Just pathname, TODO add list */
case __NR_shmdt:
case __NR_statfs:
case __NR_swapon:
@@ -927,14 +995,10 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_fspick:
case __NR_fremovexattr:
case __NR_futimesat:
- case __NR_getxattrat: /* Just pathname, TODO add name */
case __NR_inotify_add_watch:
- case __NR_linkat: /* Just oldname. TODO add newname */
- case __NR_listxattrat: /* Just pathname, TODO add list */
case __NR_mkdirat:
case __NR_mknodat:
case __NR_mount_setattr:
- case __NR_move_mount: /* Just from_pathname, TODO add to_pathname */
case __NR_name_to_handle_at:
#if defined(__ARCH_WANT_NEW_STAT) || defined(__ARCH_WANT_STAT64)
case __NR_newfstatat:
@@ -944,13 +1008,8 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_open_tree:
case __NR_open_tree_attr:
case __NR_readlinkat:
- case __NR_renameat: /* Just oldname. TODO add newname */
- case __NR_renameat2: /* Just oldname. TODO add newname */
- case __NR_removexattrat: /* Just pathname, TODO add name */
case __NR_quotactl:
- case __NR_setxattrat: /* Just pathname, TODO add list */
case __NR_syslog:
- case __NR_symlinkat: /* Just oldname. TODO add newname */
case __NR_statx:
case __NR_unlinkat:
case __NR_utimensat:
@@ -965,9 +1024,52 @@ static void check_faultable_syscall(struct trace_event_call *call, int nr)
case __NR_fanotify_mark:
sys_data->user_mask = BIT(4);
break;
+ /* 2 user args, 0 and 1 */
+ case __NR_add_key:
+ case __NR_getxattr:
+ case __NR_lgetxattr:
+ case __NR_lremovexattr:
+ case __NR_link:
+ case __NR_listxattr:
+ case __NR_llistxattr:
+ case __NR_lsetxattr:
+ case __NR_pivot_root:
+ case __NR_removexattr:
+ case __NR_rename:
+ case __NR_request_key:
+ case __NR_setxattr:
+ case __NR_symlinkat:
+ sys_data->user_mask = BIT(0) | BIT(1);
+ break;
+ /* 2 user args, 1 and 3 */
+ case __NR_getxattrat:
+ case __NR_linkat:
+ case __NR_listxattrat:
+ case __NR_move_mount:
+ case __NR_renameat:
+ case __NR_renameat2:
+ case __NR_removexattrat:
+ case __NR_setxattrat:
+ sys_data->user_mask = BIT(1) | BIT(3);
+ break;
+ case __NR_mount: /* dev_name, dir_name and type */
+ sys_data->user_mask = BIT(0) | BIT(1) | BIT(2);
+ break;
default:
sys_data->user_mask = 0;
+ return;
}
+
+ if (sys_data->user_arg_size < 0)
+ return;
+
+ /*
+ * The user_arg_size can only be used when the system call
+ * is reading only a single address from user space.
+ */
+ mask = sys_data->user_mask;
+ if (WARN_ON(mask & (mask - 1)))
+ sys_data->user_arg_size = -1;
}
static int __init init_syscall_trace(struct trace_event_call *call)
--
2.50.1
* [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (5 preceding siblings ...)
2025-09-23 13:05 ` [PATCH v2 6/8] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
2025-09-24 9:49 ` kernel test robot
2025-09-23 13:05 ` [PATCH v2 8/8] tracing: Show printable characters in syscall arrays Steven Rostedt
7 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
When a system call that reads user space addresses copies the data into the
ring buffer, it can copy up to 511 bytes of data. This can waste precious ring
buffer space if the user isn't interested in the output. Add a new file
"syscall_user_buf_size" that gets initialized to a new config
CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT that defaults to 63.
Also lower the max down to 165, as this isn't meant to record everything that
a system call may be passing through to the kernel. 165 is more than enough.
The reason for 165 is that adding one for the nul terminating byte, as
well as possibly needing to append the "..." string, turns it into 170
bytes. This needs to save up to 3 arguments, and 3 * 170 is 510, which
fits nicely in 512 bytes (a power of 2).
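The size arithmetic in this description can be checked with a small user
space sketch (not part of the patch). The macro names mirror the ones used in
kernel/trace/trace_syscalls.c; EXTRA_LEN and clamp_user_size() are
hypothetical names added only for illustration:

```c
#include <assert.h>

/* Values taken from the patch description. */
#define SYSCALL_FAULT_USER_MAX	165	/* max bytes copied from user space */
#define SYSCALL_FAULT_MAX_CNT	3	/* at most 3 user args per syscall */

/* Assumption: 4 bytes are reserved for the appended "..." marker. */
#define EXTRA_LEN		4

/* Data + nul terminating byte + room for the marker. */
#define SYSCALL_FAULT_ARG_SZ	(SYSCALL_FAULT_USER_MAX + 1 + EXTRA_LEN)
#define SYSCALL_FAULT_BUF_SZ	(SYSCALL_FAULT_ARG_SZ * SYSCALL_FAULT_MAX_CNT)

static_assert(SYSCALL_FAULT_ARG_SZ == 170, "165 + 1 + 4 is 170");
static_assert(SYSCALL_FAULT_BUF_SZ == 510, "3 * 170 is 510");
static_assert(SYSCALL_FAULT_BUF_SZ <= 512, "fits in 512 bytes");

/* Hypothetical helper: clamp a requested size to the maximum. */
static unsigned int clamp_user_size(unsigned long val)
{
	if (val > SYSCALL_FAULT_USER_MAX)
		return SYSCALL_FAULT_USER_MAX;
	return (unsigned int)val;
}
```

Deriving the buffer size from the per argument size keeps the three values
consistent if the max is ever changed again.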
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://lore.kernel.org/20250805193235.747004484@kernel.org
- Change default to 63 (127 seemed too much)
- Change the max to 165 to fill in the extra data.
- Use the size macros of the max size and max args to calculate the size
of the buffer to save the values in.
Documentation/trace/ftrace.rst | 8 ++++++
kernel/trace/Kconfig | 13 +++++++++
kernel/trace/trace.c | 52 ++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 3 ++
kernel/trace/trace_syscalls.c | 42 ++++++++++++++-------------
5 files changed, 98 insertions(+), 20 deletions(-)
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index af66a05e18cc..87fd3ed1301f 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -366,6 +366,14 @@ of ftrace. Here is a list of some of the key files:
for each function. The displayed address is the patch-site address
and can differ from /proc/kallsyms address.
+ syscall_user_buf_size:
+
+ Some system call trace events will record the data from a user
+ space address that one of the parameters points to. The amount of
+ data per event is limited. This file holds the max number of bytes
+ that will be recorded into the ring buffer to hold this data.
+ The max value is currently 165.
+
dyn_ftrace_total_info:
This file is for debugging purposes. The number of functions that
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index d2c79da81e4f..a055ca174da5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -575,6 +575,19 @@ config FTRACE_SYSCALLS
help
Basic tracer to catch the syscall entry and exit events.
+config TRACE_SYSCALL_BUF_SIZE_DEFAULT
+ int "System call user read max size"
+ range 0 165
+ default 63
+ depends on FTRACE_SYSCALLS
+ help
+ Some system call trace events will record the data from a user
+ space address that one of the parameters points to. The amount of
+ data per event is limited. It may be further limited by this
+ config and later changed by writing an ASCII number into:
+
+ /sys/kernel/tracing/syscall_user_buf_size
+
config TRACER_SNAPSHOT
bool "Create a snapshot trace buffer"
select TRACER_MAX_TRACE
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1b7db732c0b1..a3d2e7d1c664 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6913,6 +6913,43 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
goto out;
}
+static ssize_t
+tracing_syscall_buf_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct inode *inode = file_inode(filp);
+ struct trace_array *tr = inode->i_private;
+ char buf[64];
+ int r;
+
+ r = snprintf(buf, 64, "%d\n", tr->syscall_buf_sz);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
+tracing_syscall_buf_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct inode *inode = file_inode(filp);
+ struct trace_array *tr = inode->i_private;
+ unsigned long val;
+ int ret;
+
+ ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+ if (ret)
+ return ret;
+
+ if (val > SYSCALL_FAULT_USER_MAX)
+ val = SYSCALL_FAULT_USER_MAX;
+
+ tr->syscall_buf_sz = val;
+
+ *ppos += cnt;
+
+ return cnt;
+}
+
static ssize_t
tracing_entries_read(struct file *filp, char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -7737,6 +7774,14 @@ static const struct file_operations tracing_entries_fops = {
.release = tracing_release_generic_tr,
};
+static const struct file_operations tracing_syscall_buf_fops = {
+ .open = tracing_open_generic_tr,
+ .read = tracing_syscall_buf_read,
+ .write = tracing_syscall_buf_write,
+ .llseek = generic_file_llseek,
+ .release = tracing_release_generic_tr,
+};
+
static const struct file_operations tracing_buffer_meta_fops = {
.open = tracing_buffer_meta_open,
.read = seq_read,
@@ -9839,6 +9884,8 @@ trace_array_create_systems(const char *name, const char *systems,
raw_spin_lock_init(&tr->start_lock);
+ tr->syscall_buf_sz = global_trace.syscall_buf_sz;
+
tr->max_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
#ifdef CONFIG_TRACER_MAX_TRACE
spin_lock_init(&tr->snapshot_trigger_lock);
@@ -10155,6 +10202,9 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
tr, &buffer_subbuf_size_fops);
+ trace_create_file("syscall_user_buf_size", TRACE_MODE_WRITE, d_tracer,
+ tr, &tracing_syscall_buf_fops);
+
create_trace_options_dir(tr);
#ifdef CONFIG_TRACER_MAX_TRACE
@@ -11081,6 +11131,8 @@ __init static int tracer_alloc_buffers(void)
global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
+ global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
+
INIT_LIST_HEAD(&global_trace.systems);
INIT_LIST_HEAD(&global_trace.events);
INIT_LIST_HEAD(&global_trace.hist_vars);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 85eabb454bee..0499e6dd51fa 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -131,6 +131,8 @@ enum trace_type {
#define HIST_STACKTRACE_SIZE (HIST_STACKTRACE_DEPTH * sizeof(unsigned long))
#define HIST_STACKTRACE_SKIP 5
+#define SYSCALL_FAULT_USER_MAX 165
+
/*
* syscalls are special, and need special handling, this is why
* they are not included in trace_entries.h
@@ -430,6 +432,7 @@ struct trace_array {
int function_enabled;
#endif
int no_filter_buffering_ref;
+ unsigned int syscall_buf_sz;
struct list_head hist_vars;
#ifdef CONFIG_TRACER_SNAPSHOT
struct cond_snapshot *cond_snapshot;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index b602c9a7dbd8..367e10096c6f 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -399,24 +399,21 @@ struct syscall_buf_info {
/*
* Create a per CPU temporary buffer to copy user space pointers into.
*
- * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use
- * to copy memory from user space addresses into.
- *
- * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space.
- *
- * SYSCALL_FAULT_USER_MAX is the amount to copy into the ring buffer.
- * It's slightly smaller than SYSCALL_FAULT_ARG_SZ to know if it
- * needs to append the EXTRA or not.
+ * SYSCALL_FAULT_USER_MAX is the amount to copy from user space.
+ * (defined in kernel/trace/trace.h)
+ *
+ * SYSCALL_FAULT_ARG_SZ is the amount to copy from user space plus the
+ * nul terminating byte and possibly appended EXTRA (4 bytes).
*
- * This only allows up to 3 args from system calls.
+ * SYSCALL_FAULT_BUF_SZ holds the size of the per CPU buffer to use
+ * to copy memory from user space addresses into that will hold
+ * 3 args as only 3 args are allowed to be copied from system calls.
*/
-#define SYSCALL_FAULT_BUF_SZ 512
-#define SYSCALL_FAULT_ARG_SZ 168
-#define SYSCALL_FAULT_USER_MAX 128
+#define SYSCALL_FAULT_ARG_SZ (SYSCALL_FAULT_USER_MAX + 1 + 4)
#define SYSCALL_FAULT_MAX_CNT 3
+#define SYSCALL_FAULT_BUF_SZ (SYSCALL_FAULT_ARG_SZ * SYSCALL_FAULT_MAX_CNT)
static struct syscall_buf_info *syscall_buffer;
-
static int syscall_fault_buffer_cnt;
static void syscall_fault_buffer_free(struct syscall_buf_info *sinfo)
@@ -499,7 +496,7 @@ static void syscall_fault_buffer_disable(void)
call_rcu_tasks_trace(&sinfo->rcu, rcu_free_syscall_buffer);
}
-static char *sys_fault_user(struct syscall_metadata *sys_data,
+static char *sys_fault_user(struct trace_array *tr, struct syscall_metadata *sys_data,
struct syscall_buf_info *sinfo,
unsigned long *args,
unsigned int data_size[SYSCALL_FAULT_MAX_CNT])
@@ -552,6 +549,10 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
data_size[i] = -1; /* Denotes no pointer */
}
+ /* A zero size means do not even try */
+ if (!tr->syscall_buf_sz)
+ return buffer;
+
/*
* This acts similar to a seqcount. The per CPU context switches are
* recorded, migration is disabled and preemption is enabled. The
@@ -639,19 +640,20 @@ static char *sys_fault_user(struct syscall_metadata *sys_data,
buf[x] = '.';
}
+ size = min(tr->syscall_buf_sz, SYSCALL_FAULT_USER_MAX);
+
/*
* If the text was truncated due to our max limit,
* add "..." to the string.
*/
- if (ret > SYSCALL_FAULT_USER_MAX) {
- strscpy(buf + SYSCALL_FAULT_USER_MAX, EXTRA,
- sizeof(EXTRA));
- ret = SYSCALL_FAULT_USER_MAX + sizeof(EXTRA);
+ if (ret > size) {
+ strscpy(buf + size, EXTRA, sizeof(EXTRA));
+ ret = size + sizeof(EXTRA);
} else {
buf[ret++] = '\0';
}
} else {
- ret = min(ret, SYSCALL_FAULT_USER_MAX);
+ ret = min((unsigned int)ret, tr->syscall_buf_sz);
}
data_size[i] = ret;
}
@@ -711,7 +713,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (!sinfo)
return;
- user_ptr = sys_fault_user(sys_data, sinfo, args, user_sizes);
+ user_ptr = sys_fault_user(tr, sys_data, sinfo, args, user_sizes);
/*
* user_size is the amount of data to append.
* Need to add 4 for the meta field that points to
--
2.50.1
* [PATCH v2 8/8] tracing: Show printable characters in syscall arrays
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
` (6 preceding siblings ...)
2025-09-23 13:05 ` [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
@ 2025-09-23 13:05 ` Steven Rostedt
7 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-23 13:05 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
From: Steven Rostedt <rostedt@goodmis.org>
When displaying the contents of the user space data passed to the kernel,
instead of just showing the array values, also print any printable
content.
Instead of just:
bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20), count: 0x16)
Display:
bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20) "root@debian-x86-64:~# ", count: 0x16)
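The output format above can be reproduced with a small user space sketch.
This is not the kernel code: format_user_array() is a hypothetical helper
that writes into a plain buffer instead of a trace_seq, and it uses a
"c < 0x80" test where the patch uses isascii():

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format @len bytes of @data as
 * (72:6f) "ro" into @out. Non printable bytes show up as '.' in the
 * quoted part, and the quoted part is skipped when nothing is printable.
 * Returns the number of characters written (excluding the nul), or -1
 * if @out may be too small.
 */
static int format_user_array(char *out, size_t size,
			     const unsigned char *data, size_t len)
{
	size_t pos = 0;
	int printable = 0;

	/* Conservative bound: 3 per byte for hex, 1 per byte for text, 6 extra */
	if (size < 4 * len + 6)
		return -1;

	out[pos++] = '(';
	for (size_t x = 0; x < len; x++) {
		if (data[x] < 0x80 && isprint(data[x]))
			printable = 1;
		pos += sprintf(out + pos, "%s%02x", x ? ":" : "", data[x]);
	}
	out[pos++] = ')';

	/* If nothing is printable, do not add the quoted string */
	if (printable) {
		out[pos++] = ' ';
		out[pos++] = '"';
		for (size_t x = 0; x < len; x++) {
			int c = data[x];

			out[pos++] = (c < 0x80 && isprint(c)) ? (char)c : '.';
		}
		out[pos++] = '"';
	}
	out[pos] = '\0';
	return (int)pos;
}
```

Two passes keep the hex dump and the quoted text separate, which matches the
order shown in the example output.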
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_syscalls.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 367e10096c6f..0625a32f01dd 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -155,6 +155,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
trace_seq_printf(s, "%s(", entry->name);
for (i = 0; i < entry->nb_args; i++) {
+ bool printable = false;
+ char *str;
if (trace_seq_has_overflowed(s))
goto end;
@@ -193,8 +195,11 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
val = trace->args[entry->user_arg_size];
+ str = ptr;
trace_seq_puts(s, " (");
for (int x = 0; x < len; x++, ptr++) {
+ if (isascii(*ptr) && isprint(*ptr))
+ printable = true;
if (x)
trace_seq_putc(s, ':');
trace_seq_printf(s, "%02x", *ptr);
@@ -203,6 +208,22 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
trace_seq_printf(s, ", %s", EXTRA);
trace_seq_putc(s, ')');
+
+ /* If nothing is printable, don't bother printing anything */
+ if (!printable)
+ continue;
+
+ trace_seq_puts(s, " \"");
+ for (int x = 0; x < len; x++) {
+ if (isascii(str[x]) && isprint(str[x]))
+ trace_seq_putc(s, str[x]);
+ else
+ trace_seq_putc(s, '.');
+ }
+ if (len < val)
+ trace_seq_printf(s, "\"%s", EXTRA);
+ else
+ trace_seq_putc(s, '"');
}
trace_seq_putc(s, ')');
--
2.50.1
* Re: [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written
2025-09-23 13:05 ` [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
@ 2025-09-24 9:49 ` kernel test robot
0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2025-09-24 9:49 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: llvm, oe-kbuild-all, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linux Memory Management List,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Tom Zanussi,
Thomas Gleixner, Ian Rogers, Douglas Raillard
Hi Steven,
kernel test robot noticed the following build errors:
[auto build test ERROR on trace/for-next]
[also build test ERROR on linus/master v6.17-rc7 next-20250923]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Replace-syscall-RCU-pointer-assignment-with-READ-WRITE_ONCE/20250923-210948
base: https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/r/20250923130714.603760198%40kernel.org
patch subject: [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20250924/202509241709.5vLMGNLe-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250924/202509241709.5vLMGNLe-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509241709.5vLMGNLe-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/trace/trace.c:11128:32: error: use of undeclared identifier 'CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT'
11128 | global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
| ^
1 error generated.
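A likely cause, stated here as an assumption based on the Kconfig hunk of the
patch rather than on anything in this report: TRACE_SYSCALL_BUF_SIZE_DEFAULT
depends on FTRACE_SYSCALLS, so a config without FTRACE_SYSCALLS never defines
the CONFIG_ macro, while kernel/trace/trace.c uses it unconditionally. One
possible guard is sketched below. SYSCALL_BUF_SZ_DEFAULT and
default_syscall_buf_sz() are hypothetical names, and falling back to 0 is
only one option, not necessarily the author's fix:

```c
#include <assert.h>

/*
 * Hypothetical fallback: when Kconfig does not define the symbol
 * (its dependency is unmet), default to 0, which the patch treats
 * as "do not copy from user space".
 */
#ifdef CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT
# define SYSCALL_BUF_SZ_DEFAULT	CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT
#else
# define SYSCALL_BUF_SZ_DEFAULT	0
#endif

/* The Kconfig entry restricts the value to the range 0..165. */
static_assert(SYSCALL_BUF_SZ_DEFAULT >= 0 && SYSCALL_BUF_SZ_DEFAULT <= 165,
	      "default must stay within the Kconfig range");

/* Stand-in for the assignment done in tracer_alloc_buffers(). */
static unsigned int default_syscall_buf_sz(void)
{
	return SYSCALL_BUF_SZ_DEFAULT;
}
```

Dropping the "depends on" line from the Kconfig entry, or wrapping the
assignment in an #ifdef CONFIG_FTRACE_SYSCALLS block, would be other ways to
get the same effect.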
vim +/CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT +11128 kernel/trace/trace.c
11110
11111 init_trace_flags_index(&global_trace);
11112
11113 register_tracer(&nop_trace);
11114
11115 /* Function tracing may start here (via kernel command line) */
11116 init_function_trace();
11117
11118 /* All seems OK, enable tracing */
11119 tracing_disabled = 0;
11120
11121 atomic_notifier_chain_register(&panic_notifier_list,
11122 &trace_panic_notifier);
11123
11124 register_die_notifier(&trace_die_notifier);
11125
11126 global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
11127
11128 global_trace.syscall_buf_sz = CONFIG_TRACE_SYSCALL_BUF_SIZE_DEFAULT;
11129
11130 INIT_LIST_HEAD(&global_trace.systems);
11131 INIT_LIST_HEAD(&global_trace.events);
11132 INIT_LIST_HEAD(&global_trace.hist_vars);
11133 INIT_LIST_HEAD(&global_trace.err_log);
11134 list_add(&global_trace.marker_list, &marker_copies);
11135 list_add(&global_trace.list, &ftrace_trace_arrays);
11136
11137 apply_trace_boot_options();
11138
11139 register_snapshot_cmd();
11140
11141 return 0;
11142
11143 out_free_pipe_cpumask:
11144 free_cpumask_var(global_trace.pipe_cpumask);
11145 out_free_savedcmd:
11146 trace_free_saved_cmdlines_buffer();
11147 out_free_temp_buffer:
11148 ring_buffer_free(temp_buffer);
11149 out_rm_hp_state:
11150 cpuhp_remove_multi_state(CPUHP_TRACE_RB_PREPARE);
11151 out_free_cpumask:
11152 free_cpumask_var(global_trace.tracing_cpumask);
11153 out_free_buffer_mask:
11154 free_cpumask_var(tracing_buffer_mask);
11155 return ret;
11156 }
11157
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 3/8] tracing: Have syscall trace events read user space string
2025-09-23 13:05 ` [PATCH v2 3/8] tracing: Have syscall trace events read user space string Steven Rostedt
@ 2025-09-25 7:26 ` Peter Zijlstra
2025-09-25 11:15 ` Steven Rostedt
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2025-09-25 7:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Namhyung Kim, Takaya Saeki,
Tom Zanussi, Thomas Gleixner, Ian Rogers, Douglas Raillard
On Tue, Sep 23, 2025 at 09:05:00AM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> As of commit 654ced4a1377 ("tracing: Introduce tracepoint_is_faultable()")
> system call trace events allow faulting in user space memory. Have some of
> the system call trace events take advantage of this.
>
> Introduce a way to read strings that are nul terminated into the trace
> event. The way this is accomplished is by creating a per CPU temporary
> buffer that is used to read unsafe user memory.
>
> When a syscall trace event needs to read user memory, it reads the per CPU
> schedule switch counter. It then disables migration and enables
> preemption, copies the user space memory into this buffer, then disables
> preemption again. It reads the per CPU schedule switch counter again and
> if it matches it considers the buffer is valid. Otherwise it needs to try
> again. This is similar to how seqcount works, but uses the per CPU context
> switch counter as the sequence counter.
And you can't just allocate memory and not bother with the
migrate_disable() and retry stuff because?
* Re: [PATCH v2 3/8] tracing: Have syscall trace events read user space string
2025-09-25 7:26 ` Peter Zijlstra
@ 2025-09-25 11:15 ` Steven Rostedt
2025-09-27 14:43 ` Steven Rostedt
0 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2025-09-25 11:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Namhyung Kim, Takaya Saeki,
Tom Zanussi, Thomas Gleixner, Ian Rogers, Douglas Raillard
On Thu, 25 Sep 2025 09:26:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> And you can't just allocate memory and not bother with the
> migrate_disable() and retry stuff because?
Because tracing is supposed to be as non-intrusive as possible. I'd
rather not call into the allocation system from a trace point. I'm not
sure what side effects that may cause either.
I have yet to cause the retry path under stress tests. I had to insert
a msleep() for testing purposes to make sure it worked.
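For readers following along, the retry scheme under discussion can be
sketched in user space. This is an illustration only, not the kernel code:
switch_count stands in for the per CPU context switch counter,
copy_may_schedule() stands in for the faultable copy from user space, and
the max_tries bound exists only so the sketch terminates:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per CPU context switch counter. */
static unsigned long switch_count;

/* Test knob: how many of the next copies get "preempted". */
static int preemptions_left;

/* Stand-in for the faultable copy from user space. */
static void copy_may_schedule(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
	if (preemptions_left > 0) {
		/* Another task ran here and may have reused the buffer. */
		preemptions_left--;
		switch_count++;
	}
}

/*
 * Copy @len bytes into the shared @buf, retrying while the counter
 * changed during the copy. Returns the number of attempts, or -1
 * after @max_tries failed attempts.
 */
static int read_user_retry(char *buf, const char *src, size_t len,
			   int max_tries)
{
	unsigned long seen;
	int tries = 0;

	assert(max_tries > 0);

	do {
		if (tries++ >= max_tries)
			return -1;
		seen = switch_count;		/* sample before the copy */
		copy_may_schedule(buf, src, len);
	} while (seen != switch_count);		/* changed: buf is suspect */

	return tries;
}
```

Setting preemptions_left to a non-zero value plays the role of the msleep()
mentioned above: it forces the counter to change during the copy so the
retry path is exercised.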
-- Steve
* Re: [PATCH v2 3/8] tracing: Have syscall trace events read user space string
2025-09-25 11:15 ` Steven Rostedt
@ 2025-09-27 14:43 ` Steven Rostedt
0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-09-27 14:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Namhyung Kim, Takaya Saeki,
Tom Zanussi, Thomas Gleixner, Ian Rogers, Douglas Raillard
On Thu, 25 Sep 2025 07:15:45 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> I have yet to cause the retry path under stress tests. I had to insert
> a msleep() for testing purposes to make sure it worked.
I wanted to update on this. My stress test wasn't stressing enough. So
I took a "cp-mmap" program I had that did this:
map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, ifd, 0);
if (map == MAP_FAILED)
exit(-1);
ofd = open(argv[2], O_WRONLY | O_TRUNC | O_CREAT, 0644);
if (ofd < 0)
exit(-1);
do {
r = write(ofd, map + w, st.st_size - w);
if (r < 0)
exit(-1);
w += r;
} while (w < st.st_size);
It mmaps the source file and then writes from that mapping. This would most
definitely fault at every write. Then I ran this while tracing the
sys_enter_write event, with a trace_printk() added in the goto again block:
# mkdir /tmp/dump
# for f in `find /usr/bin/ ` ; do n=`basename $f`; ./cp-mmap $f /tmp/dump/$n & done
And this did trigger the retry path:
12771 writes, where it triggered the again loop 1635 times.
Thus, it triggered about 12.8% of the time under a very intensive stress.
That's still faster than allocation, not to mention that if one is tracing
both system calls and allocations, allocating here would start to dirty
the trace.
-- Steve
end of thread, other threads:[~2025-09-27 14:43 UTC | newest]
Thread overview: 13+ messages
2025-09-23 13:04 [PATCH v2 0/8] tracing: Show contents of syscall trace event user space fields Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 1/8] tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE() Steven Rostedt
2025-09-23 13:04 ` [PATCH v2 2/8] tracing: Have syscall trace events show "0x" for values greater than 10 Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 3/8] tracing: Have syscall trace events read user space string Steven Rostedt
2025-09-25 7:26 ` Peter Zijlstra
2025-09-25 11:15 ` Steven Rostedt
2025-09-27 14:43 ` Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 4/8] tracing: Have system call events record user array data Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 5/8] tracing: Display some syscall arrays as strings Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 6/8] tracing: Allow syscall trace events to read more than one user parameter Steven Rostedt
2025-09-23 13:05 ` [PATCH v2 7/8] tracing: Add syscall_user_buf_size to limit amount written Steven Rostedt
2025-09-24 9:49 ` kernel test robot
2025-09-23 13:05 ` [PATCH v2 8/8] tracing: Show printable characters in syscall arrays Steven Rostedt