* [PATCH 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
` (8 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
In preparation for allowing system call tracepoints to handle page
faults, introduce TRACE_EVENT_SYSCALL to declare the sys_enter/sys_exit
tracepoints.
Emit the static inlines register_trace_syscall_##name for events
declared with TRACE_EVENT_SYSCALL, allowing source-level validation
that only probes meant to handle system call entry/exit events are
registered to them.
Move the common code between __DECLARE_TRACE and __DECLARE_TRACE_SYSCALL
into __DECLARE_TRACE_COMMON.
This change is not meant to alter the generated code; it only prepares
for the modifications that follow.
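[Editorial aside, not part of the patch: the source-level validation described above relies on ordinary C token pasting — the macro emits one `register_trace_syscall_<name>()` static inline per event, so a probe with the wrong signature fails to compile. A minimal user-space sketch of the same technique, with a hypothetical `struct tp` standing in for `struct tracepoint`:]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mini-registry standing in for struct tracepoint. */
struct tp {
	const char *name;
	void (*probe)(void);
};

/*
 * Token-pasting sketch of how DECLARE_TRACE-style macros emit one
 * register_trace_<name>() helper per declared event: the helper's
 * parameter type pins down the probe signature at compile time.
 */
#define DECLARE_TP(name)						\
	static struct tp __tp_##name = { #name, 0 };			\
	static inline int register_trace_##name(void (*probe)(void))	\
	{								\
		__tp_##name.probe = probe;				\
		return 0;						\
	}

DECLARE_TP(sys_enter)

static void my_probe(void) { }
```

After `register_trace_sys_enter(my_probe)`, the generated `__tp_sys_enter` holds the probe; trying to register `my_probe` on an event declared with a different prototype would not compile.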
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/linux/tracepoint.h | 66 +++++++++++++++++++++++++--------
include/trace/bpf_probe.h | 3 ++
include/trace/define_trace.h | 5 +++
include/trace/events/syscalls.h | 4 +-
include/trace/perf.h | 3 ++
include/trace/trace_events.h | 28 ++++++++++++++
kernel/entry/common.c | 4 +-
kernel/trace/trace_syscalls.c | 8 ++--
8 files changed, 98 insertions(+), 23 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 689b6d71590e..b2cfe6a9097c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -249,10 +249,28 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* site if it is not watching, as it will need to be active when the
* tracepoint is enabled.
*/
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+#define __DECLARE_TRACE_COMMON(name, proto, args, cond, data_proto) \
extern int __traceiter_##name(data_proto); \
DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name); \
extern struct tracepoint __tracepoint_##name; \
+ static inline int \
+ unregister_trace_##name(void (*probe)(data_proto), void *data) \
+ { \
+ return tracepoint_probe_unregister(&__tracepoint_##name,\
+ (void *)probe, data); \
+ } \
+ static inline void \
+ check_trace_callback_type_##name(void (*cb)(data_proto)) \
+ { \
+ } \
+ static inline bool \
+ trace_##name##_enabled(void) \
+ { \
+ return static_key_false(&__tracepoint_##name.key); \
+ }
+
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto) \
+ __DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), cond, PARAMS(data_proto)) \
static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
@@ -264,8 +282,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
"RCU not watching for tracepoint"); \
} \
} \
- __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
- PARAMS(cond)) \
+ static inline void trace_##name##_rcuidle(proto) \
+ { \
+ if (static_key_false(&__tracepoint_##name.key)) \
+ __DO_TRACE(name, \
+ TP_ARGS(args), \
+ TP_CONDITION(cond), 1); \
+ } \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
{ \
@@ -278,21 +301,26 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
{ \
return tracepoint_probe_register_prio(&__tracepoint_##name, \
(void *)probe, data, prio); \
- } \
- static inline int \
- unregister_trace_##name(void (*probe)(data_proto), void *data) \
- { \
- return tracepoint_probe_unregister(&__tracepoint_##name,\
- (void *)probe, data); \
- } \
- static inline void \
- check_trace_callback_type_##name(void (*cb)(data_proto)) \
+ }
+
+#define __DECLARE_TRACE_SYSCALL(name, proto, args, cond, data_proto) \
+ __DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), cond, PARAMS(data_proto)) \
+ static inline void trace_syscall_##name(proto) \
{ \
+ if (static_key_false(&__tracepoint_##name.key)) \
+ __DO_TRACE(name, \
+ TP_ARGS(args), \
+ TP_CONDITION(cond), 0); \
+ if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
+ WARN_ONCE(!rcu_is_watching(), \
+ "RCU not watching for tracepoint"); \
+ } \
} \
- static inline bool \
- trace_##name##_enabled(void) \
+ static inline int \
+ register_trace_syscall_##name(void (*probe)(data_proto), void *data) \
{ \
- return static_key_false(&__tracepoint_##name.key); \
+ return tracepoint_probe_register(&__tracepoint_##name, \
+ (void *)probe, data); \
}
/*
@@ -440,6 +468,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
PARAMS(void *__data, proto))
+#define DECLARE_TRACE_SYSCALL(name, proto, args) \
+ __DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args), \
+ cpu_online(raw_smp_processor_id()), \
+ PARAMS(void *__data, proto))
+
#define TRACE_EVENT_FLAGS(event, flag)
#define TRACE_EVENT_PERF_PERM(event, expr...)
@@ -577,6 +610,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
struct, assign, print) \
DECLARE_TRACE_CONDITION(name, PARAMS(proto), \
PARAMS(args), PARAMS(cond))
+#define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, \
+ print, reg, unreg) \
+ DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
#define TRACE_EVENT_FLAGS(event, flag)
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index a2ea11cc912e..c85bbce5aaa5 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -53,6 +53,9 @@ __bpf_trace_##call(void *__data, proto) \
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
/*
* This part is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 00723935dcc7..ff5fa17a6259 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -46,6 +46,10 @@
assign, print, reg, unreg) \
DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
+#undef TRACE_EVENT_SYSCALL
+#define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
+ DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
+
#undef TRACE_EVENT_NOP
#define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
@@ -107,6 +111,7 @@
#undef TRACE_EVENT
#undef TRACE_EVENT_FN
#undef TRACE_EVENT_FN_COND
+#undef TRACE_EVENT_SYSCALL
#undef TRACE_EVENT_CONDITION
#undef TRACE_EVENT_NOP
#undef DEFINE_EVENT_NOP
diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..f31ff446b468 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_SYSCALL(sys_enter,
TP_PROTO(struct pt_regs *regs, long id),
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_SYSCALL(sys_exit,
TP_PROTO(struct pt_regs *regs, long ret),
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 2c11181c82e0..ded997af481e 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -55,6 +55,9 @@ perf_trace_##call(void *__data, proto) \
head, __task); \
}
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
/*
* This part is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index c2f9cabf154d..8bcbb9ee44de 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -45,6 +45,16 @@
PARAMS(print)); \
DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+#undef TRACE_EVENT_SYSCALL
+#define TRACE_EVENT_SYSCALL(name, proto, args, tstruct, assign, print, reg, unreg) \
+ DECLARE_EVENT_SYSCALL_CLASS(name, \
+ PARAMS(proto), \
+ PARAMS(args), \
+ PARAMS(tstruct), \
+ PARAMS(assign), \
+ PARAMS(print)); \
+ DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
#include "stages/stage1_struct_define.h"
#undef DECLARE_EVENT_CLASS
@@ -57,6 +67,9 @@
\
static struct trace_event_class event_class_##name;
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
static struct trace_event_call __used \
@@ -117,6 +130,9 @@
tstruct; \
};
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args)
@@ -208,6 +224,9 @@ static struct trace_event_functions trace_event_type_funcs_##call = { \
.trace = trace_raw_output_##call, \
};
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, call, proto, args, print) \
static notrace enum print_line_t \
@@ -265,6 +284,9 @@ static inline notrace int trace_event_get_offsets_##call( \
return __data_size; \
}
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
/*
@@ -409,6 +431,9 @@ trace_event_raw_event_##call(void *__data, proto) \
* fail to compile unless it too is updated.
*/
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
static inline void ftrace_test_probe_##call(void) \
@@ -434,6 +459,9 @@ static struct trace_event_class __used __refdata event_class_##call = { \
_TRACE_PERF_INIT(call) \
};
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
\
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc38588..d08472421d0e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -58,7 +58,7 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
syscall = syscall_get_nr(current, regs);
if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) {
- trace_sys_enter(regs, syscall);
+ trace_syscall_sys_enter(regs, syscall);
/*
* Probes or BPF hooks in the tracepoint may have changed the
* system call number as well.
@@ -166,7 +166,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
audit_syscall_exit(regs);
if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
- trace_sys_exit(regs, syscall_get_return_value(current, regs));
+ trace_syscall_sys_exit(regs, syscall_get_return_value(current, regs));
step = report_single_step(work);
if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..067f8e2b930f 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -377,7 +377,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_enter)
- ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+ ret = register_trace_syscall_sys_enter(ftrace_syscall_enter, tr);
if (!ret) {
rcu_assign_pointer(tr->enter_syscall_files[num], file);
tr->sys_refcount_enter++;
@@ -415,7 +415,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
return -ENOSYS;
mutex_lock(&syscall_trace_lock);
if (!tr->sys_refcount_exit)
- ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+ ret = register_trace_syscall_sys_exit(ftrace_syscall_exit, tr);
if (!ret) {
rcu_assign_pointer(tr->exit_syscall_files[num], file);
tr->sys_refcount_exit++;
@@ -631,7 +631,7 @@ static int perf_sysenter_enable(struct trace_event_call *call)
mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_enter)
- ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+ ret = register_trace_syscall_sys_enter(perf_syscall_enter, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall entry trace point");
} else {
@@ -728,7 +728,7 @@ static int perf_sysexit_enable(struct trace_event_call *call)
mutex_lock(&syscall_trace_lock);
if (!sys_perf_refcount_exit)
- ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+ ret = register_trace_syscall_sys_exit(perf_syscall_exit, NULL);
if (ret) {
pr_info("event trace: Could not activate syscall exit trace point");
} else {
--
2.39.2
* [PATCH 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-16 18:47 ` Masami Hiramatsu
2024-09-09 20:16 ` [PATCH 3/8] tracing/perf: " Mathieu Desnoyers
` (7 subsequent siblings)
9 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that ftrace can handle this change by
explicitly disabling preemption within the ftrace system call tracepoint
probes to respect the current expectations within ftrace ring buffer
code.
This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.
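[Editorial aside, not part of the patch: the `guard(preempt_notrace)()` used in the diff below comes from the kernel's scope-guard infrastructure (`linux/cleanup.h`), built on the compiler's `__attribute__((cleanup))`. A simplified user-space sketch of the pattern, with a `preempt_depth` counter standing in for the real preempt count:]

```c
#include <assert.h>

/* Stand-in for the kernel's per-CPU preempt count. */
static int preempt_depth;

static inline void preempt_disable_sketch(void) { preempt_depth++; }
static inline void preempt_enable_sketch(void)  { preempt_depth--; }

/* Cleanup handler: runs automatically when the guarded scope exits. */
static inline void guard_exit(int *unused)
{
	(void)unused;
	preempt_enable_sketch();
}

/*
 * Sketch of guard(preempt_notrace)(): disable preemption now, and
 * have the compiler re-enable it on every exit path of the scope.
 */
#define guard_preempt()							\
	int __guard __attribute__((cleanup(guard_exit))) =		\
		(preempt_disable_sketch(), 0)

static int probe_body(void)
{
	guard_preempt();	/* disabled for the rest of this scope */
	return preempt_depth;	/* re-enabled automatically on return */
}
```

The guard keeps the probe body balanced without explicit enable calls on each return path, which is why the patch can wrap the whole probe in one statement.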
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/trace_events.h | 38 ++++++++++++++++++++++++++++-------
kernel/trace/trace_syscalls.c | 12 +++++++++++
2 files changed, 43 insertions(+), 7 deletions(-)
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8bcbb9ee44de..0228d9ed94a3 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -263,6 +263,9 @@ static struct trace_event_fields trace_event_fields_##call[] = { \
tstruct \
{} };
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print)
@@ -396,11 +399,11 @@ static inline notrace int trace_event_get_offsets_##call( \
#include "stages/stage6_event_callback.h"
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
- \
+
+#undef __DECLARE_EVENT_CLASS
+#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace void \
-trace_event_raw_event_##call(void *__data, proto) \
+do_trace_event_raw_event_##call(void *__data, proto) \
{ \
struct trace_event_file *trace_file = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
@@ -425,15 +428,34 @@ trace_event_raw_event_##call(void *__data, proto) \
\
trace_event_buffer_commit(&fbuffer); \
}
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+ PARAMS(assign), PARAMS(print)) \
+static notrace void \
+trace_event_raw_event_##call(void *__data, proto) \
+{ \
+ do_trace_event_raw_event_##call(__data, args); \
+}
+
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+ PARAMS(assign), PARAMS(print)) \
+static notrace void \
+trace_event_raw_event_##call(void *__data, proto) \
+{ \
+ guard(preempt_notrace)(); \
+ do_trace_event_raw_event_##call(__data, args); \
+}
+
/*
* The ftrace_test_probe is compiled out, it is only here as a build time check
* to make sure that if the tracepoint handling changes, the ftrace probe will
* fail to compile unless it too is updated.
*/
-#undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
-
#undef DEFINE_EVENT
#define DEFINE_EVENT(template, call, proto, args) \
static inline void ftrace_test_probe_##call(void) \
@@ -443,6 +465,8 @@ static inline void ftrace_test_probe_##call(void) \
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+#undef __DECLARE_EVENT_CLASS
+
#include "stages/stage7_class_define.h"
#undef DECLARE_EVENT_CLASS
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 067f8e2b930f..abf0e0b7cd0b 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
int syscall_nr;
int size;
+ /*
+ * Syscall probe called with preemption enabled, but the ring
+ * buffer and per-cpu data require preemption to be disabled.
+ */
+ guard(preempt_notrace)();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
@@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
struct trace_event_buffer fbuffer;
int syscall_nr;
+ /*
+ * Syscall probe called with preemption enabled, but the ring
+ * buffer and per-cpu data require preemption to be disabled.
+ */
+ guard(preempt_notrace)();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
--
2.39.2
* Re: [PATCH 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
2024-09-09 20:16 ` [PATCH 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
@ 2024-09-16 18:47 ` Masami Hiramatsu
0 siblings, 0 replies; 17+ messages in thread
From: Masami Hiramatsu @ 2024-09-16 18:47 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
On Mon, 9 Sep 2024 16:16:46 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that ftrace can handle this change by
> explicitly disabling preemption within the ftrace system call tracepoint
> probes to respect the current expectations within ftrace ring buffer
> code.
>
> This change does not yet allow ftrace to take page faults per se within
> its probe, but allows its existing probes to adapt to the upcoming
> change.
OK, this looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thank you,
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
> include/trace/trace_events.h | 38 ++++++++++++++++++++++++++++-------
> kernel/trace/trace_syscalls.c | 12 +++++++++++
> 2 files changed, 43 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 8bcbb9ee44de..0228d9ed94a3 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -263,6 +263,9 @@ static struct trace_event_fields trace_event_fields_##call[] = { \
> tstruct \
> {} };
>
> +#undef DECLARE_EVENT_SYSCALL_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> +
> #undef DEFINE_EVENT_PRINT
> #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
>
> @@ -396,11 +399,11 @@ static inline notrace int trace_event_get_offsets_##call( \
>
> #include "stages/stage6_event_callback.h"
>
> -#undef DECLARE_EVENT_CLASS
> -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> - \
> +
> +#undef __DECLARE_EVENT_CLASS
> +#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> static notrace void \
> -trace_event_raw_event_##call(void *__data, proto) \
> +do_trace_event_raw_event_##call(void *__data, proto) \
> { \
> struct trace_event_file *trace_file = __data; \
> struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
> @@ -425,15 +428,34 @@ trace_event_raw_event_##call(void *__data, proto) \
> \
> trace_event_buffer_commit(&fbuffer); \
> }
> +
> +#undef DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> +__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
> + PARAMS(assign), PARAMS(print)) \
> +static notrace void \
> +trace_event_raw_event_##call(void *__data, proto) \
> +{ \
> + do_trace_event_raw_event_##call(__data, args); \
> +}
> +
> +#undef DECLARE_EVENT_SYSCALL_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
> +__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
> + PARAMS(assign), PARAMS(print)) \
> +static notrace void \
> +trace_event_raw_event_##call(void *__data, proto) \
> +{ \
> + guard(preempt_notrace)(); \
> + do_trace_event_raw_event_##call(__data, args); \
> +}
> +
> /*
> * The ftrace_test_probe is compiled out, it is only here as a build time check
> * to make sure that if the tracepoint handling changes, the ftrace probe will
> * fail to compile unless it too is updated.
> */
>
> -#undef DECLARE_EVENT_SYSCALL_CLASS
> -#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> -
> #undef DEFINE_EVENT
> #define DEFINE_EVENT(template, call, proto, args) \
> static inline void ftrace_test_probe_##call(void) \
> @@ -443,6 +465,8 @@ static inline void ftrace_test_probe_##call(void) \
>
> #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
>
> +#undef __DECLARE_EVENT_CLASS
> +
> #include "stages/stage7_class_define.h"
>
> #undef DECLARE_EVENT_CLASS
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 067f8e2b930f..abf0e0b7cd0b 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> int syscall_nr;
> int size;
>
> + /*
> + * Syscall probe called with preemption enabled, but the ring
> + * buffer and per-cpu data require preemption to be disabled.
> + */
> + guard(preempt_notrace)();
> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> return;
> @@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
> struct trace_event_buffer fbuffer;
> int syscall_nr;
>
> + /*
> + * Syscall probe called with preemption enabled, but the ring
> + * buffer and per-cpu data require preemption to be disabled.
> + */
> + guard(preempt_notrace)();
> +
> syscall_nr = trace_get_syscall_nr(current, regs);
> if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> return;
> --
> 2.39.2
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* [PATCH 3/8] tracing/perf: guard syscall probe with preempt_notrace
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 4/8] tracing/bpf: " Mathieu Desnoyers
` (6 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that perf can handle this change by
explicitly disabling preemption within the perf system call tracepoint
probes to respect the current expectations within perf ring buffer code.
This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.
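[Editorial aside, not part of the patch: the diff below declares unused `__count`/`__task` locals in the `perf_trace_##call` wrappers. This is needed because `__perf_count(x)` and `__perf_task(t)` expand to assignments into those locals, so any wrapper that forwards `@args` must provide them even if it never reads them. A simplified user-space sketch of that side-effect-absorbing pattern:]

```c
#include <assert.h>

/* Like the kernel's __perf_count(): expands to an assignment. */
#define __perf_count(c) (__count = (c))

static unsigned long long last_count;

static void do_perf_trace_sketch(unsigned long long count)
{
	last_count = count;
}

/*
 * Wrapper forwarding its arguments: the expansion of "args" may
 * contain __perf_count(...), so __count must exist here even
 * though the wrapper itself never uses it.
 */
static void perf_trace_sketch(unsigned long long n)
{
	unsigned long long __count __attribute__((unused));

	do_perf_trace_sketch(__perf_count(n * 2));
}
```

The assignment's value is still passed through to the callee, so splitting the class into a `do_*` helper plus a thin wrapper does not change behavior.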
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/perf.h | 41 +++++++++++++++++++++++++++++++----
kernel/trace/trace_syscalls.c | 12 ++++++++++
2 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/include/trace/perf.h b/include/trace/perf.h
index ded997af481e..5650c1bad088 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -12,10 +12,10 @@
#undef __perf_task
#define __perf_task(t) (__task = (t))
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#undef __DECLARE_EVENT_CLASS
+#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace void \
-perf_trace_##call(void *__data, proto) \
+do_perf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
@@ -55,8 +55,38 @@ perf_trace_##call(void *__data, proto) \
head, __task); \
}
+/*
+ * Define unused __count and __task variables to use @args to pass
+ * arguments to do_perf_trace_##call. This is needed because the
+ * macros __perf_count and __perf_task introduce the side-effect to
+ * store copies into those local variables.
+ */
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+ PARAMS(assign), PARAMS(print)) \
+static notrace void \
+perf_trace_##call(void *__data, proto) \
+{ \
+ u64 __count __attribute__((unused)); \
+ struct task_struct *__task __attribute__((unused)); \
+ \
+ do_perf_trace_##call(__data, args); \
+}
+
#undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+ PARAMS(assign), PARAMS(print)) \
+static notrace void \
+perf_trace_##call(void *__data, proto) \
+{ \
+ u64 __count __attribute__((unused)); \
+ struct task_struct *__task __attribute__((unused)); \
+ \
+ guard(preempt_notrace)(); \
+ do_perf_trace_##call(__data, args); \
+}
/*
* This part is compiled out, it is only here as a build time check
@@ -76,4 +106,7 @@ static inline void perf_test_probe_##call(void) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+
+#undef __DECLARE_EVENT_CLASS
+
#endif /* CONFIG_PERF_EVENTS */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index abf0e0b7cd0b..a3d8ac00793e 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -594,6 +594,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
int rctx;
int size;
+ /*
+ * Syscall probe called with preemption enabled, but the ring
+ * buffer and per-cpu data require preemption to be disabled.
+ */
+ guard(preempt_notrace)();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
@@ -694,6 +700,12 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
int rctx;
int size;
+ /*
+ * Syscall probe called with preemption enabled, but the ring
+ * buffer and per-cpu data require preemption to be disabled.
+ */
+ guard(preempt_notrace)();
+
syscall_nr = trace_get_syscall_nr(current, regs);
if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
return;
--
2.39.2
* [PATCH 4/8] tracing/bpf: guard syscall probe with preempt_notrace
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (2 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 3/8] tracing/perf: " Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-10 0:03 ` Andrii Nakryiko
2024-09-09 20:16 ` [PATCH 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (5 subsequent siblings)
9 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that bpf can handle this change by
explicitly disabling preemption within the bpf system call tracepoint
probes to respect the current expectations within bpf tracing code.
This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to adapt to the upcoming change.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/bpf_probe.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index c85bbce5aaa5..211b98d45fc6 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -53,8 +53,17 @@ __bpf_trace_##call(void *__data, proto) \
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args) \
+static notrace void \
+__bpf_trace_##call(void *__data, proto) \
+{ \
+ guard(preempt_notrace)(); \
+ CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
+}
+
#undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
+ __BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
/*
* This part is compiled out, it is only here as a build time check
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 4/8] tracing/bpf: guard syscall probe with preempt_notrace
2024-09-09 20:16 ` [PATCH 4/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-09-10 0:03 ` Andrii Nakryiko
0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 0:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel,
Michael Jeanson
On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that bpf can handle this change by
> explicitly disabling preemption within the bpf system call tracepoint
> probes to respect the current expectations within bpf tracing code.
>
> This change does not yet allow bpf to take page faults per se within its
> probe, but allows its existing probes to adapt to the upcoming change.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
> include/trace/bpf_probe.h | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
LGTM.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index c85bbce5aaa5..211b98d45fc6 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -53,8 +53,17 @@ __bpf_trace_##call(void *__data, proto) \
> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
>
> +#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args) \
> +static notrace void \
> +__bpf_trace_##call(void *__data, proto) \
> +{ \
> + guard(preempt_notrace)(); \
> + CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> +}
> +
> #undef DECLARE_EVENT_SYSCALL_CLASS
> -#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
> + __BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
>
> /*
> * This part is compiled out, it is only here as a build time check
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 5/8] tracing: Allow system call tracepoints to handle page faults
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (3 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 4/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
` (4 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
Use Tasks Trace RCU to protect iteration of system call enter/exit
tracepoint probes to allow those probes to handle page faults.
In preparation for this change, all tracers registering to system call
enter/exit tracepoints should expect those to be called with preemption
enabled.
This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/linux/tracepoint.h | 25 +++++++++++++++++--------
init/Kconfig | 1 +
2 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index b2cfe6a9097c..005ede3b1afa 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -18,6 +18,7 @@
#include <linux/types.h>
#include <linux/cpumask.h>
#include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
#include <linux/tracepoint-defs.h>
#include <linux/static_call.h>
@@ -90,6 +91,7 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
#ifdef CONFIG_TRACEPOINTS
static inline void tracepoint_synchronize_unregister(void)
{
+ synchronize_rcu_tasks_trace();
synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();
}
@@ -192,7 +194,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* it_func[0] is never NULL because there is at least one element in the array
* when the array itself is non NULL.
*/
-#define __DO_TRACE(name, args, cond, rcuidle) \
+#define __DO_TRACE(name, args, cond, rcuidle, syscall) \
do { \
int __maybe_unused __idx = 0; \
\
@@ -203,8 +205,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
"Bad RCU usage for tracepoint")) \
return; \
\
- /* keep srcu and sched-rcu usage consistent */ \
- preempt_disable_notrace(); \
+ if (syscall) { \
+ rcu_read_lock_trace(); \
+ } else { \
+ /* keep srcu and sched-rcu usage consistent */ \
+ preempt_disable_notrace(); \
+ } \
\
/* \
* For rcuidle callers, use srcu since sched-rcu \
@@ -222,7 +228,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
} \
\
- preempt_enable_notrace(); \
+ if (syscall) \
+ rcu_read_unlock_trace(); \
+ else \
+ preempt_enable_notrace(); \
} while (0)
#ifndef MODULE
@@ -232,7 +241,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 1); \
+ TP_CONDITION(cond), 1, 0); \
}
#else
#define __DECLARE_TRACE_RCU(name, proto, args, cond)
@@ -276,7 +285,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 0); \
+ TP_CONDITION(cond), 0, 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
WARN_ONCE(!rcu_is_watching(), \
"RCU not watching for tracepoint"); \
@@ -287,7 +296,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 1); \
+ TP_CONDITION(cond), 1, 0); \
} \
static inline int \
register_trace_##name(void (*probe)(data_proto), void *data) \
@@ -310,7 +319,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (static_key_false(&__tracepoint_##name.key)) \
__DO_TRACE(name, \
TP_ARGS(args), \
- TP_CONDITION(cond), 0); \
+ TP_CONDITION(cond), 0, 1); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
WARN_ONCE(!rcu_is_watching(), \
"RCU not watching for tracepoint"); \
diff --git a/init/Kconfig b/init/Kconfig
index d8a971b804d3..c854b2887e7f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1937,6 +1937,7 @@ config BINDGEN_VERSION_TEXT
#
config TRACEPOINTS
bool
+ select TASKS_TRACE_RCU
source "kernel/Kconfig.kexec"
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 6/8] tracing/ftrace: Add might_fault check to syscall probes
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (4 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 7/8] tracing/perf: " Mathieu Desnoyers
` (3 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
Add a might_fault() check to validate that the ftrace sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/trace_events.h | 1 +
kernel/trace/trace_syscalls.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 0228d9ed94a3..e0d4850b0d77 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -446,6 +446,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
static notrace void \
trace_event_raw_event_##call(void *__data, proto) \
{ \
+ might_fault(); \
guard(preempt_notrace)(); \
do_trace_event_raw_event_##call(__data, args); \
}
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index a3d8ac00793e..0430890cbb42 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -303,6 +303,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
+ might_fault();
guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
@@ -348,6 +349,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
+ might_fault();
guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 7/8] tracing/perf: Add might_fault check to syscall probes
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (5 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-09 20:16 ` [PATCH 8/8] tracing/bpf: " Mathieu Desnoyers
` (2 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
Add a might_fault() check to validate that the perf sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/perf.h | 1 +
kernel/trace/trace_syscalls.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 5650c1bad088..321bfd7919f6 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -84,6 +84,7 @@ perf_trace_##call(void *__data, proto) \
u64 __count __attribute__((unused)); \
struct task_struct *__task __attribute__((unused)); \
\
+ might_fault(); \
guard(preempt_notrace)(); \
do_perf_trace_##call(__data, args); \
}
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 0430890cbb42..53faa791c735 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -600,6 +600,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
+ might_fault();
guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
@@ -706,6 +707,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
* Syscall probe called with preemption enabled, but the ring
* buffer and per-cpu data require preemption to be disabled.
*/
+ might_fault();
guard(preempt_notrace)();
syscall_nr = trace_get_syscall_nr(current, regs);
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 8/8] tracing/bpf: Add might_fault check to syscall probes
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (6 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 7/8] tracing/perf: " Mathieu Desnoyers
@ 2024-09-09 20:16 ` Mathieu Desnoyers
2024-09-10 0:02 ` Andrii Nakryiko
2024-09-09 23:53 ` [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Andrii Nakryiko
2024-09-16 19:49 ` Masami Hiramatsu
9 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-09 20:16 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Michael Jeanson
Add a might_fault() check to validate that the bpf sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
include/trace/bpf_probe.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index 211b98d45fc6..099df5c3e38a 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -57,6 +57,7 @@ __bpf_trace_##call(void *__data, proto) \
static notrace void \
__bpf_trace_##call(void *__data, proto) \
{ \
+ might_fault(); \
guard(preempt_notrace)(); \
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
}
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 8/8] tracing/bpf: Add might_fault check to syscall probes
2024-09-09 20:16 ` [PATCH 8/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-09-10 0:02 ` Andrii Nakryiko
0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2024-09-10 0:02 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel,
Michael Jeanson
On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Add a might_fault() check to validate that the bpf sys_enter/sys_exit
> probe callbacks are indeed called from a context where page faults can
> be handled.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
> include/trace/bpf_probe.h | 1 +
> 1 file changed, 1 insertion(+)
>
LGTM.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index 211b98d45fc6..099df5c3e38a 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -57,6 +57,7 @@ __bpf_trace_##call(void *__data, proto) \
> static notrace void \
> __bpf_trace_##call(void *__data, proto) \
> { \
> + might_fault(); \
> guard(preempt_notrace)(); \
> CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> }
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (7 preceding siblings ...)
2024-09-09 20:16 ` [PATCH 8/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-09-09 23:53 ` Andrii Nakryiko
2024-09-10 0:36 ` Mathieu Desnoyers
2024-09-16 19:49 ` Masami Hiramatsu
9 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2024-09-09 23:53 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel
On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Wire up the system call tracepoints with Tasks Trace RCU to allow
> the ftrace, perf, and eBPF tracers to handle page faults.
>
> This series does the initial wire-up allowing tracers to handle page
> faults, but leaves out the actual handling of said page faults as future
> work.
>
> This series was compile and runtime tested with ftrace and perf syscall
> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> generated code to validate that the intended probes are used for raw
> syscall tracing. The might_fault() added within those probes validate
> that they are called from a context where handling a page fault is OK.
>
> For ebpf, this series is compile-tested only.
What tree/branch was this based on? I can't apply it cleanly anywhere I tried...
>
> This series replaces the "Faultable Tracepoints v6" series found at [1].
>
> Thanks,
>
> Mathieu
>
> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: linux-trace-kernel@vger.kernel.org
>
> Mathieu Desnoyers (8):
> tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
> tracing/ftrace: guard syscall probe with preempt_notrace
> tracing/perf: guard syscall probe with preempt_notrace
> tracing/bpf: guard syscall probe with preempt_notrace
> tracing: Allow system call tracepoints to handle page faults
> tracing/ftrace: Add might_fault check to syscall probes
> tracing/perf: Add might_fault check to syscall probes
> tracing/bpf: Add might_fault check to syscall probes
>
> include/linux/tracepoint.h | 87 +++++++++++++++++++++++++--------
> include/trace/bpf_probe.h | 13 +++++
> include/trace/define_trace.h | 5 ++
> include/trace/events/syscalls.h | 4 +-
> include/trace/perf.h | 43 ++++++++++++++--
> include/trace/trace_events.h | 61 +++++++++++++++++++++--
> init/Kconfig | 1 +
> kernel/entry/common.c | 4 +-
> kernel/trace/trace_syscalls.c | 36 ++++++++++++--
> 9 files changed, 218 insertions(+), 36 deletions(-)
>
> --
> 2.39.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
2024-09-09 23:53 ` [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Andrii Nakryiko
@ 2024-09-10 0:36 ` Mathieu Desnoyers
2024-09-11 23:08 ` Andrii Nakryiko
0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-10 0:36 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel
On 2024-09-09 19:53, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Wire up the system call tracepoints with Tasks Trace RCU to allow
>> the ftrace, perf, and eBPF tracers to handle page faults.
>>
>> This series does the initial wire-up allowing tracers to handle page
>> faults, but leaves out the actual handling of said page faults as future
>> work.
>>
>> This series was compile and runtime tested with ftrace and perf syscall
>> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
>> generated code to validate that the intended probes are used for raw
>> syscall tracing. The might_fault() added within those probes validate
>> that they are called from a context where handling a page fault is OK.
>>
>> For ebpf, this series is compile-tested only.
>
> What tree/branch was this based on? I can't apply it cleanly anywhere I tried...
This series was based on tag v6.10.6
Sorry I should have included this information in patch 0.
Thanks,
Mathieu
>
>>
>> This series replaces the "Faultable Tracepoints v6" series found at [1].
>>
>> Thanks,
>>
>> Mathieu
>>
>> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Cc: linux-trace-kernel@vger.kernel.org
>>
>> Mathieu Desnoyers (8):
>> tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
>> tracing/ftrace: guard syscall probe with preempt_notrace
>> tracing/perf: guard syscall probe with preempt_notrace
>> tracing/bpf: guard syscall probe with preempt_notrace
>> tracing: Allow system call tracepoints to handle page faults
>> tracing/ftrace: Add might_fault check to syscall probes
>> tracing/perf: Add might_fault check to syscall probes
>> tracing/bpf: Add might_fault check to syscall probes
>>
>> include/linux/tracepoint.h | 87 +++++++++++++++++++++++++--------
>> include/trace/bpf_probe.h | 13 +++++
>> include/trace/define_trace.h | 5 ++
>> include/trace/events/syscalls.h | 4 +-
>> include/trace/perf.h | 43 ++++++++++++++--
>> include/trace/trace_events.h | 61 +++++++++++++++++++++--
>> init/Kconfig | 1 +
>> kernel/entry/common.c | 4 +-
>> kernel/trace/trace_syscalls.c | 36 ++++++++++++--
>> 9 files changed, 218 insertions(+), 36 deletions(-)
>>
>> --
>> 2.39.2
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
2024-09-10 0:36 ` Mathieu Desnoyers
@ 2024-09-11 23:08 ` Andrii Nakryiko
0 siblings, 0 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2024-09-11 23:08 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel
On Mon, Sep 9, 2024 at 5:36 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-09-09 19:53, Andrii Nakryiko wrote:
> > On Mon, Sep 9, 2024 at 1:17 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> Wire up the system call tracepoints with Tasks Trace RCU to allow
> >> the ftrace, perf, and eBPF tracers to handle page faults.
> >>
> >> This series does the initial wire-up allowing tracers to handle page
> >> faults, but leaves out the actual handling of said page faults as future
> >> work.
> >>
> >> This series was compile and runtime tested with ftrace and perf syscall
> >> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> >> generated code to validate that the intended probes are used for raw
> >> syscall tracing. The might_fault() added within those probes validate
> >> that they are called from a context where handling a page fault is OK.
> >>
> >> For ebpf, this series is compile-tested only.
> >
> > What tree/branch was this based on? I can't apply it cleanly anywhere I tried...
>
> This series was based on tag v6.10.6
>
Didn't find 6.10.6, but it applied cleanly to 6.10.3. I tested that
BPF parts work:
Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
> Sorry I should have included this information in patch 0.
>
> Thanks,
>
> Mathieu
>
> >
> >>
> >> This series replaces the "Faultable Tracepoints v6" series found at [1].
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Alexei Starovoitov <ast@kernel.org>
> >> Cc: Yonghong Song <yhs@fb.com>
> >> Cc: Paul E. McKenney <paulmck@kernel.org>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >> Cc: Mark Rutland <mark.rutland@arm.com>
> >> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >> Cc: Namhyung Kim <namhyung@kernel.org>
> >> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> >> Cc: bpf@vger.kernel.org
> >> Cc: Joel Fernandes <joel@joelfernandes.org>
> >> Cc: linux-trace-kernel@vger.kernel.org
> >>
> >> Mathieu Desnoyers (8):
> >> tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
> >> tracing/ftrace: guard syscall probe with preempt_notrace
> >> tracing/perf: guard syscall probe with preempt_notrace
> >> tracing/bpf: guard syscall probe with preempt_notrace
> >> tracing: Allow system call tracepoints to handle page faults
> >> tracing/ftrace: Add might_fault check to syscall probes
> >> tracing/perf: Add might_fault check to syscall probes
> >> tracing/bpf: Add might_fault check to syscall probes
> >>
> >> include/linux/tracepoint.h | 87 +++++++++++++++++++++++++--------
> >> include/trace/bpf_probe.h | 13 +++++
> >> include/trace/define_trace.h | 5 ++
> >> include/trace/events/syscalls.h | 4 +-
> >> include/trace/perf.h | 43 ++++++++++++++--
> >> include/trace/trace_events.h | 61 +++++++++++++++++++++--
> >> init/Kconfig | 1 +
> >> kernel/entry/common.c | 4 +-
> >> kernel/trace/trace_syscalls.c | 36 ++++++++++++--
> >> 9 files changed, 218 insertions(+), 36 deletions(-)
> >>
> >> --
> >> 2.39.2
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
2024-09-09 20:16 [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
` (8 preceding siblings ...)
2024-09-09 23:53 ` [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults Andrii Nakryiko
@ 2024-09-16 19:49 ` Masami Hiramatsu
2024-09-17 9:54 ` Mathieu Desnoyers
9 siblings, 1 reply; 17+ messages in thread
From: Masami Hiramatsu @ 2024-09-16 19:49 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel
On Mon, 9 Sep 2024 16:16:44 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Wire up the system call tracepoints with Tasks Trace RCU to allow
> the ftrace, perf, and eBPF tracers to handle page faults.
>
> This series does the initial wire-up allowing tracers to handle page
> faults, but leaves out the actual handling of said page faults as future
> work.
>
> This series was compile- and runtime-tested with ftrace and perf syscall
> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
> generated code to validate that the intended probes are used for raw
> syscall tracing. The might_fault() added within those probes validates
> that they are called from a context where handling a page fault is OK.
I think this series itself is valuable.
However, I'm still not sure why ftrace needs to handle page faults.
This allows the syscall trace events themselves to handle page faults, but
the raw-syscall/syscall events only access registers, right?
I think that page faults happen only when dereferencing those registers
as pointers to data structures, and currently that is done by probes
like eprobe and fprobe. In order to handle faults in those probes, we
need to change how they write data into the per-cpu ring buffer.
Currently, those probes reserve an entry in the ring buffer, write the
dereferenced data into the entry, and commit it. Preemption stays
disabled throughout this reserve-write-commit operation. So we need
another buffer on the stack to dereference into, and then copy from it.
Thank you,
>
> For ebpf, this series is compile-tested only.
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] tracing: Allow system call tracepoints to handle page faults
2024-09-16 19:49 ` Masami Hiramatsu
@ 2024-09-17 9:54 ` Mathieu Desnoyers
0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2024-09-17 9:54 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, linux-kernel, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
linux-trace-kernel, Josh Poimboeuf, Kees Cook, Andy Lutomirski,
Will Drewry
On 2024-09-16 21:49, Masami Hiramatsu (Google) wrote:
> On Mon, 9 Sep 2024 16:16:44 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Wire up the system call tracepoints with Tasks Trace RCU to allow
>> the ftrace, perf, and eBPF tracers to handle page faults.
>>
>> This series does the initial wire-up allowing tracers to handle page
>> faults, but leaves out the actual handling of said page faults as future
>> work.
>>
>> This series was compile- and runtime-tested with ftrace and perf syscall
>> tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
>> generated code to validate that the intended probes are used for raw
>> syscall tracing. The might_fault() added within those probes validates
>> that they are called from a context where handling a page fault is OK.
>
> I think this series itself is valuable.
> However, I'm still not sure why ftrace needs to handle page faults.
> This allows the syscall trace events themselves to handle page faults, but
> the raw-syscall/syscall events only access registers, right?
You are correct that ftrace only accesses registers as of today. And
maybe that will remain the focus for ftrace, as its focus appears to
be more about what happens inside the kernel than about causality from
user-space. But different tracers have different focuses and use-cases.
It's a different story for eBPF and LTTng though: LTTng grabs filename
strings from user-space for the openat system call, for instance, so
we can reconstruct at post-processing time which system calls were done
on which files. This is convenient if the end user wishes to focus
on the activity for a given file or set of files.

eBPF also allows grabbing userspace data AFAIR, but none of those
tracers can handle page faults because tracepoints disable preemption,
which leads to missing data in specific cases, e.g. immediately after an
execve syscall when pages are not faulted in yet.
Also having syscall entry called from a context that can handle
preemption would allow LTTng (or eBPF) to do an immediate stackwalk
(see the sframe work from Josh) directly at system call entry. This
can be useful for filtering based on the user callstack before writing
to a ring buffer.
>
> I think that page faults happen only when dereferencing those registers
> as pointers to data structures, and currently that is done by probes
> like eprobe and fprobe. In order to handle faults in those probes, we
> need to change how they write data into the per-cpu ring buffer.
>
> Currently, those probes reserve an entry in the ring buffer, write the
> dereferenced data into the entry, and commit it. Preemption stays
> disabled throughout this reserve-write-commit operation. So we need
> another buffer on the stack to dereference into, and then copy from it.
There are quite a few approaches we can take there, with different
tradeoffs.
A) Issue dummy loads of user-space data just to trigger page faults
before disabling preemption. Unless the system is under extreme memory
pressure, this should be enough to page in the data, and it should stay
available for copy into the ring buffer immediately afterwards with
preemption disabled. This should be fine for practical purposes. This is
simple to implement and is the route I intend to take initially for LTTng.
B) Do a copy into a local buffer and take page faults at that point. This
brings up the question of where to allocate the buffer. This also requires
an extra copy: from userspace to the local buffer, then to the per-cpu
ring buffer, so it may come with a certain overhead. One advantage of that
approach is that it opens the door to fixing TOCTOU races that syscall
audit systems (e.g. seccomp) have, if we change the system call
implementation to use data from this argument copy rather than re-reading
it from userspace within the system call. But this is a much larger
endeavor that should be done in collaboration between the tracing &
seccomp developers.
C) Modify the ring buffer to make it usable without disabling preemption.
This is straightforward in LTTng because its ring buffer has been designed
to be usable in a preemptible userspace context as well (LTTng-UST).
This may not be as easy for ftrace, since disabling preemption is
rooted deep in its ring buffer design.
Besides those basic tradeoffs, we should of course consider the overhead
associated with each approach.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 17+ messages in thread