[PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults

linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults
@ 2024-10-03 15:16 Mathieu Desnoyers
  2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel

Wire up the system call tracepoints with Tasks Trace RCU to allow
the ftrace, perf, and eBPF tracers to handle page faults.

This series does the initial wire-up allowing tracers to handle page
faults, but leaves out the actual handling of said page faults as future
work.

This series was compile and runtime tested with ftrace and perf syscall
tracing and raw syscall tracing, adding a WARN_ON_ONCE() in the
generated code to validate that the intended probes are used for raw
syscall tracing. The might_fault() added within those probes validate
that they are called from a context where handling a page fault is OK.

This series replaces the "Faultable Tracepoints v6" series found at [1].

This has been rebased on v6.12-rc1, without any conflicts. I have fixed
the build bot warnings, those were caused by build error with
CONFIG_TRACING=n. I've added the Acked-by and Tested-by tags to relevant
commits.

Steven, can you merge it through the tracing tree ?

Thanks,

Mathieu

Link: https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ # [1]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: linux-trace-kernel@vger.kernel.org

Mathieu Desnoyers (8):
  tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  tracing/ftrace: guard syscall probe with preempt_notrace
  tracing/perf: guard syscall probe with preempt_notrace
  tracing/bpf: guard syscall probe with preempt_notrace
  tracing: Allow system call tracepoints to handle page faults
  tracing/ftrace: Add might_fault check to syscall probes
  tracing/perf: Add might_fault check to syscall probes
  tracing/bpf: Add might_fault check to syscall probes

 include/linux/tracepoint.h      | 104 ++++++++++++++++++++++++++++----
 include/trace/bpf_probe.h       |  13 ++++
 include/trace/define_trace.h    |   5 ++
 include/trace/events/syscalls.h |   4 +-
 include/trace/perf.h            |  43 ++++++++++++-
 include/trace/trace_events.h    |  61 +++++++++++++++++--
 init/Kconfig                    |   1 +
 kernel/entry/common.c           |   4 +-
 kernel/trace/trace_syscalls.c   |  44 +++++++++++---
 9 files changed, 247 insertions(+), 32 deletions(-)

-- 
2.39.2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 21:32   ` Steven Rostedt
  2024-10-04 10:34   ` kernel test robot
  2024-10-03 15:16 ` [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

In preparation for allowing system call tracepoints to handle page
faults, introduce TRACE_EVENT_SYSCALL to declare the sys_enter/sys_exit
tracepoints.

Emit the static inlines register_trace_syscall_##name for events
declared with TRACE_EVENT_SYSCALL, allowing source-level validation
that only probes meant to handle system call entry/exit events are
registered to them.

Move the common code between __DECLARE_TRACE and __DECLARE_TRACE_SYSCALL
into __DECLARE_TRACE_COMMON.

This change is not meant to alter the generated code, and only prepares
the following modifications.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
Changes since v0:
- Fix allnoconfig build by adding __DECLARE_TRACE_SYSCALL define in
  CONFIG_TRACEPOINTS=n case.
- Rename unregister_trace_sys_{enter,exit} to
  unregister_trace_syscall_sys_{enter,exit} for symmetry with
  register.
- Add emit trace_syscall_##name##_enabled for syscall tracepoints
  rather than trace_##name##_enabled, so it is in sync with the
  rest of the naming.
---
 include/linux/tracepoint.h      | 83 ++++++++++++++++++++++++++++++---
 include/trace/bpf_probe.h       |  3 ++
 include/trace/define_trace.h    |  5 ++
 include/trace/events/syscalls.h |  4 +-
 include/trace/perf.h            |  3 ++
 include/trace/trace_events.h    | 28 +++++++++++
 kernel/entry/common.c           |  4 +-
 kernel/trace/trace_syscalls.c   | 16 +++----
 8 files changed, 127 insertions(+), 19 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 93a9f3070b48..666499b9f3be 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -268,10 +268,17 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * site if it is not watching, as it will need to be active when the
  * tracepoint is enabled.
  */
-#define __DECLARE_TRACE(name, proto, args, cond, data_proto)		\
+#define __DECLARE_TRACE_COMMON(name, proto, args, cond, data_proto)	\
 	extern int __traceiter_##name(data_proto);			\
 	DECLARE_STATIC_CALL(tp_func_##name, __traceiter_##name);	\
 	extern struct tracepoint __tracepoint_##name;			\
+	static inline void						\
+	check_trace_callback_type_##name(void (*cb)(data_proto))	\
+	{								\
+	}								\
+
+#define __DECLARE_TRACE(name, proto, args, cond, data_proto)		\
+	__DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), cond, PARAMS(data_proto)) \
 	static inline void trace_##name(proto)				\
 	{								\
 		if (static_key_false(&__tracepoint_##name.key))		\
@@ -283,8 +290,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 				  "RCU not watching for tracepoint");	\
 		}							\
 	}								\
-	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),		\
-			    PARAMS(cond))				\
+	static inline void trace_##name##_rcuidle(proto)		\
+	{								\
+		if (static_key_false(&__tracepoint_##name.key))		\
+			__DO_TRACE(name,				\
+				TP_ARGS(args),				\
+				TP_CONDITION(cond), 1);			\
+	}								\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
@@ -302,14 +314,42 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	unregister_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
 		return tracepoint_probe_unregister(&__tracepoint_##name,\
-						(void *)probe, data);	\
+						   (void *)probe, data);\
 	}								\
-	static inline void						\
-	check_trace_callback_type_##name(void (*cb)(data_proto))	\
+	static inline bool						\
+	trace_##name##_enabled(void)					\
+	{								\
+		return static_key_false(&__tracepoint_##name.key);	\
+	}
+
+
+#define __DECLARE_TRACE_SYSCALL(name, proto, args, cond, data_proto)	\
+	__DECLARE_TRACE_COMMON(name, PARAMS(proto), PARAMS(args), cond, PARAMS(data_proto)) \
+	static inline void trace_syscall_##name(proto)			\
+	{								\
+		if (static_key_false(&__tracepoint_##name.key))		\
+			__DO_TRACE(name,				\
+				TP_ARGS(args),				\
+				TP_CONDITION(cond), 0);			\
+		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
+			WARN_ONCE(!rcu_is_watching(),			\
+				  "RCU not watching for tracepoint");	\
+		}							\
+	}								\
+	static inline int						\
+	register_trace_syscall_##name(void (*probe)(data_proto), void *data) \
 	{								\
+		return tracepoint_probe_register(&__tracepoint_##name,	\
+						 (void *)probe, data);	\
+	}								\
+	static inline int						\
+	unregister_trace_syscall_##name(void (*probe)(data_proto), void *data) \
+	{								\
+		return tracepoint_probe_unregister(&__tracepoint_##name,\
+						   (void *)probe, data);\
 	}								\
 	static inline bool						\
-	trace_##name##_enabled(void)					\
+	trace_syscall_##name##_enabled(void)				\
 	{								\
 		return static_key_false(&__tracepoint_##name.key);	\
 	}
@@ -398,6 +438,27 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		return false;						\
 	}
 
+#define __DECLARE_TRACE_SYSCALL(name, proto, args, cond, data_proto)	\
+	static inline void trace_syscall_##name(proto)			\
+	{ }								\
+	static inline int						\
+	register_trace_syscall_##name(void (*probe)(data_proto),	\
+				      void *data)			\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline int						\
+	unregister_trace_syscall_##name(void (*probe)(data_proto),	\
+					void *data)			\
+	{								\
+		return -ENOSYS;						\
+	}								\
+	static inline bool						\
+	trace_syscall_##name##_enabled(void)				\
+	{								\
+		return false;						\
+	}
+
 #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
 #define DEFINE_TRACE(name, proto, args)
 #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
@@ -459,6 +520,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
 			PARAMS(void *__data, proto))
 
+#define DECLARE_TRACE_SYSCALL(name, proto, args)			\
+	__DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args),	\
+				cpu_online(raw_smp_processor_id()),	\
+				PARAMS(void *__data, proto))
+
 #define TRACE_EVENT_FLAGS(event, flag)
 
 #define TRACE_EVENT_PERF_PERM(event, expr...)
@@ -596,6 +662,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			      struct, assign, print)		\
 	DECLARE_TRACE_CONDITION(name, PARAMS(proto),		\
 				PARAMS(args), PARAMS(cond))
+#define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign,	\
+			    print, reg, unreg)			\
+	DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
 
 #define TRACE_EVENT_FLAGS(event, flag)
 
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index a2ea11cc912e..c85bbce5aaa5 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -53,6 +53,9 @@ __bpf_trace_##call(void *__data, proto)					\
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 00723935dcc7..ff5fa17a6259 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -46,6 +46,10 @@
 		assign, print, reg, unreg)			\
 	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
+#undef TRACE_EVENT_SYSCALL
+#define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
+	DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
+
 #undef TRACE_EVENT_NOP
 #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
 
@@ -107,6 +111,7 @@
 #undef TRACE_EVENT
 #undef TRACE_EVENT_FN
 #undef TRACE_EVENT_FN_COND
+#undef TRACE_EVENT_SYSCALL
 #undef TRACE_EVENT_CONDITION
 #undef TRACE_EVENT_NOP
 #undef DEFINE_EVENT_NOP
diff --git a/include/trace/events/syscalls.h b/include/trace/events/syscalls.h
index b6e0cbc2c71f..f31ff446b468 100644
--- a/include/trace/events/syscalls.h
+++ b/include/trace/events/syscalls.h
@@ -15,7 +15,7 @@
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
 
-TRACE_EVENT_FN(sys_enter,
+TRACE_EVENT_SYSCALL(sys_enter,
 
 	TP_PROTO(struct pt_regs *regs, long id),
 
@@ -41,7 +41,7 @@ TRACE_EVENT_FN(sys_enter,
 
 TRACE_EVENT_FLAGS(sys_enter, TRACE_EVENT_FL_CAP_ANY)
 
-TRACE_EVENT_FN(sys_exit,
+TRACE_EVENT_SYSCALL(sys_exit,
 
 	TP_PROTO(struct pt_regs *regs, long ret),
 
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 2c11181c82e0..ded997af481e 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -55,6 +55,9 @@ perf_trace_##call(void *__data, proto)					\
 				  head, __task);			\
 }
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index c2f9cabf154d..8bcbb9ee44de 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -45,6 +45,16 @@
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_SYSCALL
+#define TRACE_EVENT_SYSCALL(name, proto, args, tstruct, assign, print, reg, unreg) \
+	DECLARE_EVENT_SYSCALL_CLASS(name,		       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 #include "stages/stage1_struct_define.h"
 
 #undef DECLARE_EVENT_CLASS
@@ -57,6 +67,9 @@
 									\
 	static struct trace_event_class event_class_##name;
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)	\
 	static struct trace_event_call	__used		\
@@ -117,6 +130,9 @@
 		tstruct;						\
 	};
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, name, proto, args)
 
@@ -208,6 +224,9 @@ static struct trace_event_functions trace_event_type_funcs_##call = {	\
 	.trace			= trace_raw_output_##call,		\
 };
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, call, proto, args, print)		\
 static notrace enum print_line_t					\
@@ -265,6 +284,9 @@ static inline notrace int trace_event_get_offsets_##call(		\
 	return __data_size;						\
 }
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 /*
@@ -409,6 +431,9 @@ trace_event_raw_event_##call(void *__data, proto)			\
  * fail to compile unless it too is updated.
  */
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
 static inline void ftrace_test_probe_##call(void)			\
@@ -434,6 +459,9 @@ static struct trace_event_class __used __refdata event_class_##call = { \
 	_TRACE_PERF_INIT(call)						\
 };
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
 									\
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5b6934e23c21..c9ac1c605d8b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -58,7 +58,7 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
 	syscall = syscall_get_nr(current, regs);
 
 	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) {
-		trace_sys_enter(regs, syscall);
+		trace_syscall_sys_enter(regs, syscall);
 		/*
 		 * Probes or BPF hooks in the tracepoint may have changed the
 		 * system call number as well.
@@ -166,7 +166,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
 	audit_syscall_exit(regs);
 
 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
-		trace_sys_exit(regs, syscall_get_return_value(current, regs));
+		trace_syscall_sys_exit(regs, syscall_get_return_value(current, regs));
 
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 785733245ead..67ac5366f724 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -377,7 +377,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_enter)
-		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
+		ret = register_trace_syscall_sys_enter(ftrace_syscall_enter, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->enter_syscall_files[num], file);
 		tr->sys_refcount_enter++;
@@ -399,7 +399,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
 	tr->sys_refcount_enter--;
 	RCU_INIT_POINTER(tr->enter_syscall_files[num], NULL);
 	if (!tr->sys_refcount_enter)
-		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
+		unregister_trace_syscall_sys_enter(ftrace_syscall_enter, tr);
 	mutex_unlock(&syscall_trace_lock);
 }
 
@@ -415,7 +415,7 @@ static int reg_event_syscall_exit(struct trace_event_file *file,
 		return -ENOSYS;
 	mutex_lock(&syscall_trace_lock);
 	if (!tr->sys_refcount_exit)
-		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
+		ret = register_trace_syscall_sys_exit(ftrace_syscall_exit, tr);
 	if (!ret) {
 		rcu_assign_pointer(tr->exit_syscall_files[num], file);
 		tr->sys_refcount_exit++;
@@ -437,7 +437,7 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
 	tr->sys_refcount_exit--;
 	RCU_INIT_POINTER(tr->exit_syscall_files[num], NULL);
 	if (!tr->sys_refcount_exit)
-		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
+		unregister_trace_syscall_sys_exit(ftrace_syscall_exit, tr);
 	mutex_unlock(&syscall_trace_lock);
 }
 
@@ -633,7 +633,7 @@ static int perf_sysenter_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_enter)
-		ret = register_trace_sys_enter(perf_syscall_enter, NULL);
+		ret = register_trace_syscall_sys_enter(perf_syscall_enter, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall entry trace point");
 	} else {
@@ -654,7 +654,7 @@ static void perf_sysenter_disable(struct trace_event_call *call)
 	sys_perf_refcount_enter--;
 	clear_bit(num, enabled_perf_enter_syscalls);
 	if (!sys_perf_refcount_enter)
-		unregister_trace_sys_enter(perf_syscall_enter, NULL);
+		unregister_trace_syscall_sys_enter(perf_syscall_enter, NULL);
 	mutex_unlock(&syscall_trace_lock);
 }
 
@@ -732,7 +732,7 @@ static int perf_sysexit_enable(struct trace_event_call *call)
 
 	mutex_lock(&syscall_trace_lock);
 	if (!sys_perf_refcount_exit)
-		ret = register_trace_sys_exit(perf_syscall_exit, NULL);
+		ret = register_trace_syscall_sys_exit(perf_syscall_exit, NULL);
 	if (ret) {
 		pr_info("event trace: Could not activate syscall exit trace point");
 	} else {
@@ -753,7 +753,7 @@ static void perf_sysexit_disable(struct trace_event_call *call)
 	sys_perf_refcount_exit--;
 	clear_bit(num, enabled_perf_exit_syscalls);
 	if (!sys_perf_refcount_exit)
-		unregister_trace_sys_exit(perf_syscall_exit, NULL);
+		unregister_trace_syscall_sys_exit(perf_syscall_exit, NULL);
 	mutex_unlock(&syscall_trace_lock);
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
  2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:23   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 3/8] tracing/perf: " Mathieu Desnoyers
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that ftrace can handle this change by
explicitly disabling preemption within the ftrace system call tracepoint
probes to respect the current expectations within ftrace ring buffer
code.

This change does not yet allow ftrace to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/trace_events.h  | 38 ++++++++++++++++++++++++++++-------
 kernel/trace/trace_syscalls.c | 12 +++++++++++
 2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8bcbb9ee44de..0228d9ed94a3 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -263,6 +263,9 @@ static struct trace_event_fields trace_event_fields_##call[] = {	\
 	tstruct								\
 	{} };
 
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+
 #undef DEFINE_EVENT_PRINT
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print)
 
@@ -396,11 +399,11 @@ static inline notrace int trace_event_get_offsets_##call(		\
 
 #include "stages/stage6_event_callback.h"
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
-									\
+
+#undef __DECLARE_EVENT_CLASS
+#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
 static notrace void							\
-trace_event_raw_event_##call(void *__data, proto)			\
+do_trace_event_raw_event_##call(void *__data, proto)			\
 {									\
 	struct trace_event_file *trace_file = __data;			\
 	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
@@ -425,15 +428,34 @@ trace_event_raw_event_##call(void *__data, proto)			\
 									\
 	trace_event_buffer_commit(&fbuffer);				\
 }
+
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+		      PARAMS(assign), PARAMS(print))			\
+static notrace void							\
+trace_event_raw_event_##call(void *__data, proto)			\
+{									\
+	do_trace_event_raw_event_##call(__data, args);			\
+}
+
+#undef DECLARE_EVENT_SYSCALL_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+		      PARAMS(assign), PARAMS(print))			\
+static notrace void							\
+trace_event_raw_event_##call(void *__data, proto)			\
+{									\
+	guard(preempt_notrace)();					\
+	do_trace_event_raw_event_##call(__data, args);			\
+}
+
 /*
  * The ftrace_test_probe is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the ftrace probe will
  * fail to compile unless it too is updated.
  */
 
-#undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
-
 #undef DEFINE_EVENT
 #define DEFINE_EVENT(template, call, proto, args)			\
 static inline void ftrace_test_probe_##call(void)			\
@@ -443,6 +465,8 @@ static inline void ftrace_test_probe_##call(void)			\
 
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
+#undef __DECLARE_EVENT_CLASS
+
 #include "stages/stage7_class_define.h"
 
 #undef DECLARE_EVENT_CLASS
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 67ac5366f724..ab4db8c23f36 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int syscall_nr;
 	int size;
 
+	/*
+	 * Syscall probe called with preemption enabled, but the ring
+	 * buffer and per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	struct trace_event_buffer fbuffer;
 	int syscall_nr;
 
+	/*
+	 * Syscall probe called with preemption enabled, but the ring
+	 * buffer and per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 3/8] tracing/perf: guard syscall probe with preempt_notrace
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
  2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
  2024-10-03 15:16 ` [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:25   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 4/8] tracing/bpf: " Mathieu Desnoyers
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that perf can handle this change by
explicitly disabling preemption within the perf system call tracepoint
probes to respect the current expectations within perf ring buffer code.

This change does not yet allow perf to take page faults per se within
its probe, but allows its existing probes to adapt to the upcoming
change.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/perf.h          | 41 +++++++++++++++++++++++++++++++----
 kernel/trace/trace_syscalls.c | 12 ++++++++++
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index ded997af481e..5650c1bad088 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -12,10 +12,10 @@
 #undef __perf_task
 #define __perf_task(t)	(__task = (t))
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+#undef __DECLARE_EVENT_CLASS
+#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
 static notrace void							\
-perf_trace_##call(void *__data, proto)					\
+do_perf_trace_##call(void *__data, proto)				\
 {									\
 	struct trace_event_call *event_call = __data;			\
 	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
@@ -55,8 +55,38 @@ perf_trace_##call(void *__data, proto)					\
 				  head, __task);			\
 }
 
+/*
+ * Define unused __count and __task variables to use @args to pass
+ * arguments to do_perf_trace_##call. This is needed because the
+ * macros __perf_count and __perf_task introduce the side-effect to
+ * store copies into those local variables.
+ */
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+		      PARAMS(assign), PARAMS(print))			\
+static notrace void							\
+perf_trace_##call(void *__data, proto)					\
+{									\
+	u64 __count __attribute__((unused));				\
+	struct task_struct *__task __attribute__((unused));		\
+									\
+	do_perf_trace_##call(__data, args);				\
+}
+
 #undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
+__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
+		      PARAMS(assign), PARAMS(print))			\
+static notrace void							\
+perf_trace_##call(void *__data, proto)					\
+{									\
+	u64 __count __attribute__((unused));				\
+	struct task_struct *__task __attribute__((unused));		\
+									\
+	guard(preempt_notrace)();					\
+	do_perf_trace_##call(__data, args);				\
+}
 
 /*
  * This part is compiled out, it is only here as a build time check
@@ -76,4 +106,7 @@ static inline void perf_test_probe_##call(void)				\
 	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+
+#undef __DECLARE_EVENT_CLASS
+
 #endif /* CONFIG_PERF_EVENTS */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index ab4db8c23f36..edcfa47446c7 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -596,6 +596,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int rctx;
 	int size;
 
+	/*
+	 * Syscall probe called with preemption enabled, but the ring
+	 * buffer and per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
@@ -698,6 +704,12 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int rctx;
 	int size;
 
+	/*
+	 * Syscall probe called with preemption enabled, but the ring
+	 * buffer and per-cpu data require preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 4/8] tracing/bpf: guard syscall probe with preempt_notrace
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2024-10-03 15:16 ` [PATCH v1 3/8] tracing/perf: " Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:26   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that bpf can handle this change by
explicitly disabling preemption within the bpf system call tracepoint
probes to respect the current expectations within bpf tracing code.

This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to adapt to the upcoming change.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/bpf_probe.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index c85bbce5aaa5..211b98d45fc6 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -53,8 +53,17 @@ __bpf_trace_##call(void *__data, proto)					\
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
 
+#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
+static notrace void							\
+__bpf_trace_##call(void *__data, proto)					\
+{									\
+	guard(preempt_notrace)();					\
+	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
+}
+
 #undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print)	\
+	__BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
 
 /*
  * This part is compiled out, it is only here as a build time check
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2024-10-03 15:16 ` [PATCH v1 4/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:29   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

Use Tasks Trace RCU to protect iteration of system call enter/exit
tracepoint probes to allow those probes to handle page faults.

In preparation for this change, all tracers registering to system call
enter/exit tracepoints should expect those to be called with preemption
enabled.

This allows tracers to fault-in userspace system call arguments such as
path strings within their probe callbacks.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/linux/tracepoint.h | 25 +++++++++++++++++--------
 init/Kconfig               |  1 +
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 666499b9f3be..6faf34e5efc9 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -17,6 +17,7 @@
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/rcupdate.h>
+#include <linux/rcupdate_trace.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/static_call.h>
 
@@ -109,6 +110,7 @@ void for_each_tracepoint_in_module(struct module *mod,
 #ifdef CONFIG_TRACEPOINTS
 static inline void tracepoint_synchronize_unregister(void)
 {
+	synchronize_rcu_tasks_trace();
 	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
@@ -211,7 +213,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  * it_func[0] is never NULL because there is at least one element in the array
  * when the array itself is non NULL.
  */
-#define __DO_TRACE(name, args, cond, rcuidle)				\
+#define __DO_TRACE(name, args, cond, rcuidle, syscall)			\
 	do {								\
 		int __maybe_unused __idx = 0;				\
 									\
@@ -222,8 +224,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			      "Bad RCU usage for tracepoint"))		\
 			return;						\
 									\
-		/* keep srcu and sched-rcu usage consistent */		\
-		preempt_disable_notrace();				\
+		if (syscall) {						\
+			rcu_read_lock_trace();				\
+		} else {						\
+			/* keep srcu and sched-rcu usage consistent */	\
+			preempt_disable_notrace();			\
+		}							\
 									\
 		/*							\
 		 * For rcuidle callers, use srcu since sched-rcu	\
@@ -241,7 +247,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
 		}							\
 									\
-		preempt_enable_notrace();				\
+		if (syscall)						\
+			rcu_read_unlock_trace();			\
+		else							\
+			preempt_enable_notrace();			\
 	} while (0)
 
 #ifndef MODULE
@@ -251,7 +260,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(name,				\
 				TP_ARGS(args),				\
-				TP_CONDITION(cond), 1);			\
+				TP_CONDITION(cond), 1, 0);		\
 	}
 #else
 #define __DECLARE_TRACE_RCU(name, proto, args, cond)
@@ -284,7 +293,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(name,				\
 				TP_ARGS(args),				\
-				TP_CONDITION(cond), 0);			\
+				TP_CONDITION(cond), 0, 0);		\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			WARN_ONCE(!rcu_is_watching(),			\
 				  "RCU not watching for tracepoint");	\
@@ -295,7 +304,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(name,				\
 				TP_ARGS(args),				\
-				TP_CONDITION(cond), 1);			\
+				TP_CONDITION(cond), 1, 0);		\
 	}								\
 	static inline int						\
 	register_trace_##name(void (*probe)(data_proto), void *data)	\
@@ -330,7 +339,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		if (static_key_false(&__tracepoint_##name.key))		\
 			__DO_TRACE(name,				\
 				TP_ARGS(args),				\
-				TP_CONDITION(cond), 0);			\
+				TP_CONDITION(cond), 0, 1);		\
 		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
 			WARN_ONCE(!rcu_is_watching(),			\
 				  "RCU not watching for tracepoint");	\
diff --git a/init/Kconfig b/init/Kconfig
index fbd0cb06a50a..eedd0064fb36 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1984,6 +1984,7 @@ config BINDGEN_VERSION_TEXT
 #
 config TRACEPOINTS
 	bool
+	select TASKS_TRACE_RCU
 
 source "kernel/Kconfig.kexec"
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2024-10-03 15:16 ` [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:36   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 7/8] tracing/perf: " Mathieu Desnoyers
  2024-10-03 15:16 ` [PATCH v1 8/8] tracing/bpf: " Mathieu Desnoyers
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

Add a might_fault() check to validate that the ftrace sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/trace_events.h  | 1 +
 kernel/trace/trace_syscalls.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 0228d9ed94a3..e0d4850b0d77 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -446,6 +446,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
 static notrace void							\
 trace_event_raw_event_##call(void *__data, proto)			\
 {									\
+	might_fault();							\
 	guard(preempt_notrace)();					\
 	do_trace_event_raw_event_##call(__data, args);			\
 }
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index edcfa47446c7..89d7e4c57b5b 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -303,6 +303,7 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	 * Syscall probe called with preemption enabled, but the ring
 	 * buffer and per-cpu data require preemption to be disabled.
 	 */
+	might_fault();
 	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
@@ -348,6 +349,7 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	 * Syscall probe called with preemption enabled, but the ring
 	 * buffer and per-cpu data require preemption to be disabled.
 	 */
+	might_fault();
 	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 7/8] tracing/perf: Add might_fault check to syscall probes
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2024-10-03 15:16 ` [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:37   ` Steven Rostedt
  2024-10-03 15:16 ` [PATCH v1 8/8] tracing/bpf: " Mathieu Desnoyers
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

Add a might_fault() check to validate that the perf sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/perf.h          | 1 +
 kernel/trace/trace_syscalls.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 5650c1bad088..321bfd7919f6 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -84,6 +84,7 @@ perf_trace_##call(void *__data, proto)					\
 	u64 __count __attribute__((unused));				\
 	struct task_struct *__task __attribute__((unused));		\
 									\
+	might_fault();							\
 	guard(preempt_notrace)();					\
 	do_perf_trace_##call(__data, args);				\
 }
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 89d7e4c57b5b..0d42d6f293d6 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -602,6 +602,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	 * Syscall probe called with preemption enabled, but the ring
 	 * buffer and per-cpu data require preemption to be disabled.
 	 */
+	might_fault();
 	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
@@ -710,6 +711,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	 * Syscall probe called with preemption enabled, but the ring
 	 * buffer and per-cpu data require preemption to be disabled.
 	 */
+	might_fault();
 	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v1 8/8] tracing/bpf: Add might_fault check to syscall probes
  2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2024-10-03 15:16 ` [PATCH v1 7/8] tracing/perf: " Mathieu Desnoyers
@ 2024-10-03 15:16 ` Mathieu Desnoyers
  2024-10-03 22:38   ` Steven Rostedt
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-03 15:16 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

Add a might_fault() check to validate that the bpf sys_enter/sys_exit
probe callbacks are indeed called from a context where page faults can
be handled.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/bpf_probe.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index 211b98d45fc6..099df5c3e38a 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -57,6 +57,7 @@ __bpf_trace_##call(void *__data, proto)					\
 static notrace void							\
 __bpf_trace_##call(void *__data, proto)					\
 {									\
+	might_fault();							\
 	guard(preempt_notrace)();					\
 	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
@ 2024-10-03 21:32   ` Steven Rostedt
  2024-10-04  0:15     ` Mathieu Desnoyers
  2024-10-04 10:34   ` kernel test robot
  1 sibling, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 21:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu,  3 Oct 2024 11:16:31 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> @@ -283,8 +290,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  				  "RCU not watching for tracepoint");	\
>  		}							\
>  	}								\
> -	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),		\
> -			    PARAMS(cond))				\
> +	static inline void trace_##name##_rcuidle(proto)		\
> +	{								\
> +		if (static_key_false(&__tracepoint_##name.key))		\
> +			__DO_TRACE(name,				\
> +				TP_ARGS(args),				\
> +				TP_CONDITION(cond), 1);			\
> +	}								\
>  	static inline int						\
>  	register_trace_##name(void (*probe)(data_proto), void *data)	\
>  	{								\

Looking at this part of your change, I realized it's time to nuke the
rcuidle() variant.

Feel free to rebase on top of this patch:

  https://lore.kernel.org/all/20241003173051.6b178bb3@gandalf.local.home/

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-03 15:16 ` [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
@ 2024-10-03 22:23   ` Steven Rostedt
  2024-10-04  0:26     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu,  3 Oct 2024 11:16:32 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that ftrace can handle this change by
> explicitly disabling preemption within the ftrace system call tracepoint
> probes to respect the current expectations within ftrace ring buffer
> code.

The ftrace ring buffer doesn't expect preemption being disabled before use.
It will explicitly disable preemption.

I don't think this patch is needed.

-- Steve


> 
> This change does not yet allow ftrace to take page faults per se within
> its probe, but allows its existing probes to adapt to the upcoming
> change.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/8] tracing/perf: guard syscall probe with preempt_notrace
  2024-10-03 15:16 ` [PATCH v1 3/8] tracing/perf: " Mathieu Desnoyers
@ 2024-10-03 22:25   ` Steven Rostedt
  2024-10-04  0:17     ` Frederic Weisbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson, Frederic Weisbecker

On Thu,  3 Oct 2024 11:16:33 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that perf can handle this change by
> explicitly disabling preemption within the perf system call tracepoint
> probes to respect the current expectations within perf ring buffer code.
> 
> This change does not yet allow perf to take page faults per se within
> its probe, but allows its existing probes to adapt to the upcoming
> change.

Frederic,

Does the perf ring buffer expect preemption to be disabled when used?

In other words, is this patch needed?

-- Steve


> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/trace/perf.h          | 41 +++++++++++++++++++++++++++++++----
>  kernel/trace/trace_syscalls.c | 12 ++++++++++
>  2 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/include/trace/perf.h b/include/trace/perf.h
> index ded997af481e..5650c1bad088 100644
> --- a/include/trace/perf.h
> +++ b/include/trace/perf.h
> @@ -12,10 +12,10 @@
>  #undef __perf_task
>  #define __perf_task(t)	(__task = (t))
>  
> -#undef DECLARE_EVENT_CLASS
> -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> +#undef __DECLARE_EVENT_CLASS
> +#define __DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
>  static notrace void							\
> -perf_trace_##call(void *__data, proto)					\
> +do_perf_trace_##call(void *__data, proto)				\
>  {									\
>  	struct trace_event_call *event_call = __data;			\
>  	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
> @@ -55,8 +55,38 @@ perf_trace_##call(void *__data, proto)					\
>  				  head, __task);			\
>  }
>  
> +/*
> + * Define unused __count and __task variables to use @args to pass
> + * arguments to do_perf_trace_##call. This is needed because the
> + * macros __perf_count and __perf_task introduce the side-effect to
> + * store copies into those local variables.
> + */
> +#undef DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> +__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
> +		      PARAMS(assign), PARAMS(print))			\
> +static notrace void							\
> +perf_trace_##call(void *__data, proto)					\
> +{									\
> +	u64 __count __attribute__((unused));				\
> +	struct task_struct *__task __attribute__((unused));		\
> +									\
> +	do_perf_trace_##call(__data, args);				\
> +}
> +
>  #undef DECLARE_EVENT_SYSCALL_CLASS
> -#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
> +__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
> +		      PARAMS(assign), PARAMS(print))			\
> +static notrace void							\
> +perf_trace_##call(void *__data, proto)					\
> +{									\
> +	u64 __count __attribute__((unused));				\
> +	struct task_struct *__task __attribute__((unused));		\
> +									\
> +	guard(preempt_notrace)();					\
> +	do_perf_trace_##call(__data, args);				\
> +}
>  
>  /*
>   * This part is compiled out, it is only here as a build time check
> @@ -76,4 +106,7 @@ static inline void perf_test_probe_##call(void)				\
>  	DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
>  
>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> +
> +#undef __DECLARE_EVENT_CLASS
> +
>  #endif /* CONFIG_PERF_EVENTS */
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index ab4db8c23f36..edcfa47446c7 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -596,6 +596,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  	int rctx;
>  	int size;
>  
> +	/*
> +	 * Syscall probe called with preemption enabled, but the ring
> +	 * buffer and per-cpu data require preemption to be disabled.
> +	 */
> +	guard(preempt_notrace)();
> +
>  	syscall_nr = trace_get_syscall_nr(current, regs);
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;
> @@ -698,6 +704,12 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>  	int rctx;
>  	int size;
>  
> +	/*
> +	 * Syscall probe called with preemption enabled, but the ring
> +	 * buffer and per-cpu data require preemption to be disabled.
> +	 */
> +	guard(preempt_notrace)();
> +
>  	syscall_nr = trace_get_syscall_nr(current, regs);
>  	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>  		return;


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 4/8] tracing/bpf: guard syscall probe with preempt_notrace
  2024-10-03 15:16 ` [PATCH v1 4/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-10-03 22:26   ` Steven Rostedt
  2024-10-03 23:05     ` Alexei Starovoitov
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On Thu,  3 Oct 2024 11:16:34 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that bpf can handle this change by
> explicitly disabling preemption within the bpf system call tracepoint
> probes to respect the current expectations within bpf tracing code.
> 
> This change does not yet allow bpf to take page faults per se within its
> probe, but allows its existing probes to adapt to the upcoming change.
> 

I guess the BPF folks should state if this is needed or not?

Does the BPF hooks into the tracepoints expect preemption to be disabled
when called?

-- Steve


> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/trace/bpf_probe.h | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index c85bbce5aaa5..211b98d45fc6 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -53,8 +53,17 @@ __bpf_trace_##call(void *__data, proto)					\
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
>  
> +#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
> +static notrace void							\
> +__bpf_trace_##call(void *__data, proto)					\
> +{									\
> +	guard(preempt_notrace)();					\
> +	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
> +}
> +
>  #undef DECLARE_EVENT_SYSCALL_CLASS
> -#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print)	\
> +	__BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
>  
>  /*
>   * This part is compiled out, it is only here as a build time check


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults
  2024-10-03 15:16 ` [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
@ 2024-10-03 22:29   ` Steven Rostedt
  2024-10-04  0:35     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu,  3 Oct 2024 11:16:35 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Use Tasks Trace RCU to protect iteration of system call enter/exit
> tracepoint probes to allow those probes to handle page faults.
> 
> In preparation for this change, all tracers registering to system call
> enter/exit tracepoints should expect those to be called with preemption
> enabled.
> 
> This allows tracers to fault-in userspace system call arguments such as
> path strings within their probe callbacks.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/linux/tracepoint.h | 25 +++++++++++++++++--------
>  init/Kconfig               |  1 +
>  2 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 666499b9f3be..6faf34e5efc9 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -17,6 +17,7 @@
>  #include <linux/errno.h>
>  #include <linux/types.h>
>  #include <linux/rcupdate.h>
> +#include <linux/rcupdate_trace.h>
>  #include <linux/tracepoint-defs.h>
>  #include <linux/static_call.h>
>  
> @@ -109,6 +110,7 @@ void for_each_tracepoint_in_module(struct module *mod,
>  #ifdef CONFIG_TRACEPOINTS
>  static inline void tracepoint_synchronize_unregister(void)
>  {
> +	synchronize_rcu_tasks_trace();
>  	synchronize_srcu(&tracepoint_srcu);
>  	synchronize_rcu();
>  }
> @@ -211,7 +213,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>   * it_func[0] is never NULL because there is at least one element in the array
>   * when the array itself is non NULL.
>   */
> -#define __DO_TRACE(name, args, cond, rcuidle)				\
> +#define __DO_TRACE(name, args, cond, rcuidle, syscall)			\
>  	do {								\
>  		int __maybe_unused __idx = 0;				\
>  									\
> @@ -222,8 +224,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  			      "Bad RCU usage for tracepoint"))		\
>  			return;						\
>  									\
> -		/* keep srcu and sched-rcu usage consistent */		\
> -		preempt_disable_notrace();				\
> +		if (syscall) {						\
> +			rcu_read_lock_trace();				\
> +		} else {						\
> +			/* keep srcu and sched-rcu usage consistent */	\
> +			preempt_disable_notrace();			\
> +		}							\
>  									\

I'm thinking we just use rcu_read_lock_trace() and get rid of the
preempt_disable and srcu locks for all tracepoints. Oh crap! I should get
rid of srcu locking too, as it was only needed for the rcuidle code :-p

-- Steve


>  		/*							\
>  		 * For rcuidle callers, use srcu since sched-rcu	\
> @@ -241,7 +247,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  			srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
>  		}							\
>  									\
> -		preempt_enable_notrace();				\
> +		if (syscall)						\
> +			rcu_read_unlock_trace();			\
> +		else							\
> +			preempt_enable_notrace();			\
>  	} while (0)
>  
>  #ifndef MODULE
> @@ -251,7 +260,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		if (static_key_false(&__tracepoint_##name.key))		\
>  			__DO_TRACE(name,				\
>  				TP_ARGS(args),				\
> -				TP_CONDITION(cond), 1);			\
> +				TP_CONDITION(cond), 1, 0);		\
>  	}
>  #else
>  #define __DECLARE_TRACE_RCU(name, proto, args, cond)
> @@ -284,7 +293,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		if (static_key_false(&__tracepoint_##name.key))		\
>  			__DO_TRACE(name,				\
>  				TP_ARGS(args),				\
> -				TP_CONDITION(cond), 0);			\
> +				TP_CONDITION(cond), 0, 0);		\
>  		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
>  			WARN_ONCE(!rcu_is_watching(),			\
>  				  "RCU not watching for tracepoint");	\
> @@ -295,7 +304,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		if (static_key_false(&__tracepoint_##name.key))		\
>  			__DO_TRACE(name,				\
>  				TP_ARGS(args),				\
> -				TP_CONDITION(cond), 1);			\
> +				TP_CONDITION(cond), 1, 0);		\
>  	}								\
>  	static inline int						\
>  	register_trace_##name(void (*probe)(data_proto), void *data)	\
> @@ -330,7 +339,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		if (static_key_false(&__tracepoint_##name.key))		\
>  			__DO_TRACE(name,				\
>  				TP_ARGS(args),				\
> -				TP_CONDITION(cond), 0);			\
> +				TP_CONDITION(cond), 0, 1);		\
>  		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
>  			WARN_ONCE(!rcu_is_watching(),			\
>  				  "RCU not watching for tracepoint");	\
> diff --git a/init/Kconfig b/init/Kconfig
> index fbd0cb06a50a..eedd0064fb36 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1984,6 +1984,7 @@ config BINDGEN_VERSION_TEXT
>  #
>  config TRACEPOINTS
>  	bool
> +	select TASKS_TRACE_RCU
>  
>  source "kernel/Kconfig.kexec"
>  


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes
  2024-10-03 15:16 ` [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
@ 2024-10-03 22:36   ` Steven Rostedt
  2024-10-04  0:11     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu,  3 Oct 2024 11:16:36 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 0228d9ed94a3..e0d4850b0d77 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -446,6 +446,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
>  static notrace void							\
>  trace_event_raw_event_##call(void *__data, proto)			\
>  {									\
> +	might_fault();							\

I don't think we want "might_fault()" here, as this is called for every
tracepoint that is created by the TRACE_EVENT() macro. That means, there's
going to be plenty of locations this gets called at that do not allow faults.

-- Steve


>  	guard(preempt_notrace)();					\
>  	do_trace_event_raw_event_##call(__data, args);			\
>  }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 7/8] tracing/perf: Add might_fault check to syscall probes
  2024-10-03 15:16 ` [PATCH v1 7/8] tracing/perf: " Mathieu Desnoyers
@ 2024-10-03 22:37   ` Steven Rostedt
  2024-10-04  0:12     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu,  3 Oct 2024 11:16:37 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Add a might_fault() check to validate that the perf sys_enter/sys_exit
> probe callbacks are indeed called from a context where page faults can
> be handled.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/trace/perf.h          | 1 +
>  kernel/trace/trace_syscalls.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/trace/perf.h b/include/trace/perf.h
> index 5650c1bad088..321bfd7919f6 100644
> --- a/include/trace/perf.h
> +++ b/include/trace/perf.h
> @@ -84,6 +84,7 @@ perf_trace_##call(void *__data, proto)					\
>  	u64 __count __attribute__((unused));				\
>  	struct task_struct *__task __attribute__((unused));		\
>  									\
> +	might_fault();							\
>  	guard(preempt_notrace)();					\
>  	do_perf_trace_##call(__data, args);				\

Same for this. This is used for all tracepoints that perf hooks to.

-- Steve

>  }
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 89d7e4c57b5b..0d42d6f293d6 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -602,6 +602,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  	 * Syscall probe called with preemption enabled, but the ring
>  	 * buffer and per-cpu data require preemption to be disabled.
>  	 */
> +	might_fault();
>  	guard(preempt_notrace)();
>  
>  	syscall_nr = trace_get_syscall_nr(current, regs);
> @@ -710,6 +711,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>  	 * Syscall probe called with preemption enabled, but the ring
>  	 * buffer and per-cpu data require preemption to be disabled.
>  	 */
> +	might_fault();
>  	guard(preempt_notrace)();
>  
>  	syscall_nr = trace_get_syscall_nr(current, regs);


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 8/8] tracing/bpf: Add might_fault check to syscall probes
  2024-10-03 15:16 ` [PATCH v1 8/8] tracing/bpf: " Mathieu Desnoyers
@ 2024-10-03 22:38   ` Steven Rostedt
  2024-10-04  0:13     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-03 22:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On Thu,  3 Oct 2024 11:16:38 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> Add a might_fault() check to validate that the bpf sys_enter/sys_exit
> probe callbacks are indeed called from a context where page faults can
> be handled.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/trace/bpf_probe.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index 211b98d45fc6..099df5c3e38a 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -57,6 +57,7 @@ __bpf_trace_##call(void *__data, proto)					\
>  static notrace void							\
>  __bpf_trace_##call(void *__data, proto)					\
>  {									\
> +	might_fault();							\

And I think this gets called at places that do not allow faults.

-- Steve

>  	guard(preempt_notrace)();					\
>  	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
>  }


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 4/8] tracing/bpf: guard syscall probe with preempt_notrace
  2024-10-03 22:26   ` Steven Rostedt
@ 2024-10-03 23:05     ` Alexei Starovoitov
  2024-10-04  0:30       ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Alexei Starovoitov @ 2024-10-03 23:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Masami Hiramatsu, LKML, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Oct 2024 11:16:34 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > In preparation for allowing system call enter/exit instrumentation to
> > handle page faults, make sure that bpf can handle this change by
> > explicitly disabling preemption within the bpf system call tracepoint
> > probes to respect the current expectations within bpf tracing code.
> >
> > This change does not yet allow bpf to take page faults per se within its
> > probe, but allows its existing probes to adapt to the upcoming change.
> >
>
> I guess the BPF folks should state if this is needed or not?
>
> Does the BPF hooks into the tracepoints expect preemption to be disabled
> when called?

Andrii pointed it out already.
bpf doesn't need preemption to be disabled.
Only migration needs to be disabled.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes
  2024-10-03 22:36   ` Steven Rostedt
@ 2024-10-04  0:11     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 00:36, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:36 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
>> index 0228d9ed94a3..e0d4850b0d77 100644
>> --- a/include/trace/trace_events.h
>> +++ b/include/trace/trace_events.h
>> @@ -446,6 +446,7 @@ __DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
>>   static notrace void							\
>>   trace_event_raw_event_##call(void *__data, proto)			\
>>   {									\
>> +	might_fault();							\
> 
> I don't think we want "might_fault()" here, as this is called for every
> tracepoint that is created by the TRACE_EVENT() macro. That means, there's
> going to be plenty of locations this gets called at that do not allow faults.

Here is the full context where this line applies:

#undef DECLARE_EVENT_SYSCALL_CLASS
#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
                       PARAMS(assign), PARAMS(print))                    \
static notrace void                                                     \
trace_event_raw_event_##call(void *__data, proto)                       \
{                                                                       \
         might_fault();                                                  \
         guard(preempt_notrace)();                                       \
         do_trace_event_raw_event_##call(__data, args);                  \
}

Not an issue, since it's only for syscall tracepoints.

Thanks,

Mathieu

> 
> -- Steve
> 
> 
>>   	guard(preempt_notrace)();					\
>>   	do_trace_event_raw_event_##call(__data, args);			\
>>   }

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 7/8] tracing/perf: Add might_fault check to syscall probes
  2024-10-03 22:37   ` Steven Rostedt
@ 2024-10-04  0:12     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 00:37, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:37 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> Add a might_fault() check to validate that the perf sys_enter/sys_exit
>> probe callbacks are indeed called from a context where page faults can
>> be handled.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Michael Jeanson <mjeanson@efficios.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> ---
>>   include/trace/perf.h          | 1 +
>>   kernel/trace/trace_syscalls.c | 2 ++
>>   2 files changed, 3 insertions(+)
>>
>> diff --git a/include/trace/perf.h b/include/trace/perf.h
>> index 5650c1bad088..321bfd7919f6 100644
>> --- a/include/trace/perf.h
>> +++ b/include/trace/perf.h
>> @@ -84,6 +84,7 @@ perf_trace_##call(void *__data, proto)					\
>>   	u64 __count __attribute__((unused));				\
>>   	struct task_struct *__task __attribute__((unused));		\
>>   									\
>> +	might_fault();							\
>>   	guard(preempt_notrace)();					\
>>   	do_perf_trace_##call(__data, args);				\
> 
> Same for this. This is used for all tracepoints that perf hooks to.

You're also missing the context:

#undef DECLARE_EVENT_SYSCALL_CLASS
#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
                       PARAMS(assign), PARAMS(print))                    \
static notrace void                                                     \
perf_trace_##call(void *__data, proto)                                  \
{                                                                       \
         u64 __count __attribute__((unused));                            \
         struct task_struct *__task __attribute__((unused));             \
                                                                         \
         might_fault();                                                  \
         guard(preempt_notrace)();                                       \
         do_perf_trace_##call(__data, args);                             \
}

Not an issue.

Thanks,

Mathieu


> 
> -- Steve
> 
>>   }
>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>> index 89d7e4c57b5b..0d42d6f293d6 100644
>> --- a/kernel/trace/trace_syscalls.c
>> +++ b/kernel/trace/trace_syscalls.c
>> @@ -602,6 +602,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>>   	 * Syscall probe called with preemption enabled, but the ring
>>   	 * buffer and per-cpu data require preemption to be disabled.
>>   	 */
>> +	might_fault();
>>   	guard(preempt_notrace)();
>>   
>>   	syscall_nr = trace_get_syscall_nr(current, regs);
>> @@ -710,6 +711,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>>   	 * Syscall probe called with preemption enabled, but the ring
>>   	 * buffer and per-cpu data require preemption to be disabled.
>>   	 */
>> +	might_fault();
>>   	guard(preempt_notrace)();
>>   
>>   	syscall_nr = trace_get_syscall_nr(current, regs);
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 8/8] tracing/bpf: Add might_fault check to syscall probes
  2024-10-03 22:38   ` Steven Rostedt
@ 2024-10-04  0:13     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On 2024-10-04 00:38, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:38 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> Add a might_fault() check to validate that the bpf sys_enter/sys_exit
>> probe callbacks are indeed called from a context where page faults can
>> be handled.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
>> Cc: Michael Jeanson <mjeanson@efficios.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> ---
>>   include/trace/bpf_probe.h | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
>> index 211b98d45fc6..099df5c3e38a 100644
>> --- a/include/trace/bpf_probe.h
>> +++ b/include/trace/bpf_probe.h
>> @@ -57,6 +57,7 @@ __bpf_trace_##call(void *__data, proto)					\
>>   static notrace void							\
>>   __bpf_trace_##call(void *__data, proto)					\
>>   {									\
>> +	might_fault();							\
> 
> And I think this gets called at places that do not allow faults.

Context matters:

#undef DECLARE_EVENT_SYSCALL_CLASS
#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
                       PARAMS(assign), PARAMS(print))                    \
static notrace void                                                     \
perf_trace_##call(void *__data, proto)                                  \
{                                                                       \
         u64 __count __attribute__((unused));                            \
         struct task_struct *__task __attribute__((unused));             \
                                                                         \
         might_fault();                                                  \
         guard(preempt_notrace)();                                       \
         do_perf_trace_##call(__data, args);                             \
}

Not an issue.

Thanks,

Mathieu

> 
> -- Steve
> 
>>   	guard(preempt_notrace)();					\
>>   	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
>>   }
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-03 21:32   ` Steven Rostedt
@ 2024-10-04  0:15     ` Mathieu Desnoyers
  2024-10-04  1:06       ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-03 23:32, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:31 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> @@ -283,8 +290,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>   				  "RCU not watching for tracepoint");	\
>>   		}							\
>>   	}								\
>> -	__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),		\
>> -			    PARAMS(cond))				\
>> +	static inline void trace_##name##_rcuidle(proto)		\
>> +	{								\
>> +		if (static_key_false(&__tracepoint_##name.key))		\
>> +			__DO_TRACE(name,				\
>> +				TP_ARGS(args),				\
>> +				TP_CONDITION(cond), 1);			\
>> +	}								\
>>   	static inline int						\
>>   	register_trace_##name(void (*probe)(data_proto), void *data)	\
>>   	{								\
> 
> Looking at this part of your change, I realized it's time to nuke the
> rcuidle() variant.
> 
> Feel free to rebase on top of this patch:
> 
>    https://lore.kernel.org/all/20241003173051.6b178bb3@gandalf.local.home/
> 

I will. But you realize that you could have done all this SRCU and
rcuidle nuking on top of my own series rather than pull the rug
under my feet and require me to re-do this series again ?

Grumpily,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/8] tracing/perf: guard syscall probe with preempt_notrace
  2024-10-03 22:25   ` Steven Rostedt
@ 2024-10-04  0:17     ` Frederic Weisbecker
  0 siblings, 0 replies; 40+ messages in thread
From: Frederic Weisbecker @ 2024-10-04  0:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson, Frederic Weisbecker

Le Thu, Oct 03, 2024 at 06:25:08PM -0400, Steven Rostedt a écrit :
> On Thu,  3 Oct 2024 11:16:33 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > In preparation for allowing system call enter/exit instrumentation to
> > handle page faults, make sure that perf can handle this change by
> > explicitly disabling preemption within the perf system call tracepoint
> > probes to respect the current expectations within perf ring buffer code.
> > 
> > This change does not yet allow perf to take page faults per se within
> > its probe, but allows its existing probes to adapt to the upcoming
> > change.
> 
> Frederic,
> 
> Does the perf ring buffer expect preemption to be disabled when used?
> 
> In other words, is this patch needed?

At least the trace events perf callback requires that because it uses
a per cpu buffer (see perf_trace_buf_alloc()).

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-03 22:23   ` Steven Rostedt
@ 2024-10-04  0:26     ` Mathieu Desnoyers
  2024-10-04  1:04       ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 00:23, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:32 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> In preparation for allowing system call enter/exit instrumentation to
>> handle page faults, make sure that ftrace can handle this change by
>> explicitly disabling preemption within the ftrace system call tracepoint
>> probes to respect the current expectations within ftrace ring buffer
>> code.
> 
> The ftrace ring buffer doesn't expect preemption being disabled before use.
> It will explicitly disable preemption.
> 
> I don't think this patch is needed.

Steve,

Look here:

static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
{
         struct trace_array *tr = data;
         struct trace_event_file *trace_file;
         struct syscall_trace_enter *entry;
         struct syscall_metadata *sys_data;
         struct trace_event_buffer fbuffer;
         unsigned long args[6];
         int syscall_nr;
         int size;

         syscall_nr = trace_get_syscall_nr(current, regs);
         if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
                 return;

         /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
         trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);

^^^^ this function explicitly states that preempt needs to be disabled by
tracepoints.

         if (!trace_file)
                 return;

         if (trace_trigger_soft_disabled(trace_file))
                 return;

         sys_data = syscall_nr_to_meta(syscall_nr);
         if (!sys_data)
                 return;

         size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;

         entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
^^^^ it reserves space in the ring buffer without disabling preemption explicitly.

And also:

void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
                                  struct trace_event_file *trace_file,
                                  unsigned long len)
{
         struct trace_event_call *event_call = trace_file->event_call;

         if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
             trace_event_ignore_this_pid(trace_file))
                 return NULL;

         /*
          * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
          * preemption (adding one to the preempt_count). Since we are
          * interested in the preempt_count at the time the tracepoint was
          * hit, we need to subtract one to offset the increment.
          */
^^^ This function also explicitly expects preemption to be disabled.

So I rest my case. The change I'm introducing for tracepoints
don't make any assumptions about whether or not each tracer require
preempt off or not: it keeps the behavior the _same_ as it was before.

Then it's up to each tracer's developer to change the behavior of their
own callbacks as they see fit. But I'm not introducing regressions in
tracers with the "big switch" change of making syscall tracepoints
faultable. This will belong to changes that are specific to each tracer.

Thanks,

Mathieu

> 
> -- Steve
> 
> 
>>
>> This change does not yet allow ftrace to take page faults per se within
>> its probe, but allows its existing probes to adapt to the upcoming
>> change.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>> Cc: Michael Jeanson <mjeanson@efficios.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 4/8] tracing/bpf: guard syscall probe with preempt_notrace
  2024-10-03 23:05     ` Alexei Starovoitov
@ 2024-10-04  0:30       ` Mathieu Desnoyers
  2024-10-04  1:28         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Steven Rostedt
  Cc: Masami Hiramatsu, LKML, Peter Zijlstra, Alexei Starovoitov,
	Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On 2024-10-04 01:05, Alexei Starovoitov wrote:
> On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Thu,  3 Oct 2024 11:16:34 -0400
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>>> In preparation for allowing system call enter/exit instrumentation to
>>> handle page faults, make sure that bpf can handle this change by
>>> explicitly disabling preemption within the bpf system call tracepoint
>>> probes to respect the current expectations within bpf tracing code.
>>>
>>> This change does not yet allow bpf to take page faults per se within its
>>> probe, but allows its existing probes to adapt to the upcoming change.
>>>
>>
>> I guess the BPF folks should state if this is needed or not?
>>
>> Does the BPF hooks into the tracepoints expect preemption to be disabled
>> when called?
> 
> Andrii pointed it out already.
> bpf doesn't need preemption to be disabled.
> Only migration needs to be disabled.

I'm well aware of this. Feel free to relax those constraints in
follow up patches in your own tracers. I'm simply not introducing
any behavior change in the "big switch" patch introducing faultable
syscall tracepoints. It's just too easy to overlook a dependency on
preempt off deep inside some tracer code for me to make assumptions
at the tracepoint level.

If a regression happens, it will be caused by the tracer-specific
patch that relaxes the constraints, not by the tracepoint change
that affects multiple tracers at once.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults
  2024-10-03 22:29   ` Steven Rostedt
@ 2024-10-04  0:35     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  0:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 00:29, Steven Rostedt wrote:
> On Thu,  3 Oct 2024 11:16:35 -0400
[...[
>> -#define __DO_TRACE(name, args, cond, rcuidle)				\
>> +#define __DO_TRACE(name, args, cond, rcuidle, syscall)			\
>>   	do {								\
>>   		int __maybe_unused __idx = 0;				\
>>   									\
>> @@ -222,8 +224,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>   			      "Bad RCU usage for tracepoint"))		\
>>   			return;						\
>>   									\
>> -		/* keep srcu and sched-rcu usage consistent */		\
>> -		preempt_disable_notrace();				\
>> +		if (syscall) {						\
>> +			rcu_read_lock_trace();				\
>> +		} else {						\
>> +			/* keep srcu and sched-rcu usage consistent */	\
>> +			preempt_disable_notrace();			\
>> +		}							\
>>   									\
> 
> I'm thinking we just use rcu_read_lock_trace() and get rid of the
> preempt_disable and srcu locks for all tracepoints. Oh crap! I should get
> rid of srcu locking too, as it was only needed for the rcuidle code :-p

How about we do it one step at a time ? First introduce use of the
(lightly tested) rcu_read_lock_trace() (at least in comparison with
preempt disable RCU) only for syscalls, and if this works well,
then eventually consider moving the preempt off users to
rcu_read_lock_trace as well ?

Of course it should all work well, in theory. But considering the
vast number of tracepoints we have in the kernel, I am reluctant
to change too many things at once in that area. We may very well
be bitten by unforeseen corner-cases.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04  0:26     ` Mathieu Desnoyers
@ 2024-10-04  1:04       ` Steven Rostedt
  2024-10-04  1:33         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04  1:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu, 3 Oct 2024 20:26:29 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


> static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> {
>          struct trace_array *tr = data;
>          struct trace_event_file *trace_file;
>          struct syscall_trace_enter *entry;
>          struct syscall_metadata *sys_data;
>          struct trace_event_buffer fbuffer;
>          unsigned long args[6];
>          int syscall_nr;
>          int size;
> 
>          syscall_nr = trace_get_syscall_nr(current, regs);
>          if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>                  return;
> 
>          /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
>          trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> 
> ^^^^ this function explicitly states that preempt needs to be disabled by
> tracepoints.

Ah, I should have known it was the syscall portion. I don't care for this
hidden dependency. I rather add a preempt disable here and not expect it to
be disabled when called.

> 
>          if (!trace_file)
>                  return;
> 
>          if (trace_trigger_soft_disabled(trace_file))
>                  return;
> 
>          sys_data = syscall_nr_to_meta(syscall_nr);
>          if (!sys_data)
>                  return;
> 
>          size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> 
>          entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
> ^^^^ it reserves space in the ring buffer without disabling preemption explicitly.
> 
> And also:
> 
> void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
>                                   struct trace_event_file *trace_file,
>                                   unsigned long len)
> {
>          struct trace_event_call *event_call = trace_file->event_call;
> 
>          if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
>              trace_event_ignore_this_pid(trace_file))
>                  return NULL;
> 
>          /*
>           * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
>           * preemption (adding one to the preempt_count). Since we are
>           * interested in the preempt_count at the time the tracepoint was
>           * hit, we need to subtract one to offset the increment.
>           */
> ^^^ This function also explicitly expects preemption to be disabled.
> 
> So I rest my case. The change I'm introducing for tracepoints
> don't make any assumptions about whether or not each tracer require
> preempt off or not: it keeps the behavior the _same_ as it was before.
> 
> Then it's up to each tracer's developer to change the behavior of their
> own callbacks as they see fit. But I'm not introducing regressions in
> tracers with the "big switch" change of making syscall tracepoints
> faultable. This will belong to changes that are specific to each tracer.


I rather remove these dependencies at the source. So, IMHO, these places
should be "fixed" first.

At least for the ftrace users. But I think the same can be done for the
other users as well. BPF already stated it just needs "migrate_disable()".
Let's see what perf has.

We can then audit all the tracepoint users to make sure they do not need
preemption disabled.

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-04  0:15     ` Mathieu Desnoyers
@ 2024-10-04  1:06       ` Steven Rostedt
  2024-10-04  1:34         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04  1:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu, 3 Oct 2024 20:15:25 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > Feel free to rebase on top of this patch:
> > 
> >    https://lore.kernel.org/all/20241003173051.6b178bb3@gandalf.local.home/
> >   
> 
> I will. But you realize that you could have done all this SRCU and
> rcuidle nuking on top of my own series rather than pull the rug
> under my feet and require me to re-do this series again ?

I thought I was doing you a favor! It's removing a lot of code and would
make your code simpler. ;-)

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 4/8] tracing/bpf: guard syscall probe with preempt_notrace
  2024-10-04  0:30       ` Mathieu Desnoyers
@ 2024-10-04  1:28         ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  1:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Steven Rostedt
  Cc: Masami Hiramatsu, LKML, Peter Zijlstra, Alexei Starovoitov,
	Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Andrii Nakryiko, Michael Jeanson

On 2024-10-04 02:30, Mathieu Desnoyers wrote:
> On 2024-10-04 01:05, Alexei Starovoitov wrote:
>> On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> 
>> wrote:
>>>
>>> On Thu,  3 Oct 2024 11:16:34 -0400
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>>> In preparation for allowing system call enter/exit instrumentation to
>>>> handle page faults, make sure that bpf can handle this change by
>>>> explicitly disabling preemption within the bpf system call tracepoint
>>>> probes to respect the current expectations within bpf tracing code.
>>>>
>>>> This change does not yet allow bpf to take page faults per se within 
>>>> its
>>>> probe, but allows its existing probes to adapt to the upcoming change.
>>>>
>>>
>>> I guess the BPF folks should state if this is needed or not?
>>>
>>> Does the BPF hooks into the tracepoints expect preemption to be disabled
>>> when called?
>>
>> Andrii pointed it out already.
>> bpf doesn't need preemption to be disabled.
>> Only migration needs to be disabled.
> 
> I'm well aware of this. Feel free to relax those constraints in
> follow up patches in your own tracers. I'm simply not introducing
> any behavior change in the "big switch" patch introducing faultable
> syscall tracepoints. It's just too easy to overlook a dependency on
> preempt off deep inside some tracer code for me to make assumptions
> at the tracepoint level.
> 
> If a regression happens, it will be caused by the tracer-specific
> patch that relaxes the constraints, not by the tracepoint change
> that affects multiple tracers at once.

I also notice that the bpf verifier checks a "active_preempt_lock"
state to make sure sleepable functions are not called while within
preempt off region. So I would expect that the verifier has some
knowledge about the fact that tracepoint probes are called with
preempt off already.

Likewise in reverse for functions which deal with per-cpu data: those
would expect to be used with preempt off if multiple functions need to
touch the same cpu's data.

So if we make the syscall tracepoint constraints more relax (migrate
off rather than preempt off), I suspect we may have to update the
verifier.

This contributes to my uneasiness towards introducing this kind of
side-effect in a tracepoint change that affects all tracers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04  1:04       ` Steven Rostedt
@ 2024-10-04  1:33         ` Mathieu Desnoyers
  2024-10-04 13:26           ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  1:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 03:04, Steven Rostedt wrote:
> On Thu, 3 Oct 2024 20:26:29 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> 
>> static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>> {
>>           struct trace_array *tr = data;
>>           struct trace_event_file *trace_file;
>>           struct syscall_trace_enter *entry;
>>           struct syscall_metadata *sys_data;
>>           struct trace_event_buffer fbuffer;
>>           unsigned long args[6];
>>           int syscall_nr;
>>           int size;
>>
>>           syscall_nr = trace_get_syscall_nr(current, regs);
>>           if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>>                   return;
>>
>>           /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
>>           trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
>>
>> ^^^^ this function explicitly states that preempt needs to be disabled by
>> tracepoints.
> 
> Ah, I should have known it was the syscall portion. I don't care for this
> hidden dependency. I rather add a preempt disable here and not expect it to
> be disabled when called.

Which is exactly what this patch is doing.

> 
>>
>>           if (!trace_file)
>>                   return;
>>
>>           if (trace_trigger_soft_disabled(trace_file))
>>                   return;
>>
>>           sys_data = syscall_nr_to_meta(syscall_nr);
>>           if (!sys_data)
>>                   return;
>>
>>           size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>>
>>           entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
>> ^^^^ it reserves space in the ring buffer without disabling preemption explicitly.
>>
>> And also:
>>
>> void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
>>                                    struct trace_event_file *trace_file,
>>                                    unsigned long len)
>> {
>>           struct trace_event_call *event_call = trace_file->event_call;
>>
>>           if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
>>               trace_event_ignore_this_pid(trace_file))
>>                   return NULL;
>>
>>           /*
>>            * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
>>            * preemption (adding one to the preempt_count). Since we are
>>            * interested in the preempt_count at the time the tracepoint was
>>            * hit, we need to subtract one to offset the increment.
>>            */
>> ^^^ This function also explicitly expects preemption to be disabled.
>>
>> So I rest my case. The change I'm introducing for tracepoints
>> don't make any assumptions about whether or not each tracer require
>> preempt off or not: it keeps the behavior the _same_ as it was before.
>>
>> Then it's up to each tracer's developer to change the behavior of their
>> own callbacks as they see fit. But I'm not introducing regressions in
>> tracers with the "big switch" change of making syscall tracepoints
>> faultable. This will belong to changes that are specific to each tracer.
> 
> 
> I rather remove these dependencies at the source. So, IMHO, these places
> should be "fixed" first.
> 
> At least for the ftrace users. But I think the same can be done for the
> other users as well. BPF already stated it just needs "migrate_disable()".
> Let's see what perf has.
> 
> We can then audit all the tracepoint users to make sure they do not need
> preemption disabled.

Why does it need to be a broad refactoring of the entire world ? What is
wrong with the simple approach of introducing this tracepoint faultable
syscall support as a no-op from the tracer's perspective ?

Then we can build on top and figure out if we want to relax things
on a tracer-per-tracer basis.

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-04  1:06       ` Steven Rostedt
@ 2024-10-04  1:34         ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04  1:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 03:06, Steven Rostedt wrote:
> On Thu, 3 Oct 2024 20:15:25 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>> Feel free to rebase on top of this patch:
>>>
>>>     https://lore.kernel.org/all/20241003173051.6b178bb3@gandalf.local.home/
>>>    
>>
>> I will. But you realize that you could have done all this SRCU and
>> rcuidle nuking on top of my own series rather than pull the rug
>> under my feet and require me to re-do this series again ?
> 
> I thought I was doing you a favor! It's removing a lot of code and would
> make your code simpler. ;-)

The rebase was indeed not so bad.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
  2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
  2024-10-03 21:32   ` Steven Rostedt
@ 2024-10-04 10:34   ` kernel test robot
  1 sibling, 0 replies; 40+ messages in thread
From: kernel test robot @ 2024-10-04 10:34 UTC (permalink / raw)
  To: Mathieu Desnoyers, Steven Rostedt, Masami Hiramatsu
  Cc: oe-kbuild-all, linux-kernel, Mathieu Desnoyers, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

Hi Mathieu,

kernel test robot noticed the following build errors:

[auto build test ERROR on peterz-queue/sched/core]
[also build test ERROR on linus/master tip/core/entry v6.12-rc1 next-20241004]
[cannot apply to rostedt-trace/for-next rostedt-trace/for-next-urgent]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/tracing-Declare-system-call-tracepoints-with-TRACE_EVENT_SYSCALL/20241003-232114
base:   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
patch link:    https://lore.kernel.org/r/20241003151638.1608537-2-mathieu.desnoyers%40efficios.com
patch subject: [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL
config: powerpc-randconfig-r071-20241004 (https://download.01.org/0day-ci/archive/20241004/202410041838.pOZuOGTX-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241004/202410041838.pOZuOGTX-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410041838.pOZuOGTX-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/ptrace/ptrace.c: In function 'do_syscall_trace_enter':
>> arch/powerpc/kernel/ptrace/ptrace.c:298:17: error: implicit declaration of function 'trace_sys_enter'; did you mean 'ftrace_nmi_enter'? [-Wimplicit-function-declaration]
     298 |                 trace_sys_enter(regs, regs->gpr[0]);
         |                 ^~~~~~~~~~~~~~~
         |                 ftrace_nmi_enter
   arch/powerpc/kernel/ptrace/ptrace.c: In function 'do_syscall_trace_leave':
>> arch/powerpc/kernel/ptrace/ptrace.c:329:17: error: implicit declaration of function 'trace_sys_exit'; did you mean 'ftrace_nmi_exit'? [-Wimplicit-function-declaration]
     329 |                 trace_sys_exit(regs, regs->result);
         |                 ^~~~~~~~~~~~~~
         |                 ftrace_nmi_exit

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for GET_FREE_REGION
   Depends on [n]: SPARSEMEM [=n]
   Selected by [y]:
   - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]


vim +298 arch/powerpc/kernel/ptrace/ptrace.c

2449acc5348b94 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  235  
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  236  /**
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  237   * do_syscall_trace_enter() - Do syscall tracing on kernel entry.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  238   * @regs: the pt_regs of the task to trace (current)
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  239   *
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  240   * Performs various types of tracing on syscall entry. This includes seccomp,
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  241   * ptrace, syscall tracepoints and audit.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  242   *
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  243   * The pt_regs are potentially visible to userspace via ptrace, so their
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  244   * contents is ABI.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  245   *
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  246   * One or more of the tracers may modify the contents of pt_regs, in particular
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  247   * to modify arguments or even the syscall number itself.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  248   *
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  249   * It's also possible that a tracer can choose to reject the system call. In
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  250   * that case this function will return an illegal syscall number, and will put
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  251   * an appropriate return value in regs->r3.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  252   *
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  253   * Return: the (possibly changed) syscall number.
^1da177e4c3f41 arch/ppc/kernel/ptrace.c            Linus Torvalds    2005-04-16  254   */
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  255  long do_syscall_trace_enter(struct pt_regs *regs)
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  256  {
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  257  	u32 flags;
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  258  
985faa78687de6 arch/powerpc/kernel/ptrace/ptrace.c Mark Rutland      2021-11-29  259  	flags = read_thread_flags() & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  260  
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  261  	if (flags) {
153474ba1a4aed arch/powerpc/kernel/ptrace/ptrace.c Eric W. Biederman 2022-01-27  262  		int rc = ptrace_report_syscall_entry(regs);
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  263  
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  264  		if (unlikely(flags & _TIF_SYSCALL_EMU)) {
5521eb4bca2db7 arch/powerpc/kernel/ptrace.c        Breno Leitao      2018-09-20  265  			/*
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  266  			 * A nonzero return code from
153474ba1a4aed arch/powerpc/kernel/ptrace/ptrace.c Eric W. Biederman 2022-01-27  267  			 * ptrace_report_syscall_entry() tells us to prevent
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  268  			 * the syscall execution, but we are not going to
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  269  			 * execute it anyway.
a225f156740555 arch/powerpc/kernel/ptrace.c        Elvira Khabirova  2018-12-07  270  			 *
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  271  			 * Returning -1 will skip the syscall execution. We want
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  272  			 * to avoid clobbering any registers, so we don't goto
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  273  			 * the skip label below.
5521eb4bca2db7 arch/powerpc/kernel/ptrace.c        Breno Leitao      2018-09-20  274  			 */
5521eb4bca2db7 arch/powerpc/kernel/ptrace.c        Breno Leitao      2018-09-20  275  			return -1;
5521eb4bca2db7 arch/powerpc/kernel/ptrace.c        Breno Leitao      2018-09-20  276  		}
5521eb4bca2db7 arch/powerpc/kernel/ptrace.c        Breno Leitao      2018-09-20  277  
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  278  		if (rc) {
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  279  			/*
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  280  			 * The tracer decided to abort the syscall. Note that
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  281  			 * the tracer may also just change regs->gpr[0] to an
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  282  			 * invalid syscall number, that is handled below on the
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  283  			 * exit path.
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  284  			 */
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  285  			goto skip;
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  286  		}
8dbdec0bcb416d arch/powerpc/kernel/ptrace.c        Dmitry V. Levin   2018-12-16  287  	}
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  288  
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  289  	/* Run seccomp after ptrace; allow it to set gpr[3]. */
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  290  	if (do_seccomp(regs))
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  291  		return -1;
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  292  
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  293  	/* Avoid trace and audit when syscall is invalid. */
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  294  	if (regs->gpr[0] >= NR_syscalls)
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  295  		goto skip;
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  296  
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02  297  	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02 @298  		trace_sys_enter(regs, regs->gpr[0]);
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02  299  
cab175f9fa2973 arch/powerpc/kernel/ptrace.c        Denis Kirjanov    2010-08-27  300  	if (!is_32bit_task())
91397401bb5072 arch/powerpc/kernel/ptrace.c        Eric Paris        2014-03-11  301  		audit_syscall_entry(regs->gpr[0], regs->gpr[3], regs->gpr[4],
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  302  				    regs->gpr[5], regs->gpr[6]);
cfcd1705b61ecc arch/powerpc/kernel/ptrace.c        David Woodhouse   2007-01-14  303  	else
91397401bb5072 arch/powerpc/kernel/ptrace.c        Eric Paris        2014-03-11  304  		audit_syscall_entry(regs->gpr[0],
cfcd1705b61ecc arch/powerpc/kernel/ptrace.c        David Woodhouse   2007-01-14  305  				    regs->gpr[3] & 0xffffffff,
cfcd1705b61ecc arch/powerpc/kernel/ptrace.c        David Woodhouse   2007-01-14  306  				    regs->gpr[4] & 0xffffffff,
cfcd1705b61ecc arch/powerpc/kernel/ptrace.c        David Woodhouse   2007-01-14  307  				    regs->gpr[5] & 0xffffffff,
cfcd1705b61ecc arch/powerpc/kernel/ptrace.c        David Woodhouse   2007-01-14  308  				    regs->gpr[6] & 0xffffffff);
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  309  
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  310  	/* Return the possibly modified but valid syscall number */
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  311  	return regs->gpr[0];
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  312  
1addc57e111b92 arch/powerpc/kernel/ptrace.c        Kees Cook         2016-06-02  313  skip:
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  314  	/*
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  315  	 * If we are aborting explicitly, or if the syscall number is
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  316  	 * now invalid, set the return value to -ENOSYS.
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  317  	 */
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  318  	regs->gpr[3] = -ENOSYS;
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  319  	return -1;
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  320  }
d38374142b2560 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2015-07-23  321  
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  322  void do_syscall_trace_leave(struct pt_regs *regs)
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  323  {
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  324  	int step;
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  325  
d7e7528bcd456f arch/powerpc/kernel/ptrace.c        Eric Paris        2012-01-03  326  	audit_syscall_exit(regs);
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  327  
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02  328  	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02 @329  		trace_sys_exit(regs, regs->result);
02424d8966d803 arch/powerpc/kernel/ptrace.c        Ian Munsie        2011-02-02  330  
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  331  	step = test_thread_flag(TIF_SINGLESTEP);
4f72c4279eab1e arch/powerpc/kernel/ptrace.c        Roland McGrath    2008-07-27  332  	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
153474ba1a4aed arch/powerpc/kernel/ptrace/ptrace.c Eric W. Biederman 2022-01-27  333  		ptrace_report_syscall_exit(regs, step);
ea9c102cb0a796 arch/ppc/kernel/ptrace.c            David Woodhouse   2005-05-08  334  }
002af9391bfbe8 arch/powerpc/kernel/ptrace.c        Michael Ellerman  2018-10-12  335  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04  1:33         ` Mathieu Desnoyers
@ 2024-10-04 13:26           ` Steven Rostedt
  2024-10-04 14:18             ` Mathieu Desnoyers
  2024-10-04 14:19             ` Mathieu Desnoyers
  0 siblings, 2 replies; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04 13:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Thu, 3 Oct 2024 21:33:16 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2024-10-04 03:04, Steven Rostedt wrote:
> > On Thu, 3 Oct 2024 20:26:29 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> >   
> >> static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> >> {
> >>           struct trace_array *tr = data;
> >>           struct trace_event_file *trace_file;
> >>           struct syscall_trace_enter *entry;
> >>           struct syscall_metadata *sys_data;
> >>           struct trace_event_buffer fbuffer;
> >>           unsigned long args[6];
> >>           int syscall_nr;
> >>           int size;
> >>
> >>           syscall_nr = trace_get_syscall_nr(current, regs);
> >>           if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> >>                   return;
> >>
> >>           /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> >>           trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> >>
> >> ^^^^ this function explicitly states that preempt needs to be disabled by
> >> tracepoints.  
> > 
> > Ah, I should have known it was the syscall portion. I don't care for this
> > hidden dependency. I rather add a preempt disable here and not expect it to
> > be disabled when called.  
> 
> Which is exactly what this patch is doing.

I was thinking of putting the protection in the function and not the macro.

> 
> >   
> >>
> >>           if (!trace_file)
> >>                   return;
> >>
> >>           if (trace_trigger_soft_disabled(trace_file))
> >>                   return;
> >>
> >>           sys_data = syscall_nr_to_meta(syscall_nr);
> >>           if (!sys_data)
> >>                   return;
> >>
> >>           size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> >>
> >>           entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
> >> ^^^^ it reserves space in the ring buffer without disabling preemption explicitly.
> >>
> >> And also:
> >>
> >> void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,
> >>                                    struct trace_event_file *trace_file,
> >>                                    unsigned long len)
> >> {
> >>           struct trace_event_call *event_call = trace_file->event_call;
> >>
> >>           if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&
> >>               trace_event_ignore_this_pid(trace_file))
> >>                   return NULL;
> >>
> >>           /*
> >>            * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
> >>            * preemption (adding one to the preempt_count). Since we are
> >>            * interested in the preempt_count at the time the tracepoint was
> >>            * hit, we need to subtract one to offset the increment.
> >>            */
> >> ^^^ This function also explicitly expects preemption to be disabled.
> >>
> >> So I rest my case. The change I'm introducing for tracepoints
> >> don't make any assumptions about whether or not each tracer require
> >> preempt off or not: it keeps the behavior the _same_ as it was before.
> >>
> >> Then it's up to each tracer's developer to change the behavior of their
> >> own callbacks as they see fit. But I'm not introducing regressions in
> >> tracers with the "big switch" change of making syscall tracepoints
> >> faultable. This will belong to changes that are specific to each tracer.  
> > 
> > 
> > I rather remove these dependencies at the source. So, IMHO, these places
> > should be "fixed" first.
> > 
> > At least for the ftrace users. But I think the same can be done for the
> > other users as well. BPF already stated it just needs "migrate_disable()".
> > Let's see what perf has.
> > 
> > We can then audit all the tracepoint users to make sure they do not need
> > preemption disabled.  
> 
> Why does it need to be a broad refactoring of the entire world ? What is
> wrong with the simple approach of introducing this tracepoint faultable
> syscall support as a no-op from the tracer's perspective ?

Because we want in-tree users too ;-)

> 
> Then we can build on top and figure out if we want to relax things
> on a tracer-per-tracer basis.

Looking deeper into how ftrace can implement this, it may require some more
work. Doing it your way may be fine for now, but we need this working for
something in-tree instead of having it only work for LTTng.

Note, it doesn't have to be ftrace either. It could be perf or BPF. Or
simply the sframe code (doing stack traces at the entry of system calls).

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 13:26           ` Steven Rostedt
@ 2024-10-04 14:18             ` Mathieu Desnoyers
  2024-10-04 14:45               ` Steven Rostedt
  2024-10-04 14:19             ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 14:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 15:26, Steven Rostedt wrote:
> On Thu, 3 Oct 2024 21:33:16 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2024-10-04 03:04, Steven Rostedt wrote:
>>> On Thu, 3 Oct 2024 20:26:29 -0400
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>>    
>>>> static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>>>> {
>>>>            struct trace_array *tr = data;
>>>>            struct trace_event_file *trace_file;
>>>>            struct syscall_trace_enter *entry;
>>>>            struct syscall_metadata *sys_data;
>>>>            struct trace_event_buffer fbuffer;
>>>>            unsigned long args[6];
>>>>            int syscall_nr;
>>>>            int size;
>>>>
>>>>            syscall_nr = trace_get_syscall_nr(current, regs);
>>>>            if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
>>>>                    return;
>>>>
>>>>            /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
>>>>            trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
>>>>
>>>> ^^^^ this function explicitly states that preempt needs to be disabled by
>>>> tracepoints.
>>>
>>> Ah, I should have known it was the syscall portion. I don't care for this
>>> hidden dependency. I rather add a preempt disable here and not expect it to
>>> be disabled when called.
>>
>> Which is exactly what this patch is doing.
> 
> I was thinking of putting the protection in the function and not the macro.

I'm confused by your comment. The protection is added to the function here:

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 67ac5366f724..ab4db8c23f36 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -299,6 +299,12 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
         int syscall_nr;
         int size;
  
+       /*
+        * Syscall probe called with preemption enabled, but the ring
+        * buffer and per-cpu data require preemption to be disabled.
+        */
+       guard(preempt_notrace)();
+
         syscall_nr = trace_get_syscall_nr(current, regs);
         if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
                 return;
@@ -338,6 +344,12 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
         struct trace_event_buffer fbuffer;
         int syscall_nr;
  
+       /*
+        * Syscall probe called with preemption enabled, but the ring
+        * buffer and per-cpu data require preemption to be disabled.
+        */
+       guard(preempt_notrace)();
+
         syscall_nr = trace_get_syscall_nr(current, regs);
         if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
                 return;

(I'll answer to the rest of your message in a separate email)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 13:26           ` Steven Rostedt
  2024-10-04 14:18             ` Mathieu Desnoyers
@ 2024-10-04 14:19             ` Mathieu Desnoyers
  2024-10-04 14:52               ` Steven Rostedt
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 14:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 15:26, Steven Rostedt wrote:
> On Thu, 3 Oct 2024 21:33:16 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2024-10-04 03:04, Steven Rostedt wrote:
>>> On Thu, 3 Oct 2024 20:26:29 -0400
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>

[...]

>>>> So I rest my case. The change I'm introducing for tracepoints
>>>> don't make any assumptions about whether or not each tracer require
>>>> preempt off or not: it keeps the behavior the _same_ as it was before.
>>>>
>>>> Then it's up to each tracer's developer to change the behavior of their
>>>> own callbacks as they see fit. But I'm not introducing regressions in
>>>> tracers with the "big switch" change of making syscall tracepoints
>>>> faultable. This will belong to changes that are specific to each tracer.
>>>
>>>
>>> I rather remove these dependencies at the source. So, IMHO, these places
>>> should be "fixed" first.
>>>
>>> At least for the ftrace users. But I think the same can be done for the
>>> other users as well. BPF already stated it just needs "migrate_disable()".
>>> Let's see what perf has.
>>>
>>> We can then audit all the tracepoint users to make sure they do not need
>>> preemption disabled.
>>
>> Why does it need to be a broad refactoring of the entire world ? What is
>> wrong with the simple approach of introducing this tracepoint faultable
>> syscall support as a no-op from the tracer's perspective ?
> 
> Because we want in-tree users too ;-)

This series is infrastructure work that allows all in-tree tracers to
start handling page faults for sys enter/exit events. Can we simply
do the tracer-specific work on top of this infrastructure series rather
than do everything at once ?

Regarding test coverage, this series modifies each tracer syscall probe
code to add a might_fault(), which ensures a page fault can indeed be
serviced at this point.

>>
>> Then we can build on top and figure out if we want to relax things
>> on a tracer-per-tracer basis.
> 
> Looking deeper into how ftrace can implement this, it may require some more
> work. Doing it your way may be fine for now, but we need this working for
> something in-tree instead of having it only work for LTTng.

There is nothing LTTng-specific here. LTTng only has feature branches
to test it out for now. Once we have the infrastructure in place we
can discuss how each tracer can use this. I've recently enumerated
various approaches that can be taken by tracers to handle page faults:

https://lore.kernel.org/lkml/c2a2db4b-4409-4f3c-9959-53622fd8dfa7@efficios.com/

> Note, it doesn't have to be ftrace either. It could be perf or BPF. Or
> simply the sframe code (doing stack traces at the entry of system calls).

Steven, we've been working on faultable tracepoints for four years now:

https://lore.kernel.org/lkml/20201023195352.26269-1-mjeanson@efficios.com/ [2020]
https://lore.kernel.org/lkml/20210218222125.46565-1-mjeanson@efficios.com/ [2021]
https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoyers@efficios.com/ [2023]
https://lore.kernel.org/lkml/20231120205418.334172-1-mathieu.desnoyers@efficios.com/ [2023]
https://lore.kernel.org/lkml/20240626185941.68420-1-mathieu.desnoyers@efficios.com/ [2024]
https://lore.kernel.org/lkml/20240828144153.829582-1-mathieu.desnoyers@efficios.com/ [2024]
https://lore.kernel.org/lkml/20240909201652.319406-1-mathieu.desnoyers@efficios.com/ [2024]
https://lore.kernel.org/lkml/20241003151638.1608537-1-mathieu.desnoyers@efficios.com/ [2024] (current)

The eBPF people want to leverage this. When I last discussed this with
eBPF maintainers, they were open to adapt eBPF after this infrastructure
series is merged. Based on this eBPF attempt from 2022:

https://lore.kernel.org/lkml/c323bce9-a04e-b1c3-580a-783fde259d60@fb.com/

The sframe code is just getting in shape (2024), but is far from being ready.

Everyone appears to be waiting for this infrastructure work to go in
before they can build on top. Once this infrastructure is available,
multiple groups can start working on introducing use of this into their
own code in parallel.

Four years into this effort, and this is the first time we're told we need
to adapt in-tree tracers to handle the page faults before this can go in.

Could you please stop moving the goal posts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 14:18             ` Mathieu Desnoyers
@ 2024-10-04 14:45               ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04 14:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Fri, 4 Oct 2024 10:18:59 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2024-10-04 15:26, Steven Rostedt wrote:
> > On Thu, 3 Oct 2024 21:33:16 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >   
> >> On 2024-10-04 03:04, Steven Rostedt wrote:  
> >>> On Thu, 3 Oct 2024 20:26:29 -0400
> >>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >>>
> >>>      
> >>>> static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
> >>>> {
> >>>>            struct trace_array *tr = data;
> >>>>            struct trace_event_file *trace_file;
> >>>>            struct syscall_trace_enter *entry;
> >>>>            struct syscall_metadata *sys_data;
> >>>>            struct trace_event_buffer fbuffer;
> >>>>            unsigned long args[6];
> >>>>            int syscall_nr;
> >>>>            int size;
> >>>>
> >>>>            syscall_nr = trace_get_syscall_nr(current, regs);
> >>>>            if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
> >>>>                    return;
> >>>>
> >>>>            /* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
> >>>>            trace_file = rcu_dereference_sched(tr->enter_syscall_files[syscall_nr]);
> >>>>
> >>>> ^^^^ this function explicitly states that preempt needs to be disabled by
> >>>> tracepoints.  
> >>>
> >>> Ah, I should have known it was the syscall portion. I don't care for this
> >>> hidden dependency. I rather add a preempt disable here and not expect it to
> >>> be disabled when called.  
> >>
> >> Which is exactly what this patch is doing.  
> > 
> > I was thinking of putting the protection in the function and not the macro.  
> 
> I'm confused by your comment. The protection is added to the function here:

Ah, sorry. I'm the one confused. I was talking about this part:

> +#undef DECLARE_EVENT_SYSCALL_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print) \
> +__DECLARE_EVENT_CLASS(call, PARAMS(proto), PARAMS(args), PARAMS(tstruct), \
> +		      PARAMS(assign), PARAMS(print))			\
> +static notrace void							\
> +trace_event_raw_event_##call(void *__data, proto)			\
> +{									\
> +	guard(preempt_notrace)();					\
> +	do_trace_event_raw_event_##call(__data, args);			\
> +}
> +

But that's for the non-syscall case.

This is why I shouldn't review patches just before going to bed :-p

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 14:52               ` Steven Rostedt
@ 2024-10-04 14:51                 ` Mathieu Desnoyers
  2024-10-04 20:04                   ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 14:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On 2024-10-04 16:52, Steven Rostedt wrote:
> On Fri, 4 Oct 2024 10:19:36 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> The eBPF people want to leverage this. When I last discussed this with
>> eBPF maintainers, they were open to adapt eBPF after this infrastructure
>> series is merged. Based on this eBPF attempt from 2022:
>>
>> https://lore.kernel.org/lkml/c323bce9-a04e-b1c3-580a-783fde259d60@fb.com/
> 
> Sorry, I wasn't part of that discussion.
> 
>>
>> The sframe code is just getting in shape (2024), but is far from being ready.
>>
>> Everyone appears to be waiting for this infrastructure work to go in
>> before they can build on top. Once this infrastructure is available,
>> multiple groups can start working on introducing use of this into their
>> own code in parallel.
>>
>> Four years into this effort, and this is the first time we're told we need
>> to adapt in-tree tracers to handle the page faults before this can go in.
>>
>> Could you please stop moving the goal posts ?
> 
> I don't think I'm moving the goal posts. I was mentioning to show an
> in-tree user. If BPF wants this, I'm all for it. The only thing I saw was a
> generalization in the cover letter about perf, bpf and ftrace using
> faultible tracepoints. I just wanted to see a path for that happening.

AFAIU eBPF folks are very eager to start making use of this, so we won't
have to wait long.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 14:19             ` Mathieu Desnoyers
@ 2024-10-04 14:52               ` Steven Rostedt
  2024-10-04 14:51                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04 14:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes,
	linux-trace-kernel, Michael Jeanson

On Fri, 4 Oct 2024 10:19:36 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> The eBPF people want to leverage this. When I last discussed this with
> eBPF maintainers, they were open to adapt eBPF after this infrastructure
> series is merged. Based on this eBPF attempt from 2022:
> 
> https://lore.kernel.org/lkml/c323bce9-a04e-b1c3-580a-783fde259d60@fb.com/

Sorry, I wasn't part of that discussion.

> 
> The sframe code is just getting in shape (2024), but is far from being ready.
> 
> Everyone appears to be waiting for this infrastructure work to go in
> before they can build on top. Once this infrastructure is available,
> multiple groups can start working on introducing use of this into their
> own code in parallel.
> 
> Four years into this effort, and this is the first time we're told we need
> to adapt in-tree tracers to handle the page faults before this can go in.
> 
> Could you please stop moving the goal posts ?

I don't think I'm moving the goal posts. I was mentioning to show an
in-tree user. If BPF wants this, I'm all for it. The only thing I saw was a
generalization in the cover letter about perf, bpf and ftrace using
faultible tracepoints. I just wanted to see a path for that happening.

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 14:51                 ` Mathieu Desnoyers
@ 2024-10-04 20:04                   ` Andrii Nakryiko
  2024-10-04 20:59                     ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2024-10-04 20:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel,
	Michael Jeanson

On Fri, Oct 4, 2024 at 7:53 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-04 16:52, Steven Rostedt wrote:
> > On Fri, 4 Oct 2024 10:19:36 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> >> The eBPF people want to leverage this. When I last discussed this with
> >> eBPF maintainers, they were open to adapt eBPF after this infrastructure
> >> series is merged. Based on this eBPF attempt from 2022:
> >>
> >> https://lore.kernel.org/lkml/c323bce9-a04e-b1c3-580a-783fde259d60@fb.com/
> >
> > Sorry, I wasn't part of that discussion.
> >
> >>
> >> The sframe code is just getting in shape (2024), but is far from being ready.
> >>
> >> Everyone appears to be waiting for this infrastructure work to go in
> >> before they can build on top. Once this infrastructure is available,
> >> multiple groups can start working on introducing use of this into their
> >> own code in parallel.
> >>
> >> Four years into this effort, and this is the first time we're told we need
> >> to adapt in-tree tracers to handle the page faults before this can go in.
> >>
> >> Could you please stop moving the goal posts ?
> >
> > I don't think I'm moving the goal posts. I was mentioning to show an
> > in-tree user. If BPF wants this, I'm all for it. The only thing I saw was a
> > generalization in the cover letter about perf, bpf and ftrace using
> > faultible tracepoints. I just wanted to see a path for that happening.
>
> AFAIU eBPF folks are very eager to start making use of this, so we won't
> have to wait long.

I already gave my ack on BPF parts of this patch set, but I'll
elaborate a bit more here for the record. There seems to be two things
that's been discussed.

First, preempt_disable() vs migrate_disable(). We only need the
latter, but the former just preserves current behavior and I think
it's fine, we can follow up with BPF-specific bits later to optimize
and clean this up further. No big deal.

Second, whether BPF can utilize sleepable (faultable) tracepoints
right now with these changes. No, we need a bit more work (again, in
BPF specific parts) to allow faultable tracepoint attachment for BPF
programs. But it's a bit nuanced piece of code to get everything
right, and it's best done by someone more familiar with BPF internals.
So I wouldn't expect Mathieu to do this either.

So, tl;dr, I think patches are fine as-is (from BPF perspective), and
we'd like to see them applied and get to bpf-next for further
development on top of that.

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace
  2024-10-04 20:04                   ` Andrii Nakryiko
@ 2024-10-04 20:59                     ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2024-10-04 20:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mathieu Desnoyers, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Namhyung Kim, bpf, Joel Fernandes, linux-trace-kernel,
	Michael Jeanson

On Fri, 4 Oct 2024 13:04:21 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> > AFAIU eBPF folks are very eager to start making use of this, so we won't
> > have to wait long.  
> 
> I already gave my ack on BPF parts of this patch set, but I'll
> elaborate a bit more here for the record. There seems to be two things
> that's been discussed.
> 
> First, preempt_disable() vs migrate_disable(). We only need the
> latter, but the former just preserves current behavior and I think
> it's fine, we can follow up with BPF-specific bits later to optimize
> and clean this up further. No big deal.
> 
> Second, whether BPF can utilize sleepable (faultable) tracepoints
> right now with these changes. No, we need a bit more work (again, in
> BPF specific parts) to allow faultable tracepoint attachment for BPF
> programs. But it's a bit nuanced piece of code to get everything
> right, and it's best done by someone more familiar with BPF internals.
> So I wouldn't expect Mathieu to do this either.
> 
> So, tl;dr, I think patches are fine as-is (from BPF perspective), and
> we'd like to see them applied and get to bpf-next for further
> development on top of that.

Thanks Andrii for elaborating.

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2024-10-04 20:58 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-03 15:16 [PATCH v1 0/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
2024-10-03 15:16 ` [PATCH v1 1/8] tracing: Declare system call tracepoints with TRACE_EVENT_SYSCALL Mathieu Desnoyers
2024-10-03 21:32   ` Steven Rostedt
2024-10-04  0:15     ` Mathieu Desnoyers
2024-10-04  1:06       ` Steven Rostedt
2024-10-04  1:34         ` Mathieu Desnoyers
2024-10-04 10:34   ` kernel test robot
2024-10-03 15:16 ` [PATCH v1 2/8] tracing/ftrace: guard syscall probe with preempt_notrace Mathieu Desnoyers
2024-10-03 22:23   ` Steven Rostedt
2024-10-04  0:26     ` Mathieu Desnoyers
2024-10-04  1:04       ` Steven Rostedt
2024-10-04  1:33         ` Mathieu Desnoyers
2024-10-04 13:26           ` Steven Rostedt
2024-10-04 14:18             ` Mathieu Desnoyers
2024-10-04 14:45               ` Steven Rostedt
2024-10-04 14:19             ` Mathieu Desnoyers
2024-10-04 14:52               ` Steven Rostedt
2024-10-04 14:51                 ` Mathieu Desnoyers
2024-10-04 20:04                   ` Andrii Nakryiko
2024-10-04 20:59                     ` Steven Rostedt
2024-10-03 15:16 ` [PATCH v1 3/8] tracing/perf: " Mathieu Desnoyers
2024-10-03 22:25   ` Steven Rostedt
2024-10-04  0:17     ` Frederic Weisbecker
2024-10-03 15:16 ` [PATCH v1 4/8] tracing/bpf: " Mathieu Desnoyers
2024-10-03 22:26   ` Steven Rostedt
2024-10-03 23:05     ` Alexei Starovoitov
2024-10-04  0:30       ` Mathieu Desnoyers
2024-10-04  1:28         ` Mathieu Desnoyers
2024-10-03 15:16 ` [PATCH v1 5/8] tracing: Allow system call tracepoints to handle page faults Mathieu Desnoyers
2024-10-03 22:29   ` Steven Rostedt
2024-10-04  0:35     ` Mathieu Desnoyers
2024-10-03 15:16 ` [PATCH v1 6/8] tracing/ftrace: Add might_fault check to syscall probes Mathieu Desnoyers
2024-10-03 22:36   ` Steven Rostedt
2024-10-04  0:11     ` Mathieu Desnoyers
2024-10-03 15:16 ` [PATCH v1 7/8] tracing/perf: " Mathieu Desnoyers
2024-10-03 22:37   ` Steven Rostedt
2024-10-04  0:12     ` Mathieu Desnoyers
2024-10-03 15:16 ` [PATCH v1 8/8] tracing/bpf: " Mathieu Desnoyers
2024-10-03 22:38   ` Steven Rostedt
2024-10-04  0:13     ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).