* [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-26 15:46 [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Mathieu Desnoyers
@ 2024-10-26 15:46 ` Mathieu Desnoyers
2024-10-27 0:08 ` Steven Rostedt
2024-10-26 15:46 ` [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free Mathieu Desnoyers
2024-10-28 1:22 ` [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Andrii Nakryiko
2 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-26 15:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Mathieu Desnoyers, Michael Jeanson,
Masami Hiramatsu, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife
Introduce a "syscall" flag within the extended structure to know whether
a tracepoint needs rcu tasks trace grace period before reclaim.
This can be queried using tracepoint_is_syscall().
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jordan Rife <jrife@google.com>
---
include/linux/tracepoint-defs.h | 2 ++
include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
include/trace/define_trace.h | 2 +-
3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 967c08d9da84..53119e074c87 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -32,6 +32,8 @@ struct tracepoint_func {
struct tracepoint_ext {
int (*regfunc)(void);
void (*unregfunc)(void);
+ /* Flags. */
+ unsigned int syscall:1;
};
struct tracepoint {
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 83dc24ee8b13..93e70bc64533 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -104,6 +104,12 @@ void for_each_tracepoint_in_module(struct module *mod,
* tracepoint_synchronize_unregister must be called between the last tracepoint
* probe unregistration and the end of module exit to make sure there is no
* caller executing a probe when it is freed.
+ *
+ * An alternative is to use the following for batch reclaim associated
+ * with a given tracepoint:
+ *
+ * - tracepoint_is_syscall() == false: call_rcu()
+ * - tracepoint_is_syscall() == true: call_rcu_tasks_trace()
*/
#ifdef CONFIG_TRACEPOINTS
static inline void tracepoint_synchronize_unregister(void)
@@ -111,9 +117,17 @@ static inline void tracepoint_synchronize_unregister(void)
synchronize_rcu_tasks_trace();
synchronize_rcu();
}
+static inline bool tracepoint_is_syscall(struct tracepoint *tp)
+{
+ return tp->ext && tp->ext->syscall;
+}
#else
static inline void tracepoint_synchronize_unregister(void)
{ }
+static inline bool tracepoint_is_syscall(struct tracepoint *tp)
+{
+ return false;
+}
#endif
#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
@@ -345,6 +359,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
struct tracepoint_ext __tracepoint_ext_##_name = { \
.regfunc = _reg, \
.unregfunc = _unreg, \
+ .syscall = false, \
+ }; \
+ __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
+
+#define DEFINE_TRACE_SYSCALL(_name, _reg, _unreg, _proto, _args) \
+ struct tracepoint_ext __tracepoint_ext_##_name = { \
+ .regfunc = _reg, \
+ .unregfunc = _unreg, \
+ .syscall = true, \
}; \
__DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
@@ -389,6 +412,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define __DECLARE_TRACE_SYSCALL __DECLARE_TRACE
#define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
+#define DEFINE_TRACE_SYSCALL(name, reg, unreg, proto, args)
#define DEFINE_TRACE(name, proto, args)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
#define EXPORT_TRACEPOINT_SYMBOL(name)
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index ff5fa17a6259..63fea2218afa 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -48,7 +48,7 @@
#undef TRACE_EVENT_SYSCALL
#define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
+ DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
#undef TRACE_EVENT_NOP
#define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
--
2.39.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-26 15:46 ` [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall() Mathieu Desnoyers
@ 2024-10-27 0:08 ` Steven Rostedt
2024-10-27 12:30 ` Mathieu Desnoyers
2024-10-27 14:19 ` Masami Hiramatsu
0 siblings, 2 replies; 16+ messages in thread
From: Steven Rostedt @ 2024-10-27 0:08 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, Michael Jeanson, Masami Hiramatsu, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife
On Sat, 26 Oct 2024 11:46:28 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Introduce a "syscall" flag within the extended structure to know whether
> a tracepoint needs rcu tasks trace grace period before reclaim.
> This can be queried using tracepoint_is_syscall().
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Jordan Rife <jrife@google.com>
> ---
> include/linux/tracepoint-defs.h | 2 ++
> include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
> include/trace/define_trace.h | 2 +-
> 3 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index 967c08d9da84..53119e074c87 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -32,6 +32,8 @@ struct tracepoint_func {
> struct tracepoint_ext {
> int (*regfunc)(void);
> void (*unregfunc)(void);
> + /* Flags. */
> + unsigned int syscall:1;
I wonder if we should call it "sleepable" instead? For this patch set
do we really care if it's a system call or not? It's really if the
tracepoint is sleepable or not that's the issue. System calls are just
one user of it, there may be more in the future, and the changes to BPF
will still be needed.
Other than that, I think this could work.
-- Steve
> };
>
> struct tracepoint {
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 83dc24ee8b13..93e70bc64533 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -104,6 +104,12 @@ void for_each_tracepoint_in_module(struct module *mod,
> * tracepoint_synchronize_unregister must be called between the last tracepoint
> * probe unregistration and the end of module exit to make sure there is no
> * caller executing a probe when it is freed.
> + *
> + * An alternative is to use the following for batch reclaim associated
> + * with a given tracepoint:
> + *
> + * - tracepoint_is_syscall() == false: call_rcu()
> + * - tracepoint_is_syscall() == true: call_rcu_tasks_trace()
> */
> #ifdef CONFIG_TRACEPOINTS
> static inline void tracepoint_synchronize_unregister(void)
> @@ -111,9 +117,17 @@ static inline void tracepoint_synchronize_unregister(void)
> synchronize_rcu_tasks_trace();
> synchronize_rcu();
> }
> +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> +{
> + return tp->ext && tp->ext->syscall;
> +}
> #else
> static inline void tracepoint_synchronize_unregister(void)
> { }
> +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> +{
> + return false;
> +}
> #endif
>
> #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> @@ -345,6 +359,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> struct tracepoint_ext __tracepoint_ext_##_name = { \
> .regfunc = _reg, \
> .unregfunc = _unreg, \
> + .syscall = false, \
> + }; \
> + __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> +
> +#define DEFINE_TRACE_SYSCALL(_name, _reg, _unreg, _proto, _args) \
> + struct tracepoint_ext __tracepoint_ext_##_name = { \
> + .regfunc = _reg, \
> + .unregfunc = _unreg, \
> + .syscall = true, \
> }; \
> __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
>
> @@ -389,6 +412,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> #define __DECLARE_TRACE_SYSCALL __DECLARE_TRACE
>
> #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
> +#define DEFINE_TRACE_SYSCALL(name, reg, unreg, proto, args)
> #define DEFINE_TRACE(name, proto, args)
> #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
> #define EXPORT_TRACEPOINT_SYMBOL(name)
> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> index ff5fa17a6259..63fea2218afa 100644
> --- a/include/trace/define_trace.h
> +++ b/include/trace/define_trace.h
> @@ -48,7 +48,7 @@
>
> #undef TRACE_EVENT_SYSCALL
> #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
> - DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
> + DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
>
> #undef TRACE_EVENT_NOP
> #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-27 0:08 ` Steven Rostedt
@ 2024-10-27 12:30 ` Mathieu Desnoyers
2024-10-28 5:06 ` Steven Rostedt
2024-10-27 14:19 ` Masami Hiramatsu
1 sibling, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-27 12:30 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Michael Jeanson, Masami Hiramatsu, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife,
Linus Torvalds
On 2024-10-26 20:08, Steven Rostedt wrote:
> On Sat, 26 Oct 2024 11:46:28 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Introduce a "syscall" flag within the extended structure to know whether
>> a tracepoint needs rcu tasks trace grace period before reclaim.
>> This can be queried using tracepoint_is_syscall().
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Michael Jeanson <mjeanson@efficios.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Cc: Jordan Rife <jrife@google.com>
>> ---
>> include/linux/tracepoint-defs.h | 2 ++
>> include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
>> include/trace/define_trace.h | 2 +-
>> 3 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
>> index 967c08d9da84..53119e074c87 100644
>> --- a/include/linux/tracepoint-defs.h
>> +++ b/include/linux/tracepoint-defs.h
>> @@ -32,6 +32,8 @@ struct tracepoint_func {
>> struct tracepoint_ext {
>> int (*regfunc)(void);
>> void (*unregfunc)(void);
>> + /* Flags. */
>> + unsigned int syscall:1;
>
> I wonder if we should call it "sleepable" instead? For this patch set
> do we really care if it's a system call or not? It's really if the
> tracepoint is sleepable or not that's the issue. System calls are just
> one user of it, there may be more in the future, and the changes to BPF
> will still be needed.
Remember that syscall tracepoint probes are allowed to handle page
faults, but should not generally block, otherwise it would postpone the
grace periods of all RCU tasks trace users.
So naming this "sleepable" would be misleading, because probes are
not allowed general blocking, just to handle page faults.
If we look at the history of this tracepoint feature, we went with
the following naming over the various versions of the patch series:
1) Sleepable tracepoints: until we understood that we just want to
allow page fault, not general sleeping, so we needed to change
the name,
2) Faultable tracepoints: until Linus requested that we aim for
something that is specific to system calls, rather than a generic
thing.
https://lore.kernel.org/lkml/CAHk-=wggDLDeTKbhb5hh--x=-DQd69v41137M72m6NOTmbD-cw@mail.gmail.com/
3) Syscall tracepoints: This is what we currently have.
> Other than that, I think this could work.
Calling this field "sleepable" would be misleading. Calling it "faultable"
would be a better fit, but based on Linus' request, I'm tempted to stick
with "syscall" for now.
Your concern is to name this in a way that is general and future-proof.
Linus' point was to make it syscall-specific rather than general. My
position is that we should wait until we face other use-cases (if we
even do) before consider changing the naming from "syscall" to something
more generic.
Thanks,
Mathieu
>
> -- Steve
>
>
>> };
>>
>> struct tracepoint {
>> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
>> index 83dc24ee8b13..93e70bc64533 100644
>> --- a/include/linux/tracepoint.h
>> +++ b/include/linux/tracepoint.h
>> @@ -104,6 +104,12 @@ void for_each_tracepoint_in_module(struct module *mod,
>> * tracepoint_synchronize_unregister must be called between the last tracepoint
>> * probe unregistration and the end of module exit to make sure there is no
>> * caller executing a probe when it is freed.
>> + *
>> + * An alternative is to use the following for batch reclaim associated
>> + * with a given tracepoint:
>> + *
>> + * - tracepoint_is_syscall() == false: call_rcu()
>> + * - tracepoint_is_syscall() == true: call_rcu_tasks_trace()
>> */
>> #ifdef CONFIG_TRACEPOINTS
>> static inline void tracepoint_synchronize_unregister(void)
>> @@ -111,9 +117,17 @@ static inline void tracepoint_synchronize_unregister(void)
>> synchronize_rcu_tasks_trace();
>> synchronize_rcu();
>> }
>> +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
>> +{
>> + return tp->ext && tp->ext->syscall;
>> +}
>> #else
>> static inline void tracepoint_synchronize_unregister(void)
>> { }
>> +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
>> +{
>> + return false;
>> +}
>> #endif
>>
>> #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
>> @@ -345,6 +359,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>> struct tracepoint_ext __tracepoint_ext_##_name = { \
>> .regfunc = _reg, \
>> .unregfunc = _unreg, \
>> + .syscall = false, \
>> + }; \
>> + __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
>> +
>> +#define DEFINE_TRACE_SYSCALL(_name, _reg, _unreg, _proto, _args) \
>> + struct tracepoint_ext __tracepoint_ext_##_name = { \
>> + .regfunc = _reg, \
>> + .unregfunc = _unreg, \
>> + .syscall = true, \
>> }; \
>> __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
>>
>> @@ -389,6 +412,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>> #define __DECLARE_TRACE_SYSCALL __DECLARE_TRACE
>>
>> #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
>> +#define DEFINE_TRACE_SYSCALL(name, reg, unreg, proto, args)
>> #define DEFINE_TRACE(name, proto, args)
>> #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
>> #define EXPORT_TRACEPOINT_SYMBOL(name)
>> diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
>> index ff5fa17a6259..63fea2218afa 100644
>> --- a/include/trace/define_trace.h
>> +++ b/include/trace/define_trace.h
>> @@ -48,7 +48,7 @@
>>
>> #undef TRACE_EVENT_SYSCALL
>> #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
>> - DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
>> + DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
>>
>> #undef TRACE_EVENT_NOP
>> #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-27 12:30 ` Mathieu Desnoyers
@ 2024-10-28 5:06 ` Steven Rostedt
2024-10-28 13:35 ` Mathieu Desnoyers
0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2024-10-28 5:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, Michael Jeanson, Masami Hiramatsu, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife,
Linus Torvalds
On Sun, 27 Oct 2024 08:30:54 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > I wonder if we should call it "sleepable" instead? For this patch set
> > do we really care if it's a system call or not? It's really if the
> > tracepoint is sleepable or not that's the issue. System calls are just
> > one user of it, there may be more in the future, and the changes to BPF
> > will still be needed.
>
> Remember that syscall tracepoint probes are allowed to handle page
> faults, but should not generally block, otherwise it would postpone the
> grace periods of all RCU tasks trace users.
>
> So naming this "sleepable" would be misleading, because probes are
> not allowed general blocking, just to handle page faults.
I'm fine with "faultable" too.
>
> If we look at the history of this tracepoint feature, we went with
> the following naming over the various versions of the patch series:
>
> 1) Sleepable tracepoints: until we understood that we just want to
> allow page fault, not general sleeping, so we needed to change
> the name,
>
> 2) Faultable tracepoints: until Linus requested that we aim for
> something that is specific to system calls, rather than a generic
> thing.
>
> https://lore.kernel.org/lkml/CAHk-=wggDLDeTKbhb5hh--x=-DQd69v41137M72m6NOTmbD-cw@mail.gmail.com/
Reading that thread again, I believe that Linus was talking more about
all the infrastructure going around to make a special "faultable"
tracepoint (I could be wrong, and Linus may correct me here). When in
fact, the only user is system calls. But from the BPF POV, it doesn't
care if it's a system call, it cares that it is faultable, and the
check should be on that. Having BPF check if it's a system call is
requiring that BPF knows the implementation details of system call
tracepoints. But if it knows it is faultable, then it needs to do
something special.
>
> 3) Syscall tracepoints: This is what we currently have.
>
> > Other than that, I think this could work.
>
> Calling this field "sleepable" would be misleading. Calling it "faultable"
> would be a better fit, but based on Linus' request, I'm tempted to stick
> with "syscall" for now.
>
> Your concern is to name this in a way that is general and future-proof.
> Linus' point was to make it syscall-specific rather than general. My
> position is that we should wait until we face other use-cases (if we
> even do) before consider changing the naming from "syscall" to something
> more generic.
Yes, but that was for the infrastructure itself. It really doesnt' make
sense that BPF needs to know which type of tracepoint can fault. That's
telling BPF, you need to know the implementation of this type of
tracepoint.
-- Steve
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-28 5:06 ` Steven Rostedt
@ 2024-10-28 13:35 ` Mathieu Desnoyers
0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-28 13:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Michael Jeanson, Masami Hiramatsu, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife,
Linus Torvalds
On 2024-10-28 01:06, Steven Rostedt wrote:
> On Sun, 27 Oct 2024 08:30:54 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> I wonder if we should call it "sleepable" instead? For this patch set
>>> do we really care if it's a system call or not? It's really if the
>>> tracepoint is sleepable or not that's the issue. System calls are just
>>> one user of it, there may be more in the future, and the changes to BPF
>>> will still be needed.
>>
>> Remember that syscall tracepoint probes are allowed to handle page
>> faults, but should not generally block, otherwise it would postpone the
>> grace periods of all RCU tasks trace users.
>>
>> So naming this "sleepable" would be misleading, because probes are
>> not allowed general blocking, just to handle page faults.
>
> I'm fine with "faultable" too.
>
>>
>> If we look at the history of this tracepoint feature, we went with
>> the following naming over the various versions of the patch series:
>>
>> 1) Sleepable tracepoints: until we understood that we just want to
>> allow page fault, not general sleeping, so we needed to change
>> the name,
>>
>> 2) Faultable tracepoints: until Linus requested that we aim for
>> something that is specific to system calls, rather than a generic
>> thing.
>>
>> https://lore.kernel.org/lkml/CAHk-=wggDLDeTKbhb5hh--x=-DQd69v41137M72m6NOTmbD-cw@mail.gmail.com/
>
> Reading that thread again, I believe that Linus was talking more about
> all the infrastructure going around to make a special "faultable"
> tracepoint (I could be wrong, and Linus may correct me here). When in
> fact, the only user is system calls. But from the BPF POV, it doesn't
> care if it's a system call, it cares that it is faultable, and the
> check should be on that. Having BPF check if it's a system call is
> requiring that BPF knows the implementation details of system call
> tracepoints. But if it knows it is faultable, then it needs to do
> something special.
It might just be that, indeed. Considering the overwhelming preference
for something a little more general (sleepable/faultable vs syscall),
I am tempted to go for "tracepoint_is_faultable()".
>
>>
>> 3) Syscall tracepoints: This is what we currently have.
>>
>>> Other than that, I think this could work.
>>
>> Calling this field "sleepable" would be misleading. Calling it "faultable"
>> would be a better fit, but based on Linus' request, I'm tempted to stick
>> with "syscall" for now.
>>
>> Your concern is to name this in a way that is general and future-proof.
>> Linus' point was to make it syscall-specific rather than general. My
>> position is that we should wait until we face other use-cases (if we
>> even do) before consider changing the naming from "syscall" to something
>> more generic.
>
> Yes, but that was for the infrastructure itself. It really doesnt' make
> sense that BPF needs to know which type of tracepoint can fault. That's
> telling BPF, you need to know the implementation of this type of
> tracepoint.
OK, I'll use tracepoint_is_faultable() and a "faultable" field name, and
see how people react. I really prefer "faultable" to "sleepable" here,
because I envision that in the future we may introduce tracepoints
which are really able to sleep (general blocking), for instance though
use of hazard pointers to protect a list iteration. (if there is ever
a need for it)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-27 0:08 ` Steven Rostedt
2024-10-27 12:30 ` Mathieu Desnoyers
@ 2024-10-27 14:19 ` Masami Hiramatsu
2024-10-28 1:23 ` Andrii Nakryiko
1 sibling, 1 reply; 16+ messages in thread
From: Masami Hiramatsu @ 2024-10-27 14:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, linux-kernel, Michael Jeanson,
Masami Hiramatsu, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife
On Sat, 26 Oct 2024 20:08:40 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sat, 26 Oct 2024 11:46:28 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > Introduce a "syscall" flag within the extended structure to know whether
> > a tracepoint needs rcu tasks trace grace period before reclaim.
> > This can be queried using tracepoint_is_syscall().
> >
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Michael Jeanson <mjeanson@efficios.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Yonghong Song <yhs@fb.com>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Cc: bpf@vger.kernel.org
> > Cc: Joel Fernandes <joel@joelfernandes.org>
> > Cc: Jordan Rife <jrife@google.com>
> > ---
> > include/linux/tracepoint-defs.h | 2 ++
> > include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
> > include/trace/define_trace.h | 2 +-
> > 3 files changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> > index 967c08d9da84..53119e074c87 100644
> > --- a/include/linux/tracepoint-defs.h
> > +++ b/include/linux/tracepoint-defs.h
> > @@ -32,6 +32,8 @@ struct tracepoint_func {
> > struct tracepoint_ext {
> > int (*regfunc)(void);
> > void (*unregfunc)(void);
> > + /* Flags. */
> > + unsigned int syscall:1;
>
> I wonder if we should call it "sleepable" instead? For this patch set
> do we really care if it's a system call or not? It's really if the
> tracepoint is sleepable or not that's the issue. System calls are just
> one user of it, there may be more in the future, and the changes to BPF
> will still be needed.
I agree with this. Even if currently we restrict only syscall events
can be sleep, "tracepoint_is_syscall()" requires to add comment to
explain why on all call sites e.g.
/*
* The syscall event is only sleepable event, so we ensure it is
* syscall event for checking sleepable or not.
*/
If it called tracepoint_is_sleepable(), we don't need such comment.
Thank you,
>
> Other than that, I think this could work.
>
> -- Steve
>
>
> > };
> >
> > struct tracepoint {
> > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > index 83dc24ee8b13..93e70bc64533 100644
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -104,6 +104,12 @@ void for_each_tracepoint_in_module(struct module *mod,
> > * tracepoint_synchronize_unregister must be called between the last tracepoint
> > * probe unregistration and the end of module exit to make sure there is no
> > * caller executing a probe when it is freed.
> > + *
> > + * An alternative is to use the following for batch reclaim associated
> > + * with a given tracepoint:
> > + *
> > + * - tracepoint_is_syscall() == false: call_rcu()
> > + * - tracepoint_is_syscall() == true: call_rcu_tasks_trace()
> > */
> > #ifdef CONFIG_TRACEPOINTS
> > static inline void tracepoint_synchronize_unregister(void)
> > @@ -111,9 +117,17 @@ static inline void tracepoint_synchronize_unregister(void)
> > synchronize_rcu_tasks_trace();
> > synchronize_rcu();
> > }
> > +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> > +{
> > + return tp->ext && tp->ext->syscall;
> > +}
> > #else
> > static inline void tracepoint_synchronize_unregister(void)
> > { }
> > +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> > @@ -345,6 +359,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > struct tracepoint_ext __tracepoint_ext_##_name = { \
> > .regfunc = _reg, \
> > .unregfunc = _unreg, \
> > + .syscall = false, \
> > + }; \
> > + __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> > +
> > +#define DEFINE_TRACE_SYSCALL(_name, _reg, _unreg, _proto, _args) \
> > + struct tracepoint_ext __tracepoint_ext_##_name = { \
> > + .regfunc = _reg, \
> > + .unregfunc = _unreg, \
> > + .syscall = true, \
> > }; \
> > __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> >
> > @@ -389,6 +412,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > #define __DECLARE_TRACE_SYSCALL __DECLARE_TRACE
> >
> > #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
> > +#define DEFINE_TRACE_SYSCALL(name, reg, unreg, proto, args)
> > #define DEFINE_TRACE(name, proto, args)
> > #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
> > #define EXPORT_TRACEPOINT_SYMBOL(name)
> > diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> > index ff5fa17a6259..63fea2218afa 100644
> > --- a/include/trace/define_trace.h
> > +++ b/include/trace/define_trace.h
> > @@ -48,7 +48,7 @@
> >
> > #undef TRACE_EVENT_SYSCALL
> > #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
> > - DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
> > + DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
> >
> > #undef TRACE_EVENT_NOP
> > #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-27 14:19 ` Masami Hiramatsu
@ 2024-10-28 1:23 ` Andrii Nakryiko
2024-10-28 13:36 ` Mathieu Desnoyers
0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2024-10-28 1:23 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel, Michael Jeanson,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife
On Sun, Oct 27, 2024 at 7:19 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Sat, 26 Oct 2024 20:08:40 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Sat, 26 Oct 2024 11:46:28 -0400
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > > Introduce a "syscall" flag within the extended structure to know whether
> > > a tracepoint needs rcu tasks trace grace period before reclaim.
> > > This can be queried using tracepoint_is_syscall().
> > >
> > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > > Cc: Michael Jeanson <mjeanson@efficios.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Alexei Starovoitov <ast@kernel.org>
> > > Cc: Yonghong Song <yhs@fb.com>
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > Cc: Mark Rutland <mark.rutland@arm.com>
> > > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > > Cc: bpf@vger.kernel.org
> > > Cc: Joel Fernandes <joel@joelfernandes.org>
> > > Cc: Jordan Rife <jrife@google.com>
> > > ---
> > > include/linux/tracepoint-defs.h | 2 ++
> > > include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
> > > include/trace/define_trace.h | 2 +-
> > > 3 files changed, 27 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> > > index 967c08d9da84..53119e074c87 100644
> > > --- a/include/linux/tracepoint-defs.h
> > > +++ b/include/linux/tracepoint-defs.h
> > > @@ -32,6 +32,8 @@ struct tracepoint_func {
> > > struct tracepoint_ext {
> > > int (*regfunc)(void);
> > > void (*unregfunc)(void);
> > > + /* Flags. */
> > > + unsigned int syscall:1;
> >
> > I wonder if we should call it "sleepable" instead? For this patch set
> > do we really care if it's a system call or not? It's really if the
> > tracepoint is sleepable or not that's the issue. System calls are just
> > one user of it, there may be more in the future, and the changes to BPF
> > will still be needed.
>
> I agree with this. Even if currently we restrict only syscall events
> can be sleep, "tracepoint_is_syscall()" requires to add comment to
> explain why on all call sites e.g.
>
+1 to naming this "sleepable" (or at least "faultable"). BPF world
uses "sleepable BPF" terminology for BPF programs and attachment hooks
that can take page fault (and wait/sleep waiting for those to be
handled), so this would be consistent with that. Also, from BPF
standpoint this will be advertised as attaching to sleepable
tracepoints regardless, so "syscall" terminology is too specific and
misleading, because while current set of tracepoints are
syscall-specific, the important part is taking page fault, no tracing
syscalls.
> /*
> * The syscall event is only sleepable event, so we ensure it is
> * syscall event for checking sleepable or not.
> */
>
> If it called tracepoint_is_sleepable(), we don't need such comment.
>
> Thank you,
>
> >
> > Other than that, I think this could work.
> >
> > -- Steve
> >
> >
> > > };
> > >
> > > struct tracepoint {
> > > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > > index 83dc24ee8b13..93e70bc64533 100644
> > > --- a/include/linux/tracepoint.h
> > > +++ b/include/linux/tracepoint.h
> > > @@ -104,6 +104,12 @@ void for_each_tracepoint_in_module(struct module *mod,
> > > * tracepoint_synchronize_unregister must be called between the last tracepoint
> > > * probe unregistration and the end of module exit to make sure there is no
> > > * caller executing a probe when it is freed.
> > > + *
> > > + * An alternative is to use the following for batch reclaim associated
> > > + * with a given tracepoint:
> > > + *
> > > + * - tracepoint_is_syscall() == false: call_rcu()
> > > + * - tracepoint_is_syscall() == true: call_rcu_tasks_trace()
> > > */
> > > #ifdef CONFIG_TRACEPOINTS
> > > static inline void tracepoint_synchronize_unregister(void)
> > > @@ -111,9 +117,17 @@ static inline void tracepoint_synchronize_unregister(void)
> > > synchronize_rcu_tasks_trace();
> > > synchronize_rcu();
> > > }
> > > +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> > > +{
> > > + return tp->ext && tp->ext->syscall;
> > > +}
> > > #else
> > > static inline void tracepoint_synchronize_unregister(void)
> > > { }
> > > +static inline bool tracepoint_is_syscall(struct tracepoint *tp)
> > > +{
> > > + return false;
> > > +}
> > > #endif
> > >
> > > #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
> > > @@ -345,6 +359,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > > struct tracepoint_ext __tracepoint_ext_##_name = { \
> > > .regfunc = _reg, \
> > > .unregfunc = _unreg, \
> > > + .syscall = false, \
> > > + }; \
> > > + __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> > > +
> > > +#define DEFINE_TRACE_SYSCALL(_name, _reg, _unreg, _proto, _args) \
> > > + struct tracepoint_ext __tracepoint_ext_##_name = { \
> > > + .regfunc = _reg, \
> > > + .unregfunc = _unreg, \
> > > + .syscall = true, \
> > > }; \
> > > __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> > >
> > > @@ -389,6 +412,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > > #define __DECLARE_TRACE_SYSCALL __DECLARE_TRACE
> > >
> > > #define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
> > > +#define DEFINE_TRACE_SYSCALL(name, reg, unreg, proto, args)
> > > #define DEFINE_TRACE(name, proto, args)
> > > #define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
> > > #define EXPORT_TRACEPOINT_SYMBOL(name)
> > > diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
> > > index ff5fa17a6259..63fea2218afa 100644
> > > --- a/include/trace/define_trace.h
> > > +++ b/include/trace/define_trace.h
> > > @@ -48,7 +48,7 @@
> > >
> > > #undef TRACE_EVENT_SYSCALL
> > > #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
> > > - DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
> > > + DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
> > >
> > > #undef TRACE_EVENT_NOP
> > > #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall()
2024-10-28 1:23 ` Andrii Nakryiko
@ 2024-10-28 13:36 ` Mathieu Desnoyers
0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-28 13:36 UTC (permalink / raw)
To: Andrii Nakryiko, Masami Hiramatsu
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Peter Zijlstra,
Alexei Starovoitov, Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, bpf, Joel Fernandes, Jordan Rife
On 2024-10-27 21:23, Andrii Nakryiko wrote:
> On Sun, Oct 27, 2024 at 7:19 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
[...]
>>>> include/linux/tracepoint-defs.h | 2 ++
>>>> include/linux/tracepoint.h | 24 ++++++++++++++++++++++++
>>>> include/trace/define_trace.h | 2 +-
>>>> 3 files changed, 27 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
>>>> index 967c08d9da84..53119e074c87 100644
>>>> --- a/include/linux/tracepoint-defs.h
>>>> +++ b/include/linux/tracepoint-defs.h
>>>> @@ -32,6 +32,8 @@ struct tracepoint_func {
>>>> struct tracepoint_ext {
>>>> int (*regfunc)(void);
>>>> void (*unregfunc)(void);
>>>> + /* Flags. */
>>>> + unsigned int syscall:1;
>>>
>>> I wonder if we should call it "sleepable" instead? For this patch set
>>> do we really care if it's a system call or not? It's really if the
>>> tracepoint is sleepable or not that's the issue. System calls are just
>>> one user of it, there may be more in the future, and the changes to BPF
>>> will still be needed.
>>
>> I agree with this. Even if currently we restrict only syscall events
>> can be sleep, "tracepoint_is_syscall()" requires to add comment to
>> explain why on all call sites e.g.
>>
>
> +1 to naming this "sleepable" (or at least "faultable"). BPF world
> uses "sleepable BPF" terminology for BPF programs and attachment hooks
> that can take page fault (and wait/sleep waiting for those to be
> handled), so this would be consistent with that. Also, from BPF
> standpoint this will be advertised as attaching to sleepable
> tracepoints regardless, so "syscall" terminology is too specific and
> misleading, because while current set of tracepoints are
> syscall-specific, the important part is taking page fault, no tracing
> syscalls.
+1 for "faultable".
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free
2024-10-26 15:46 [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Mathieu Desnoyers
2024-10-26 15:46 ` [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall() Mathieu Desnoyers
@ 2024-10-26 15:46 ` Mathieu Desnoyers
2024-10-28 1:22 ` Andrii Nakryiko
2024-10-28 1:22 ` [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Andrii Nakryiko
2 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-26 15:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Mathieu Desnoyers, Michael Jeanson,
Masami Hiramatsu, Peter Zijlstra, Alexei Starovoitov,
Yonghong Song, Paul E . McKenney, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Namhyung Kim, Andrii Nakryiko, bpf, Joel Fernandes, Jordan Rife,
syzbot+b390c8062d8387b6272a
The grace period used internally within tracepoint.c:release_probes()
uses call_rcu() to batch waiting for quiescence of old probe arrays,
rather than using the tracepoint_synchronize_unregister() which blocks
while waiting for quiescence.
With the introduction of faultable syscall tracepoints, this causes
use-after-free issues reproduced with syzkaller.
Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
before invoking the rcu_free_old_probes callback. This can be chosen
using the tracepoint_is_syscall() API.
A similar issue exists in bpf use of call_rcu(). Fixing this is left to
a separate change.
Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jordan Rife <jrife@google.com>
---
Changes since v0:
- Introduce tracepoint_call_rcu(),
- Fix bpf_link_free() use of call_rcu as well.
Changes since v1:
- Use tracepoint_call_rcu() for bpf_prog_put as well.
Changes since v2:
- Do not cover bpf changes in the same commit, let bpf developers
implement it.
---
kernel/tracepoint.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 5658dc92f5b5..47569fb06596 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head *head)
kfree(container_of(head, struct tp_probes, rcu));
}
-static inline void release_probes(struct tracepoint_func *old)
+static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
{
if (old) {
struct tp_probes *tp_probes = container_of(old,
struct tp_probes, probes[0]);
- call_rcu(&tp_probes->rcu, rcu_free_old_probes);
+ if (tracepoint_is_syscall(tp))
+ call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
+ else
+ call_rcu(&tp_probes->rcu, rcu_free_old_probes);
}
}
@@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint *tp,
break;
}
- release_probes(old);
+ release_probes(tp, old);
return 0;
}
@@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
WARN_ON_ONCE(1);
break;
}
- release_probes(old);
+ release_probes(tp, old);
return 0;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free
2024-10-26 15:46 ` [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free Mathieu Desnoyers
@ 2024-10-28 1:22 ` Andrii Nakryiko
2024-10-28 19:19 ` Mathieu Desnoyers
0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2024-10-28 1:22 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife, syzbot+b390c8062d8387b6272a
On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> The grace period used internally within tracepoint.c:release_probes()
> uses call_rcu() to batch waiting for quiescence of old probe arrays,
> rather than using the tracepoint_synchronize_unregister() which blocks
> while waiting for quiescence.
>
> With the introduction of faultable syscall tracepoints, this causes
> use-after-free issues reproduced with syzkaller.
>
> Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
> before invoking the rcu_free_old_probes callback. This can be chosen
> using the tracepoint_is_syscall() API.
>
> A similar issue exists in bpf use of call_rcu(). Fixing this is left to
> a separate change.
>
> Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Jordan Rife <jrife@google.com>
> ---
> Changes since v0:
> - Introduce tracepoint_call_rcu(),
> - Fix bpf_link_free() use of call_rcu as well.
>
> Changes since v1:
> - Use tracepoint_call_rcu() for bpf_prog_put as well.
>
> Changes since v2:
> - Do not cover bpf changes in the same commit, let bpf developers
> implement it.
> ---
> kernel/tracepoint.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 5658dc92f5b5..47569fb06596 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head *head)
> kfree(container_of(head, struct tp_probes, rcu));
> }
>
> -static inline void release_probes(struct tracepoint_func *old)
> +static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
> {
> if (old) {
> struct tp_probes *tp_probes = container_of(old,
> struct tp_probes, probes[0]);
>
> - call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> + if (tracepoint_is_syscall(tp))
> + call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
hardcoded right now to returning true), this might not always be the
case in the future, so it's best to have a guarantee that regardless
of sleepable or not, we'll always have have RCU GP, and for sleepable
tracepoint *also* RCU Tasks Trace GP.
> + else
> + call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> }
> }
>
> @@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint *tp,
> break;
> }
>
> - release_probes(old);
> + release_probes(tp, old);
> return 0;
> }
>
> @@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> WARN_ON_ONCE(1);
> break;
> }
> - release_probes(old);
> + release_probes(tp, old);
> return 0;
> }
>
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free
2024-10-28 1:22 ` Andrii Nakryiko
@ 2024-10-28 19:19 ` Mathieu Desnoyers
2024-10-31 15:43 ` Mathieu Desnoyers
0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-28 19:19 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife, syzbot+b390c8062d8387b6272a
On 2024-10-27 21:22, Andrii Nakryiko wrote:
> On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> The grace period used internally within tracepoint.c:release_probes()
>> uses call_rcu() to batch waiting for quiescence of old probe arrays,
>> rather than using the tracepoint_synchronize_unregister() which blocks
>> while waiting for quiescence.
>>
>> With the introduction of faultable syscall tracepoints, this causes
>> use-after-free issues reproduced with syzkaller.
>>
>> Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
>> before invoking the rcu_free_old_probes callback. This can be chosen
>> using the tracepoint_is_syscall() API.
>>
>> A similar issue exists in bpf use of call_rcu(). Fixing this is left to
>> a separate change.
>>
>> Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
>> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Michael Jeanson <mjeanson@efficios.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>> Cc: Namhyung Kim <namhyung@kernel.org>
>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> Cc: bpf@vger.kernel.org
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Cc: Jordan Rife <jrife@google.com>
>> ---
>> Changes since v0:
>> - Introduce tracepoint_call_rcu(),
>> - Fix bpf_link_free() use of call_rcu as well.
>>
>> Changes since v1:
>> - Use tracepoint_call_rcu() for bpf_prog_put as well.
>>
>> Changes since v2:
>> - Do not cover bpf changes in the same commit, let bpf developers
>> implement it.
>> ---
>> kernel/tracepoint.c | 11 +++++++----
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
>> index 5658dc92f5b5..47569fb06596 100644
>> --- a/kernel/tracepoint.c
>> +++ b/kernel/tracepoint.c
>> @@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head *head)
>> kfree(container_of(head, struct tp_probes, rcu));
>> }
>>
>> -static inline void release_probes(struct tracepoint_func *old)
>> +static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
>> {
>> if (old) {
>> struct tp_probes *tp_probes = container_of(old,
>> struct tp_probes, probes[0]);
>>
>> - call_rcu(&tp_probes->rcu, rcu_free_old_probes);
>> + if (tracepoint_is_syscall(tp))
>> + call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
>
> should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
> just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
> implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
> hardcoded right now to returning true), this might not always be the
> case in the future, so it's best to have a guarantee that regardless
> of sleepable or not, we'll always have have RCU GP, and for sleepable
> tracepoint *also* RCU Tasks Trace GP.
Given that faultable tracepoints only use RCU tasks trace for the
read-side and do not rely on preempt disable, I don't see why we would
need to chain both grace periods there ?
Thanks,
Mathieu
>
>> + else
>> + call_rcu(&tp_probes->rcu, rcu_free_old_probes);
>> }
>> }
>>
>> @@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint *tp,
>> break;
>> }
>>
>> - release_probes(old);
>> + release_probes(tp, old);
>> return 0;
>> }
>>
>> @@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
>> WARN_ON_ONCE(1);
>> break;
>> }
>> - release_probes(old);
>> + release_probes(tp, old);
>> return 0;
>> }
>>
>> --
>> 2.39.5
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free
2024-10-28 19:19 ` Mathieu Desnoyers
@ 2024-10-31 15:43 ` Mathieu Desnoyers
2024-10-31 16:35 ` Andrii Nakryiko
0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-31 15:43 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife, syzbot+b390c8062d8387b6272a
On 2024-10-28 15:19, Mathieu Desnoyers wrote:
> On 2024-10-27 21:22, Andrii Nakryiko wrote:
>> On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> The grace period used internally within tracepoint.c:release_probes()
>>> uses call_rcu() to batch waiting for quiescence of old probe arrays,
>>> rather than using the tracepoint_synchronize_unregister() which blocks
>>> while waiting for quiescence.
>>>
>>> With the introduction of faultable syscall tracepoints, this causes
>>> use-after-free issues reproduced with syzkaller.
>>>
>>> Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
>>> before invoking the rcu_free_old_probes callback. This can be chosen
>>> using the tracepoint_is_syscall() API.
>>>
>>> A similar issue exists in bpf use of call_rcu(). Fixing this is left to
>>> a separate change.
>>>
>>> Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
>>> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to
>>> handle page faults")
>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>> Cc: Michael Jeanson <mjeanson@efficios.com>
>>> Cc: Steven Rostedt <rostedt@goodmis.org>
>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Alexei Starovoitov <ast@kernel.org>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Cc: Paul E. McKenney <paulmck@kernel.org>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>>> Cc: Mark Rutland <mark.rutland@arm.com>
>>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>>> Cc: Namhyung Kim <namhyung@kernel.org>
>>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>>> Cc: bpf@vger.kernel.org
>>> Cc: Joel Fernandes <joel@joelfernandes.org>
>>> Cc: Jordan Rife <jrife@google.com>
>>> ---
>>> Changes since v0:
>>> - Introduce tracepoint_call_rcu(),
>>> - Fix bpf_link_free() use of call_rcu as well.
>>>
>>> Changes since v1:
>>> - Use tracepoint_call_rcu() for bpf_prog_put as well.
>>>
>>> Changes since v2:
>>> - Do not cover bpf changes in the same commit, let bpf developers
>>> implement it.
>>> ---
>>> kernel/tracepoint.c | 11 +++++++----
>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
>>> index 5658dc92f5b5..47569fb06596 100644
>>> --- a/kernel/tracepoint.c
>>> +++ b/kernel/tracepoint.c
>>> @@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head
>>> *head)
>>> kfree(container_of(head, struct tp_probes, rcu));
>>> }
>>>
>>> -static inline void release_probes(struct tracepoint_func *old)
>>> +static inline void release_probes(struct tracepoint *tp, struct
>>> tracepoint_func *old)
>>> {
>>> if (old) {
>>> struct tp_probes *tp_probes = container_of(old,
>>> struct tp_probes, probes[0]);
>>>
>>> - call_rcu(&tp_probes->rcu, rcu_free_old_probes);
>>> + if (tracepoint_is_syscall(tp))
>>> + call_rcu_tasks_trace(&tp_probes->rcu,
>>> rcu_free_old_probes);
>>
>> should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
>> just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
>> implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
>> hardcoded right now to returning true), this might not always be the
>> case in the future, so it's best to have a guarantee that regardless
>> of sleepable or not, we'll always have have RCU GP, and for sleepable
>> tracepoint *also* RCU Tasks Trace GP.
>
> Given that faultable tracepoints only use RCU tasks trace for the
> read-side and do not rely on preempt disable, I don't see why we would
> need to chain both grace periods there ?
Hi Andrii,
AFAIU, your question above is rooted in the way bpf does its sleepable
program grace periods (chaining RCU tasks trace + RCU GP), e.g.:
bpf_map_free_mult_rcu_gp
bpf_link_defer_dealloc_mult_rcu_gp
and
bpf_link_free:
/* schedule BPF link deallocation; if underlying BPF program
* is sleepable, we need to first wait for RCU tasks trace
* sync, then go through "classic" RCU grace period
*/
This is introduced in commit 1a80dbcb2db ("bpf: support deferring bpf_link dealloc to after RCU grace period")
which has a bit more information in the commit message, but what I'm not seeing
is an explanation of *why* chaining RCU tasks trace and RCU grace periods is
needed for sleepable bpf programs. What am I missing ?
As far as tracepoint.c release_probes() is concerned, just waiting for
RCU tasks trace before freeing memory of faultable tracepoints is
sufficient.
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>>
>>> + else
>>> + call_rcu(&tp_probes->rcu, rcu_free_old_probes);
>>> }
>>> }
>>>
>>> @@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint
>>> *tp,
>>> break;
>>> }
>>>
>>> - release_probes(old);
>>> + release_probes(tp, old);
>>> return 0;
>>> }
>>>
>>> @@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct
>>> tracepoint *tp,
>>> WARN_ON_ONCE(1);
>>> break;
>>> }
>>> - release_probes(old);
>>> + release_probes(tp, old);
>>> return 0;
>>> }
>>>
>>> --
>>> 2.39.5
>>>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free
2024-10-31 15:43 ` Mathieu Desnoyers
@ 2024-10-31 16:35 ` Andrii Nakryiko
0 siblings, 0 replies; 16+ messages in thread
From: Andrii Nakryiko @ 2024-10-31 16:35 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife, syzbot+b390c8062d8387b6272a
On Thu, Oct 31, 2024 at 8:44 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-28 15:19, Mathieu Desnoyers wrote:
> > On 2024-10-27 21:22, Andrii Nakryiko wrote:
> >> On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
> >> <mathieu.desnoyers@efficios.com> wrote:
> >>>
> >>> The grace period used internally within tracepoint.c:release_probes()
> >>> uses call_rcu() to batch waiting for quiescence of old probe arrays,
> >>> rather than using the tracepoint_synchronize_unregister() which blocks
> >>> while waiting for quiescence.
> >>>
> >>> With the introduction of faultable syscall tracepoints, this causes
> >>> use-after-free issues reproduced with syzkaller.
> >>>
> >>> Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
> >>> before invoking the rcu_free_old_probes callback. This can be chosen
> >>> using the tracepoint_is_syscall() API.
> >>>
> >>> A similar issue exists in bpf use of call_rcu(). Fixing this is left to
> >>> a separate change.
> >>>
> >>> Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
> >>> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to
> >>> handle page faults")
> >>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>> Cc: Michael Jeanson <mjeanson@efficios.com>
> >>> Cc: Steven Rostedt <rostedt@goodmis.org>
> >>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>> Cc: Alexei Starovoitov <ast@kernel.org>
> >>> Cc: Yonghong Song <yhs@fb.com>
> >>> Cc: Paul E. McKenney <paulmck@kernel.org>
> >>> Cc: Ingo Molnar <mingo@redhat.com>
> >>> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >>> Cc: Mark Rutland <mark.rutland@arm.com>
> >>> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> >>> Cc: Namhyung Kim <namhyung@kernel.org>
> >>> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> >>> Cc: bpf@vger.kernel.org
> >>> Cc: Joel Fernandes <joel@joelfernandes.org>
> >>> Cc: Jordan Rife <jrife@google.com>
> >>> ---
> >>> Changes since v0:
> >>> - Introduce tracepoint_call_rcu(),
> >>> - Fix bpf_link_free() use of call_rcu as well.
> >>>
> >>> Changes since v1:
> >>> - Use tracepoint_call_rcu() for bpf_prog_put as well.
> >>>
> >>> Changes since v2:
> >>> - Do not cover bpf changes in the same commit, let bpf developers
> >>> implement it.
> >>> ---
> >>> kernel/tracepoint.c | 11 +++++++----
> >>> 1 file changed, 7 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> >>> index 5658dc92f5b5..47569fb06596 100644
> >>> --- a/kernel/tracepoint.c
> >>> +++ b/kernel/tracepoint.c
> >>> @@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head
> >>> *head)
> >>> kfree(container_of(head, struct tp_probes, rcu));
> >>> }
> >>>
> >>> -static inline void release_probes(struct tracepoint_func *old)
> >>> +static inline void release_probes(struct tracepoint *tp, struct
> >>> tracepoint_func *old)
> >>> {
> >>> if (old) {
> >>> struct tp_probes *tp_probes = container_of(old,
> >>> struct tp_probes, probes[0]);
> >>>
> >>> - call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> >>> + if (tracepoint_is_syscall(tp))
> >>> + call_rcu_tasks_trace(&tp_probes->rcu,
> >>> rcu_free_old_probes);
> >>
> >> should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
> >> just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
> >> implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
> >> hardcoded right now to returning true), this might not always be the
> >> case in the future, so it's best to have a guarantee that regardless
> >> of sleepable or not, we'll always have have RCU GP, and for sleepable
> >> tracepoint *also* RCU Tasks Trace GP.
> >
> > Given that faultable tracepoints only use RCU tasks trace for the
> > read-side and do not rely on preempt disable, I don't see why we would
> > need to chain both grace periods there ?
>
> Hi Andrii,
>
> AFAIU, your question above is rooted in the way bpf does its sleepable
> program grace periods (chaining RCU tasks trace + RCU GP), e.g.:
>
> bpf_map_free_mult_rcu_gp
> bpf_link_defer_dealloc_mult_rcu_gp
>
> and
>
> bpf_link_free:
> /* schedule BPF link deallocation; if underlying BPF program
> * is sleepable, we need to first wait for RCU tasks trace
> * sync, then go through "classic" RCU grace period
> */
>
> This is introduced in commit 1a80dbcb2db ("bpf: support deferring bpf_link dealloc to after RCU grace period")
> which has a bit more information in the commit message, but what I'm not seeing
> is an explanation of *why* chaining RCU tasks trace and RCU grace periods is
> needed for sleepable bpf programs. What am I missing ?
At least one of the reasons are BPF maps that can be used from both
sleepable and non-sleepable BPF programs *at the same time*. So in BPF
everything is *at least* protected with rcu_read_lock(), and then
sleepable-capable things are *additionally* supported by
rcu_read_tasks_trace(). So on destruction, we chain both RCU GP kinds
to make sure that all users can't see BPF map/prog/(and soon links).
It might not be strictly necessary in general, but you are right that
I asked because of how we do this in BPF. Also, in practice, tasks
trace RCU GP implies RCU GP, so there is no overhead for how we do
this for BPF maps (and progs? didn't check).
Anyways, this might be fine as is.
>
> As far as tracepoint.c release_probes() is concerned, just waiting for
> RCU tasks trace before freeing memory of faultable tracepoints is
> sufficient.
>
> Thanks,
>
> Mathieu
>
> >
> > Thanks,
> >
> > Mathieu
> >
> >>
> >>> + else
> >>> + call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> >>> }
> >>> }
> >>>
> >>> @@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint
> >>> *tp,
> >>> break;
> >>> }
> >>>
> >>> - release_probes(old);
> >>> + release_probes(tp, old);
> >>> return 0;
> >>> }
> >>>
> >>> @@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct
> >>> tracepoint *tp,
> >>> WARN_ON_ONCE(1);
> >>> break;
> >>> }
> >>> - release_probes(old);
> >>> + release_probes(tp, old);
> >>> return 0;
> >>> }
> >>>
> >>> --
> >>> 2.39.5
> >>>
> >
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure
2024-10-26 15:46 [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Mathieu Desnoyers
2024-10-26 15:46 ` [RFC PATCH v3 2/3] tracing: Introduce tracepoint_is_syscall() Mathieu Desnoyers
2024-10-26 15:46 ` [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free Mathieu Desnoyers
@ 2024-10-28 1:22 ` Andrii Nakryiko
2024-10-28 19:13 ` Mathieu Desnoyers
2 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2024-10-28 1:22 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife
On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Shrink the struct tracepoint size from 80 bytes to 72 bytes on x86-64 by
> moving the (typically NULL) regfunc/unregfunc pointers to an extended
> structure.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Jordan Rife <jrife@google.com>
> ---
> include/linux/tracepoint-defs.h | 8 ++++++--
> include/linux/tracepoint.h | 19 +++++++++++++------
> kernel/tracepoint.c | 9 ++++-----
> 3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index 60a6e8314d4c..967c08d9da84 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -29,6 +29,11 @@ struct tracepoint_func {
> int prio;
> };
>
> +struct tracepoint_ext {
> + int (*regfunc)(void);
> + void (*unregfunc)(void);
> +};
> +
> struct tracepoint {
> const char *name; /* Tracepoint name */
> struct static_key_false key;
> @@ -36,9 +41,8 @@ struct tracepoint {
> void *static_call_tramp;
> void *iterator;
> void *probestub;
> - int (*regfunc)(void);
> - void (*unregfunc)(void);
> struct tracepoint_func __rcu *funcs;
> + struct tracepoint_ext *ext;
> };
>
> #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 0dc67fad706c..83dc24ee8b13 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -302,7 +302,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> * structures, so we create an array of pointers that will be used for iteration
> * on the tracepoints.
> */
> -#define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
> +#define __DEFINE_TRACE_EXT(_name, _ext, proto, args) \
> static const char __tpstrtab_##_name[] \
> __section("__tracepoints_strings") = #_name; \
> extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
> @@ -316,9 +316,9 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> .static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
> .iterator = &__traceiter_##_name, \
> .probestub = &__probestub_##_name, \
> - .regfunc = _reg, \
> - .unregfunc = _unreg, \
> - .funcs = NULL }; \
> + .funcs = NULL, \
> + .ext = _ext, \
> + }; \
> __TRACEPOINT_ENTRY(_name); \
> int __traceiter_##_name(void *__data, proto) \
> { \
> @@ -341,8 +341,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> } \
> DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
>
> -#define DEFINE_TRACE(name, proto, args) \
> - DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
> +#define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args) \
> + struct tracepoint_ext __tracepoint_ext_##_name = { \
can be static, no?
> + .regfunc = _reg, \
> + .unregfunc = _unreg, \
> + }; \
> + __DEFINE_TRACE_EXT(_name, &__tracepoint_ext_##_name, PARAMS(_proto), PARAMS(_args));
> +
> +#define DEFINE_TRACE(_name, _proto, _args) \
> + __DEFINE_TRACE_EXT(_name, NULL, PARAMS(_proto), PARAMS(_args));
>
> #define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
> EXPORT_SYMBOL_GPL(__tracepoint_##name); \
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 6474e2cf22c9..5658dc92f5b5 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -278,8 +278,8 @@ static int tracepoint_add_func(struct tracepoint *tp,
> struct tracepoint_func *old, *tp_funcs;
> int ret;
>
> - if (tp->regfunc && !static_key_enabled(&tp->key)) {
> - ret = tp->regfunc();
> + if (tp->ext && tp->ext->regfunc && !static_key_enabled(&tp->key)) {
> + ret = tp->ext->regfunc();
> if (ret < 0)
> return ret;
> }
> @@ -362,9 +362,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> switch (nr_func_state(tp_funcs)) {
> case TP_FUNC_0: /* 1->0 */
> /* Removed last function */
> - if (tp->unregfunc && static_key_enabled(&tp->key))
> - tp->unregfunc();
> -
> + if (tp->ext && tp->ext->unregfunc && static_key_enabled(&tp->key))
> + tp->ext->unregfunc();
> static_branch_disable(&tp->key);
> /* Set iterator static call */
> tracepoint_update_call(tp, tp_funcs);
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure
2024-10-28 1:22 ` [RFC PATCH v3 1/3] tracing: Introduce tracepoint extended structure Andrii Nakryiko
@ 2024-10-28 19:13 ` Mathieu Desnoyers
0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2024-10-28 19:13 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, linux-kernel, Michael Jeanson, Masami Hiramatsu,
Peter Zijlstra, Alexei Starovoitov, Yonghong Song,
Paul E . McKenney, Ingo Molnar, Arnaldo Carvalho de Melo,
Mark Rutland, Alexander Shishkin, Namhyung Kim, bpf,
Joel Fernandes, Jordan Rife
On 2024-10-27 21:22, Andrii Nakryiko wrote:
>> -#define DEFINE_TRACE(name, proto, args) \
>> - DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));
>> +#define DEFINE_TRACE_FN(_name, _reg, _unreg, _proto, _args) \
>> + struct tracepoint_ext __tracepoint_ext_##_name = { \
>
> can be static, no?
Yes, sorry, I'll update the series for a v5.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 16+ messages in thread