* [PATCH v2] exit: move and extend sched_process_exit() tracepoint
@ 2025-04-02 18:09 Andrii Nakryiko
2025-04-02 18:36 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Andrii Nakryiko @ 2025-04-02 18:09 UTC (permalink / raw)
To: linux-trace-kernel, peterz, mingo
Cc: bpf, linux-kernel, kernel-team, mhocko, rostedt, oleg, brauner,
glider, mhiramat, mathieu.desnoyers, akpm, Andrii Nakryiko
It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization reasons when dealing with short-lived processes that
exit in the middle of profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm() which resets
current->mm to NULL making this tracepoint unsuitable for inspecting
and recording task's mm_struct-related data when tracing process
lifetimes.
There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a similar kind of
accounting that some applications do with BPF, and so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.
Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some
of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.
Note that we can't use sched_process_template anymore, and so we use
TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in backwards-compatible way.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
kernel/exit.c | 2 +-
2 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..05a14f2b35c3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
/*
* Tracepoint for a task exiting:
*/
-DEFINE_EVENT(sched_process_template, sched_process_exit,
- TP_PROTO(struct task_struct *p),
- TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+ TP_PROTO(struct task_struct *p, bool group_dead),
+
+ TP_ARGS(p, group_dead),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, prio )
+ __field( bool, group_dead )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+ __entry->pid = p->pid;
+ __entry->prio = p->prio; /* XXX SCHED_DEADLINE */
+ __entry->group_dead = group_dead;
+ ),
+
+ TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+ __entry->comm, __entry->pid, __entry->prio,
+ __entry->group_dead ? "true" : "false"
+ )
+);
/*
* Tracepoint for waiting on task to unschedule:
diff --git a/kernel/exit.c b/kernel/exit.c
index c2e6c7b7779f..4abd307b1586 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ trace_sched_process_exit(tsk, group_dead);
exit_mm();
if (group_dead)
acct_process();
- trace_sched_process_exit(tsk);
exit_sem(tsk);
exit_shm(tsk);
--
2.47.1
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-02 18:09 [PATCH v2] exit: move and extend sched_process_exit() tracepoint Andrii Nakryiko
@ 2025-04-02 18:36 ` Steven Rostedt
2025-04-02 19:29 ` Oleg Nesterov
2025-04-03 13:54 ` Ingo Molnar
2 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2025-04-02 18:36 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, mingo, bpf, linux-kernel, kernel-team,
mhocko, oleg, brauner, glider, mhiramat, mathieu.desnoyers, akpm
On Wed, 2 Apr 2025 11:09:25 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:
> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization reasons when dealing with short-lived processes that
> exit in the middle of profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm() which resets
> current->mm to NULL making this tracepoint unsuitable for inspecting
> and recording task's mm_struct-related data when tracing process
> lifetimes.
>
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a similar kind of
> accounting that some applications do with BPF, and so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
>
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
>
> Note that we can't use sched_process_template anymore, and so we use
> TRACE_EVENT()-based tracepoint definition. But all the field names and
> order, as well as assign and output logic remain intact. We just add one
> extra field at the end in backwards-compatible way.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
> kernel/exit.c | 2 +-
> 2 files changed, 26 insertions(+), 4 deletions(-)
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-- Steve
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-02 18:09 [PATCH v2] exit: move and extend sched_process_exit() tracepoint Andrii Nakryiko
2025-04-02 18:36 ` Steven Rostedt
@ 2025-04-02 19:29 ` Oleg Nesterov
2025-04-03 13:54 ` Ingo Molnar
2 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2025-04-02 19:29 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, mingo, bpf, linux-kernel, kernel-team,
mhocko, rostedt, brauner, glider, mhiramat, mathieu.desnoyers,
akpm
On 04/02, Andrii Nakryiko wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
>
> tsk->exit_code = code;
> taskstats_exit(tsk, group_dead);
> + trace_sched_process_exit(tsk, group_dead);
>
> exit_mm();
>
> if (group_dead)
> acct_process();
> - trace_sched_process_exit(tsk);
I see nothing wrong in this change.
(and I think that the current placement of trace_sched_process_exit() was
more or less "random").
Acked-by: Oleg Nesterov <oleg@redhat.com>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-02 18:09 [PATCH v2] exit: move and extend sched_process_exit() tracepoint Andrii Nakryiko
2025-04-02 18:36 ` Steven Rostedt
2025-04-02 19:29 ` Oleg Nesterov
@ 2025-04-03 13:54 ` Ingo Molnar
2025-04-03 15:06 ` Steven Rostedt
2 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2025-04-03 13:54 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-trace-kernel, peterz, bpf, linux-kernel, kernel-team,
mhocko, rostedt, oleg, brauner, glider, mhiramat,
mathieu.desnoyers, akpm
* Andrii Nakryiko <andrii@kernel.org> wrote:
> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization reasons when dealing with short-lived processes that
> exit in the middle of profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm() which resets
> current->mm to NULL making this tracepoint unsuitable for inspecting
> and recording task's mm_struct-related data when tracing process
> lifetimes.
>
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a similar kind of
> accounting that some applications do with BPF, and so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
>
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
>
> Note that we can't use sched_process_template anymore, and so we use
> TRACE_EVENT()-based tracepoint definition.
But all the field names and
> order, as well as assign and output logic remain intact. We just add one
> extra field at the end in backwards-compatible way.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
> include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
> kernel/exit.c | 2 +-
> 2 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c1..05a14f2b35c3 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
> /*
> * Tracepoint for a task exiting:
> */
> -DEFINE_EVENT(sched_process_template, sched_process_exit,
> - TP_PROTO(struct task_struct *p),
> - TP_ARGS(p));
> +TRACE_EVENT(sched_process_exit,
> +
> + TP_PROTO(struct task_struct *p, bool group_dead),
> +
> + TP_ARGS(p, group_dead),
> +
> + TP_STRUCT__entry(
> + __array( char, comm, TASK_COMM_LEN )
> + __field( pid_t, pid )
> + __field( int, prio )
> + __field( bool, group_dead )
> + ),
> +
> + TP_fast_assign(
> + memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> + __entry->pid = p->pid;
> + __entry->prio = p->prio; /* XXX SCHED_DEADLINE */
> + __entry->group_dead = group_dead;
> + ),
> +
> + TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
> + __entry->comm, __entry->pid, __entry->prio,
> + __entry->group_dead ? "true" : "false"
> + )
This feels really fragile, could you please at least add a comment that
points out that this is basically an extension of
sched_process_template, and that it should remain a subset of it, or
something to that end?
Thanks,
Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-03 13:54 ` Ingo Molnar
@ 2025-04-03 15:06 ` Steven Rostedt
2025-04-03 15:11 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2025-04-03 15:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, bpf, linux-kernel,
kernel-team, mhocko, oleg, brauner, glider, mhiramat,
mathieu.desnoyers, akpm
On Thu, 3 Apr 2025 15:54:22 +0200
Ingo Molnar <mingo@kernel.org> wrote:
> This feels really fragile, could you please at least add a comment that
> points out that this is basically an extension of
> sched_process_template, and that it should remain a subset of it, or
> something to that end?
Is there any dependency on this?
The reason to use the templates is because it saves memory. Each
TRACE_EVENT() can add ~5k (which a TRACE_EVENT() is really just a
DECLARE_EVENT_CLASS() + DEFINE_EVENT() for a single event).
Each DEFINE_EVENT() just adds around 250 bytes. Hence, if you have multiple
events that share the same fields and output, it's much more memory
efficient to use the CLASS and EVENT logic then making each their own
TRACE_EVENT().
I don't know of any other dependency to why this was a template other than
to save memory.
-- Steve
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-03 15:06 ` Steven Rostedt
@ 2025-04-03 15:11 ` Ingo Molnar
2025-04-03 16:01 ` Andrii Nakryiko
0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2025-04-03 15:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrii Nakryiko, linux-trace-kernel, peterz, bpf, linux-kernel,
kernel-team, mhocko, oleg, brauner, glider, mhiramat,
mathieu.desnoyers, akpm
* Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 3 Apr 2025 15:54:22 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
>
> > This feels really fragile, could you please at least add a comment
> > that points out that this is basically an extension of
> > sched_process_template, and that it should remain a subset of it,
> > or something to that end?
>
> Is there any dependency on this?
>
> I don't know of any other dependency to why this was a template other than
> to save memory.
Uhm, to state the obvious: to not replicate the same definitions over
and over again three times times, for 3 scheduler tracepoints that
share the record format?
Removing just a single sched_process_template use bloats the source and
adds in potential fragility:
2 files changed, 26 insertions(+), 4 deletions(-)
So my request is to please at least add a comment that points the
reader to the shared record format between sched_process_exit and the
other two tracepoints.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
2025-04-03 15:11 ` Ingo Molnar
@ 2025-04-03 16:01 ` Andrii Nakryiko
0 siblings, 0 replies; 7+ messages in thread
From: Andrii Nakryiko @ 2025-04-03 16:01 UTC (permalink / raw)
To: Ingo Molnar
Cc: Steven Rostedt, Andrii Nakryiko, linux-trace-kernel, peterz, bpf,
linux-kernel, kernel-team, mhocko, oleg, brauner, glider,
mhiramat, mathieu.desnoyers, akpm
On Thu, Apr 3, 2025 at 8:12 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Thu, 3 Apr 2025 15:54:22 +0200
> > Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > This feels really fragile, could you please at least add a comment
> > > that points out that this is basically an extension of
> > > sched_process_template, and that it should remain a subset of it,
> > > or something to that end?
> >
> > Is there any dependency on this?
> >
> > I don't know of any other dependency to why this was a template other than
> > to save memory.
>
> Uhm, to state the obvious: to not replicate the same definitions over
> and over again three times times, for 3 scheduler tracepoints that
> share the record format?
>
> Removing just a single sched_process_template use bloats the source and
> adds in potential fragility:
>
> 2 files changed, 26 insertions(+), 4 deletions(-)
>
> So my request is to please at least add a comment that points the
> reader to the shared record format between sched_process_exit and the
> other two tracepoints.
Sounds good, no problem. I'll send a follow up patch which Andrew can
fold, if he prefers.
>
> Thanks,
>
> Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-04-03 16:02 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-02 18:09 [PATCH v2] exit: move and extend sched_process_exit() tracepoint Andrii Nakryiko
2025-04-02 18:36 ` Steven Rostedt
2025-04-02 19:29 ` Oleg Nesterov
2025-04-03 13:54 ` Ingo Molnar
2025-04-03 15:06 ` Steven Rostedt
2025-04-03 15:11 ` Ingo Molnar
2025-04-03 16:01 ` Andrii Nakryiko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).