* [PATCH 0/2] fgraph: Free up function graph shadow stacks
@ 2024-10-24 9:27 Steven Rostedt
2024-10-24 9:27 ` [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done Steven Rostedt
2024-10-24 9:27 ` [PATCH 2/2] fgraph: Free ret_stack when task is done with it Steven Rostedt
0 siblings, 2 replies; 8+ messages in thread
From: Steven Rostedt @ 2024-10-24 9:27 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
Since the start of function graph tracing, shadow stacks were created for
every task on the system when the first instance of function graph was used.
But they were never freed, because a shadow stack may still hold a return
address that was hijacked for the function graph return trampoline. The
stacks were only freed when a task exits. That means once you use function
graph tracing, your system keeps a PAGE_SIZE shadow stack for every task
that was running while function graph was active. That's a lot of memory
wasted on stacks that are no longer being used.
This series addresses this by freeing shadow stacks that are no longer
being used. Whether a stack is still in use can be found by checking the
shadow stack pointer on the task structure.
When function graph is finished, it will free all the shadow stacks that are
no longer being used. For those still being used, the freeing is delayed
until the function graph return handler is called by the task and it pops
off the last return address. That will trigger an irq work which triggers a
work queue to do shadow stack clean up again. A static_branch is used so
that this check doesn't happen during normal tracing as it's in a very hot
path.
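As a rough illustration of the two cleanup points described above (one scan
when tracing stops, another once a task pops its last entry), here is a
minimal userspace sketch. It is not the kernel code: struct demo_task,
demo_pop() and demo_free_unused() are made-up names, and malloc()/free()
stand in for the kmem cache.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-task fields the series relies on. */
struct demo_task {
	unsigned long *ret_stack;	/* NULL once the shadow stack is freed */
	int curr_ret_depth;		/* -1 means nothing is saved on the stack */
};

/* Pop one saved return address; returns 1 when the stack just became empty. */
static int demo_pop(struct demo_task *t)
{
	assert(t->curr_ret_depth >= 0);
	return --t->curr_ret_depth < 0;
}

/* Free every shadow stack that is allocated but empty; return how many. */
static int demo_free_unused(struct demo_task *tasks, int nr)
{
	int freed = 0;

	for (int i = 0; i < nr; i++) {
		/* Skip tasks without a stack and tasks still using theirs. */
		if (!tasks[i].ret_stack || tasks[i].curr_ret_depth >= 0)
			continue;
		free(tasks[i].ret_stack);
		tasks[i].ret_stack = NULL;
		freed++;
	}
	return freed;
}
```

A caller would run demo_free_unused() once when tracing ends and again
whenever demo_pop() reports that a stack became empty.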
Note this patch series is based on my ftrace/urgent branch merged with my
ftrace/for-next branch (and some patches that haven't been pushed yet).
Steven Rostedt (2):
fgraph: Free ret_stacks when graph tracing is done
fgraph: Free ret_stack when task is done with it
----
kernel/trace/fgraph.c | 150 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 135 insertions(+), 15 deletions(-)
* [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done
2024-10-24 9:27 [PATCH 0/2] fgraph: Free up function graph shadow stacks Steven Rostedt
@ 2024-10-24 9:27 ` Steven Rostedt
2024-10-24 15:00 ` Masami Hiramatsu
2024-10-24 9:27 ` [PATCH 2/2] fgraph: Free ret_stack when task is done with it Steven Rostedt
1 sibling, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2024-10-24 9:27 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
From: Steven Rostedt <rostedt@goodmis.org>
Since function graph tracing was added to the kernel, it needed shadow
stacks for every process in order to be able to hijack the return address
and replace it with its own trampoline to trace when the function exits.
The first time function graph was used, it allocated PAGE_SIZE for each
task on the system (including idle tasks). But because these stacks may
still be in use long after tracing is done, they were never freed (except
when a task exits). That means any task that never exits (including kernel
tasks), would always have these shadow stacks allocated even when they
were no longer needed.
The race that needed to be avoided was tracing functions that sleep for
long periods of time (e.g. poll()). If such a function gets traced, its
original return address is saved on the shadow stack. That means the shadow
stack cannot be freed until the task is no longer using it.
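To make the constraint concrete, here is a small userspace model of the
save/restore that a shadow stack performs. It is only a sketch: struct
toy_task, the toy_* helpers and SHADOW_DEPTH_MAX are invented for
illustration, and the real code stores more than a bare return address per
frame.

```c
#include <assert.h>

#define SHADOW_DEPTH_MAX 16	/* illustrative; not the kernel's limit */

/* Toy per-task state: a depth of -1 means the shadow stack is empty. */
struct toy_task {
	unsigned long ret_stack[SHADOW_DEPTH_MAX];
	int curr_ret_depth;
};

static void toy_task_init(struct toy_task *t)
{
	t->curr_ret_depth = -1;
}

/*
 * Function entry: save the real return address on the shadow stack and
 * hand back the trampoline address that replaces it on the real stack.
 * Returns 0 when the shadow stack is full, meaning "do not hijack".
 */
static unsigned long toy_hijack_entry(struct toy_task *t,
				      unsigned long real_ret,
				      unsigned long trampoline)
{
	if (t->curr_ret_depth + 1 >= SHADOW_DEPTH_MAX)
		return 0;
	t->ret_stack[++t->curr_ret_depth] = real_ret;
	return trampoline;
}

/* Function exit (run from the trampoline): recover the real return address. */
static unsigned long toy_hijack_return(struct toy_task *t)
{
	assert(t->curr_ret_depth >= 0);
	return t->ret_stack[t->curr_ret_depth--];
}

/* The stack may only be freed once nothing is saved on it. */
static int toy_stack_in_use(const struct toy_task *t)
{
	return t->curr_ret_depth >= 0;
}
```

A task sleeping between toy_hijack_entry() and toy_hijack_return() still
needs its saved address, which is why its stack has to stay allocated.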
Luckily, it is easy to know if the task is done with its shadow stack.
After function graph is disabled, the shadow stack will never grow, and
once the last element is removed off of it, nothing will use it again.
When function graph is done and the last user unregisters, all the tasks
in the system can be examined and if the shadow stack depth
(curr_ret_depth) is negative, meaning the stack is empty, then the stack
can be freed. But since there are no memory barriers on the CPUs doing the
tracing, it has to be moved to a linked list first and then after a call to
synchronize_rcu_tasks_trace() the shadow stacks can be freed.
As the shadow stack is not going to grow anymore, the end of the shadow
stack can be used to store a structure that holds the list_head for the
linked list as well as a pointer to the task. This can be used to delay the
freeing until all the shadow stacks to be freed are added to the linked list
and the synchronize_rcu_tasks_trace() has finished.
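The trick of borrowing the unused tail of an idle stack for the list node
can be sketched in userspace as below. Everything here is an assumption made
for illustration: struct owner, struct free_note, STACK_LONGS and the two
helpers are invented names, a singly linked list replaces list_head, and the
grace period between the two phases is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define STACK_LONGS 512	/* illustrative stack size, in longs */

struct owner {
	unsigned long *ret_stack;
	int curr_ret_depth;	/* -1 means the stack is empty */
};

/* Bookkeeping written into the unused tail of an idle shadow stack. */
struct free_note {
	struct free_note *next;
	struct owner *task;
};

/* Number of longs needed to hold a free_note, rounded up. */
#define NOTE_LONGS ((sizeof(struct free_note) + sizeof(long) - 1) / sizeof(long))

/* Phase 1: if the stack is idle, link it onto @head using its own tail. */
static void note_idle_stack(struct owner *t, struct free_note **head)
{
	struct free_note *note;

	if (!t->ret_stack || t->curr_ret_depth >= 0)
		return;
	note = (struct free_note *)(t->ret_stack + STACK_LONGS - NOTE_LONGS);
	note->task = t;
	note->next = *head;
	*head = note;
}

/* Phase 2: after the (omitted) grace period, free everything on the list. */
static int free_noted_stacks(struct free_note *head)
{
	int freed = 0;

	while (head) {
		struct free_note *next = head->next;	/* read before freeing */
		struct owner *t = head->task;
		unsigned long *stack = t->ret_stack;

		assert(stack != NULL);
		t->ret_stack = NULL;
		free(stack);
		head = next;
		freed++;
	}
	return freed;
}
```

No extra allocation is needed to track the pending stacks, since each node
lives inside the buffer that is about to be freed.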
Note, tasks that are still using their shadow stack will not have them
freed. They will stay until the task exits or if another instance of
function graph is registered and unregistered and the shadow stack is no
longer being used.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/fgraph.c | 113 ++++++++++++++++++++++++++++++++++++------
1 file changed, 99 insertions(+), 14 deletions(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0b7cf2507569..3c7f115217b4 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1144,6 +1144,7 @@ void ftrace_graph_init_task(struct task_struct *t)
t->curr_ret_stack = 0;
t->curr_ret_depth = -1;
+ mutex_lock(&ftrace_lock);
if (ftrace_graph_active) {
unsigned long *ret_stack;
@@ -1155,6 +1156,7 @@ void ftrace_graph_init_task(struct task_struct *t)
return;
graph_init_task(t, ret_stack);
}
+ mutex_unlock(&ftrace_lock);
}
void ftrace_graph_exit_task(struct task_struct *t)
@@ -1292,19 +1294,106 @@ static void ftrace_graph_disable_direct(bool disable_branch)
fgraph_direct_gops = &fgraph_stub;
}
-/* The cpu_boot init_task->ret_stack will never be freed */
-static int fgraph_cpu_init(unsigned int cpu)
+static void __fgraph_cpu_init(unsigned int cpu)
{
if (!idle_task(cpu)->ret_stack)
ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+}
+
+static int fgraph_cpu_init(unsigned int cpu)
+{
+ if (ftrace_graph_active)
+ __fgraph_cpu_init(cpu);
return 0;
}
+struct ret_stack_free_data {
+ struct list_head list;
+ struct task_struct *task;
+};
+
+static void remove_ret_stack(struct task_struct *t, struct list_head *head, int list_index)
+{
+ struct ret_stack_free_data *free_data;
+
+ /* If the ret_stack is still in use, skip this */
+ if (t->curr_ret_depth >= 0)
+ return;
+
+ free_data = (struct ret_stack_free_data*)(t->ret_stack + list_index);
+ list_add(&free_data->list, head);
+ free_data->task = t;
+}
+
+static void free_ret_stacks(void)
+{
+ struct ret_stack_free_data *free_data, *n;
+ struct task_struct *g, *t;
+ LIST_HEAD(stacks);
+ int list_index;
+ int list_sz;
+ int cpu;
+
+ /* Calculate the size in longs to hold ret_stack_free_data */
+ list_sz = DIV_ROUND_UP(sizeof(struct ret_stack_free_data), sizeof(long));
+
+ /*
+ * We do not want to race with __ftrace_return_to_handler() where this
+ * CPU can see the update to curr_ret_depth going to zero before it
+ * actually does. As tracing is disabled, the ret_stack is not going
+ * to be used anymore and there will be no more callbacks. Use
+ * the top of the stack as the link list pointer to attach this
+ * ret_stack to @head. Then at the end, run an RCU trace synchronization
+ * which will guarantee that there are no more uses of the ret_stacks
+ * and they can all be freed.
+ */
+ list_index = SHADOW_STACK_MAX_OFFSET - list_sz;
+
+ read_lock(&tasklist_lock);
+ for_each_process_thread(g, t) {
+ if (t->ret_stack)
+ remove_ret_stack(t, &stacks, list_index);
+ }
+ read_unlock(&tasklist_lock);
+
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ t = idle_task(cpu);
+ if (t->ret_stack)
+ remove_ret_stack(t, &stacks, list_index);
+ }
+ cpus_read_unlock();
+
+ /* Make sure nothing is using the ret_stacks anymore */
+ synchronize_rcu_tasks_trace();
+
+ list_for_each_entry_safe(free_data, n, &stacks, list) {
+ unsigned long *stack = free_data->task->ret_stack;
+
+ free_data->task->ret_stack = NULL;
+ kmem_cache_free(fgraph_stack_cachep, stack);
+ }
+}
+
+static __init int fgraph_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
+ fgraph_cpu_init, NULL);
+ if (ret < 0) {
+ pr_warn("fgraph: Error to init cpu hotplug support\n");
+ return ret;
+ }
+ return 0;
+}
+core_initcall(fgraph_init)
+
int register_ftrace_graph(struct fgraph_ops *gops)
{
- static bool fgraph_initialized;
int command = 0;
int ret = 0;
+ int cpu;
int i = -1;
mutex_lock(&ftrace_lock);
@@ -1319,17 +1408,6 @@ int register_ftrace_graph(struct fgraph_ops *gops)
}
}
- if (!fgraph_initialized) {
- ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
- fgraph_cpu_init, NULL);
- if (ret < 0) {
- pr_warn("fgraph: Error to init cpu hotplug support\n");
- return ret;
- }
- fgraph_initialized = true;
- ret = 0;
- }
-
if (!fgraph_array[0]) {
/* The array must always have real data on it */
for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
@@ -1346,6 +1424,12 @@ int register_ftrace_graph(struct fgraph_ops *gops)
ftrace_graph_active++;
+ cpus_read_lock();
+ for_each_online_cpu(cpu) {
+ __fgraph_cpu_init(cpu);
+ }
+ cpus_read_unlock();
+
if (ftrace_graph_active == 2)
ftrace_graph_disable_direct(true);
@@ -1418,6 +1502,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
ftrace_graph_entry = ftrace_graph_entry_stub;
unregister_pm_notifier(&ftrace_suspend_notifier);
unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+ free_ret_stacks();
}
out:
gops->saved_func = NULL;
--
2.45.2
* [PATCH 2/2] fgraph: Free ret_stack when task is done with it
2024-10-24 9:27 [PATCH 0/2] fgraph: Free up function graph shadow stacks Steven Rostedt
2024-10-24 9:27 ` [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done Steven Rostedt
@ 2024-10-24 9:27 ` Steven Rostedt
2024-10-24 15:21 ` Masami Hiramatsu
1 sibling, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2024-10-24 9:27 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
From: Steven Rostedt <rostedt@goodmis.org>
When function graph is done, the shadow stacks are only freed for those
tasks that are no longer using them. That's because a function that does a
long sleep (like poll) could be traced, and its original return address on
the stack has been replaced with a pointer to a trampoline, but that return
address is saved on the shadow stack. The shadow stack cannot be freed
until the function returns and there are no more return addresses being
stored on it.
Add a static_branch test in the return part of the function graph code
that is called after the return address on the shadow stack is popped. If
the shadow stack is empty, call an irq_work that will call a work queue
that will run the shadow stack freeing code again. This will clean up all
the shadow stacks that were not removed when function graph ended but are
no longer being used.
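A userspace sketch of the gating described above may help: the check costs
nothing while tracing is active, and only one cleanup request is outstanding
at a time. The names are hypothetical; plain booleans stand in for the
static key and for the pending state of the irq_work, which, as far as I
understand it, is not queued again while it is already pending.

```c
#include <assert.h>
#include <stdbool.h>

static bool cleanup_enabled;	/* stands in for the static key */
static bool cleanup_pending;	/* stands in for the queued irq_work */
static int cleanup_requests;	/* how many times work was actually queued */

/* Queue the cleanup once; further requests are ignored until it has run. */
static void request_cleanup(void)
{
	if (cleanup_pending)
		return;
	cleanup_pending = true;
	cleanup_requests++;
}

/* Called by the deferred worker when it starts running. */
static void cleanup_started(void)
{
	cleanup_pending = false;
}

/* Tail of the return handler: pop one entry, then maybe ask for cleanup. */
static void on_function_return(int *curr_ret_depth)
{
	assert(*curr_ret_depth >= 0);
	(*curr_ret_depth)--;
	if (cleanup_enabled && *curr_ret_depth < 0)
		request_cleanup();
}
```

The order of the two tests matters for cost: the enabled flag is checked
first, so the depth comparison only runs after tracing has stopped.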
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/fgraph.c | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 3c7f115217b4..7520ceba7748 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -174,6 +174,11 @@ int ftrace_graph_active;
static struct kmem_cache *fgraph_stack_cachep;
+DEFINE_STATIC_KEY_FALSE(fgraph_ret_stack_cleanup);
+static struct workqueue_struct *fgraph_ret_stack_wq;
+static struct work_struct fgraph_ret_stack_work;
+static struct irq_work fgraph_ret_stack_irq_work;
+
static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
static unsigned long fgraph_array_bitmask;
@@ -849,8 +854,15 @@ static unsigned long __ftrace_return_to_handler(struct fgraph_ret_regs *ret_regs
*/
barrier();
current->curr_ret_stack = offset - FGRAPH_FRAME_OFFSET;
-
current->curr_ret_depth--;
+
+ /*
+ * If function graph is done and this task is no longer using ret_stack
+ * then start the work to free it.
+ */
+ if (static_branch_unlikely(&fgraph_ret_stack_cleanup) && current->curr_ret_depth < 0)
+ irq_work_queue(&fgraph_ret_stack_irq_work);
+
return ret;
}
@@ -1375,6 +1387,21 @@ static void free_ret_stacks(void)
}
}
+static void fgraph_ret_stack_work_func(struct work_struct *work)
+{
+ mutex_lock(&ftrace_lock);
+ if (!ftrace_graph_active)
+ free_ret_stacks();
+ mutex_unlock(&ftrace_lock);
+}
+
+static void fgraph_ret_stack_irq_func(struct irq_work *iwork)
+{
+ if (unlikely(!fgraph_ret_stack_wq))
+ return;
+ queue_work(fgraph_ret_stack_wq, &fgraph_ret_stack_work);
+}
+
static __init int fgraph_init(void)
{
int ret;
@@ -1385,6 +1412,12 @@ static __init int fgraph_init(void)
pr_warn("fgraph: Error to init cpu hotplug support\n");
return ret;
}
+ fgraph_ret_stack_wq = alloc_workqueue("fgraph_ret_stack_wq",
+ WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
+ WARN_ON(!fgraph_ret_stack_wq);
+
+ INIT_WORK(&fgraph_ret_stack_work, fgraph_ret_stack_work_func);
+ init_irq_work(&fgraph_ret_stack_irq_work, fgraph_ret_stack_irq_func);
return 0;
}
core_initcall(fgraph_init)
@@ -1434,6 +1467,7 @@ int register_ftrace_graph(struct fgraph_ops *gops)
ftrace_graph_disable_direct(true);
if (ftrace_graph_active == 1) {
+ static_branch_disable(&fgraph_ret_stack_cleanup);
ftrace_graph_enable_direct(false, gops);
register_pm_notifier(&ftrace_suspend_notifier);
ret = start_graph_tracing();
@@ -1502,6 +1536,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
ftrace_graph_entry = ftrace_graph_entry_stub;
unregister_pm_notifier(&ftrace_suspend_notifier);
unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+ static_branch_enable(&fgraph_ret_stack_cleanup);
free_ret_stacks();
}
out:
--
2.45.2
* Re: [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done
2024-10-24 9:27 ` [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done Steven Rostedt
@ 2024-10-24 15:00 ` Masami Hiramatsu
2024-10-25 1:00 ` Steven Rostedt
0 siblings, 1 reply; 8+ messages in thread
From: Masami Hiramatsu @ 2024-10-24 15:00 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra
Hi,
On Thu, 24 Oct 2024 05:27:24 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Since function graph tracing was added to the kernel, it needed shadow
> stacks for every process in order to be able to hijack the return address
> and replace it with its own trampoline to trace when the function exits.
> The first time function graph was used, it allocated PAGE_SIZE for each
> task on the system (including idle tasks). But because these stacks may
> still be in use long after tracing is done, they were never freed (except
> when a task exits). That means any task that never exits (including kernel
> tasks), would always have these shadow stacks allocated even when they
> were no longer needed.
>
> The race that needed to be avoided was tracing functions that sleep for
> long periods of time (i.e. poll()). If it gets traced, its original return
> address is saved on the shadow stack. That means the shadow stack can not
> be freed until the task is no longer using it.
>
> Luckily, it is easy to know if the task is done with its shadow stack.
> After function graph is disabled, the shadow stack will never grow, and
> once the last element is removed off of it, nothing will use it again.
>
> When function graph is done and the last user unregisters, all the tasks
> in the system can be examined and if the shadow stack pointer
> (curr_ret_depth), is zero, then it can be freed. But since there's no
> memory barriers on the CPUs doing the tracing, it has to be moved to a
> link list first and then after a call to synchronize_rcu_tasks_trace() the
> shadow stacks can be freed.
>
> As the shadow stack is not going to grow anymore, the end of the shadow
> stack can be used to store a structure that holds the list_head for the
> link list as well as a pointer to the task. This can be used to delay the
> freeing until all the shadow stacks to be freed are added to the link list
> and the synchronize_rcu_tasks_trace() has finished.
>
> Note, tasks that are still using their shadow stack will not have them
> freed. They will stay until the task exits or if another instance of
> function graph is registered and unregistered and the shadow stack is no
> longer being used.
>
This needs one fix and some comments below. Except for those, it looks
good to me.
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/trace/fgraph.c | 113 ++++++++++++++++++++++++++++++++++++------
> 1 file changed, 99 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 0b7cf2507569..3c7f115217b4 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -1144,6 +1144,7 @@ void ftrace_graph_init_task(struct task_struct *t)
> t->curr_ret_stack = 0;
> t->curr_ret_depth = -1;
>
> + mutex_lock(&ftrace_lock);
> if (ftrace_graph_active) {
> unsigned long *ret_stack;
>
> @@ -1155,6 +1156,7 @@ void ftrace_graph_init_task(struct task_struct *t)
> return;
The `return;` above leaves the function without unlocking ftrace_lock. B^)
> graph_init_task(t, ret_stack);
> }
> + mutex_unlock(&ftrace_lock);
> }
>
> void ftrace_graph_exit_task(struct task_struct *t)
> @@ -1292,19 +1294,106 @@ static void ftrace_graph_disable_direct(bool disable_branch)
> fgraph_direct_gops = &fgraph_stub;
> }
>
> -/* The cpu_boot init_task->ret_stack will never be freed */
> -static int fgraph_cpu_init(unsigned int cpu)
> +static void __fgraph_cpu_init(unsigned int cpu)
> {
> if (!idle_task(cpu)->ret_stack)
> ftrace_graph_init_idle_task(idle_task(cpu), cpu);
> +}
> +
> +static int fgraph_cpu_init(unsigned int cpu)
> +{
> + if (ftrace_graph_active)
> + __fgraph_cpu_init(cpu);
> return 0;
> }
>
> +struct ret_stack_free_data {
> + struct list_head list;
> + struct task_struct *task;
> +};
> +
> +static void remove_ret_stack(struct task_struct *t, struct list_head *head, int list_index)
> +{
> + struct ret_stack_free_data *free_data;
> +
> + /* If the ret_stack is still in use, skip this */
> + if (t->curr_ret_depth >= 0)
> + return;
> +
> + free_data = (struct ret_stack_free_data*)(t->ret_stack + list_index);
> + list_add(&free_data->list, head);
> + free_data->task = t;
> +}
> +
> +static void free_ret_stacks(void)
> +{
> + struct ret_stack_free_data *free_data, *n;
> + struct task_struct *g, *t;
> + LIST_HEAD(stacks);
> + int list_index;
> + int list_sz;
> + int cpu;
> +
> + /* Calculate the size in longs to hold ret_stack_free_data */
> + list_sz = DIV_ROUND_UP(sizeof(struct ret_stack_free_data), sizeof(long));
> +
> + /*
> + * We do not want to race with __ftrace_return_to_handler() where this
> + * CPU can see the update to curr_ret_depth going to zero before it
> + * actually does. As tracing is disabled, the ret_stack is not going
> + * to be used anymore and there will be no more callbacks. Use
> + * the top of the stack as the link list pointer to attach this
> + * ret_stack to @head. Then at the end, run an RCU trace synchronization
> + * which will guarantee that there are no more uses of the ret_stacks
> + * and they can all be freed.
Just a comment.
This part can be misleading: the ret_stacks referred to here are only those
that may still be touched by callbacks currently running on other CPUs. Some
other ret_stacks are still in use because their owner tasks are sleeping.
> + */
> + list_index = SHADOW_STACK_MAX_OFFSET - list_sz;
> +
> + read_lock(&tasklist_lock);
> + for_each_process_thread(g, t) {
> + if (t->ret_stack)
> + remove_ret_stack(t, &stacks, list_index);
> + }
> + read_unlock(&tasklist_lock);
> +
> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + t = idle_task(cpu);
> + if (t->ret_stack)
> + remove_ret_stack(t, &stacks, list_index);
> + }
> + cpus_read_unlock();
> +
> + /* Make sure nothing is using the ret_stacks anymore */
> + synchronize_rcu_tasks_trace();
> +
> + list_for_each_entry_safe(free_data, n, &stacks, list) {
> + unsigned long *stack = free_data->task->ret_stack;
> +
> + free_data->task->ret_stack = NULL;
> + kmem_cache_free(fgraph_stack_cachep, stack);
> + }
> +}
> +
> +static __init int fgraph_init(void)
> +{
> + int ret;
> +
> + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
nit: Shouldn't we update the name first?
Thank you,
> + fgraph_cpu_init, NULL);
> + if (ret < 0) {
> + pr_warn("fgraph: Error to init cpu hotplug support\n");
> + return ret;
> + }
> + return 0;
> +}
> +core_initcall(fgraph_init)
> +
> int register_ftrace_graph(struct fgraph_ops *gops)
> {
> - static bool fgraph_initialized;
> int command = 0;
> int ret = 0;
> + int cpu;
> int i = -1;
>
> mutex_lock(&ftrace_lock);
> @@ -1319,17 +1408,6 @@ int register_ftrace_graph(struct fgraph_ops *gops)
> }
> }
>
> - if (!fgraph_initialized) {
> - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
> - fgraph_cpu_init, NULL);
> - if (ret < 0) {
> - pr_warn("fgraph: Error to init cpu hotplug support\n");
> - return ret;
> - }
> - fgraph_initialized = true;
> - ret = 0;
> - }
> -
> if (!fgraph_array[0]) {
> /* The array must always have real data on it */
> for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
> @@ -1346,6 +1424,12 @@ int register_ftrace_graph(struct fgraph_ops *gops)
>
> ftrace_graph_active++;
>
> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + __fgraph_cpu_init(cpu);
> + }
> + cpus_read_unlock();
> +
> if (ftrace_graph_active == 2)
> ftrace_graph_disable_direct(true);
>
> @@ -1418,6 +1502,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
> ftrace_graph_entry = ftrace_graph_entry_stub;
> unregister_pm_notifier(&ftrace_suspend_notifier);
> unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
> + free_ret_stacks();
> }
> out:
> gops->saved_func = NULL;
> --
> 2.45.2
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH 2/2] fgraph: Free ret_stack when task is done with it
2024-10-24 9:27 ` [PATCH 2/2] fgraph: Free ret_stack when task is done with it Steven Rostedt
@ 2024-10-24 15:21 ` Masami Hiramatsu
2024-10-25 1:05 ` Steven Rostedt
0 siblings, 1 reply; 8+ messages in thread
From: Masami Hiramatsu @ 2024-10-24 15:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On Thu, 24 Oct 2024 05:27:25 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> The shadow stack used for function graph is only freed when function graph
> is done for those tasks that are no longer using them. That's because a
> function that does a long sleep (like poll) could be traced, and its
> original return address on the stack has been replaced with a pointer to a
> trampoline, but that return address is saved on the shadow stack. It can
> not be freed until the function returns and there's no more return
> addresses being stored on the shadow stack.
>
> Add a static_branch test in the return part of the function graph code
> that is called after the return address on the shadow stack is popped. If
> the shadow stack is empty, call an irq_work that will call a work queue
> that will run the shadow stack freeing code again. This will clean up all
> the shadow stacks that were not removed when function graph ended but are
> no longer being used.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/trace/fgraph.c | 37 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 3c7f115217b4..7520ceba7748 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -174,6 +174,11 @@ int ftrace_graph_active;
>
> static struct kmem_cache *fgraph_stack_cachep;
>
> +DEFINE_STATIC_KEY_FALSE(fgraph_ret_stack_cleanup);
> +static struct workqueue_struct *fgraph_ret_stack_wq;
> +static struct work_struct fgraph_ret_stack_work;
> +static struct irq_work fgraph_ret_stack_irq_work;
> +
> static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
> static unsigned long fgraph_array_bitmask;
>
> @@ -849,8 +854,15 @@ static unsigned long __ftrace_return_to_handler(struct fgraph_ret_regs *ret_regs
> */
> barrier();
> current->curr_ret_stack = offset - FGRAPH_FRAME_OFFSET;
> -
> current->curr_ret_depth--;
> +
> + /*
> + * If function graph is done and this task is no longer using ret_stack
> + * then start the work to free it.
> + */
> + if (static_branch_unlikely(&fgraph_ret_stack_cleanup) && current->curr_ret_depth < 0)
> + irq_work_queue(&fgraph_ret_stack_irq_work);
> +
> return ret;
> }
>
> @@ -1375,6 +1387,21 @@ static void free_ret_stacks(void)
> }
> }
>
> +static void fgraph_ret_stack_work_func(struct work_struct *work)
> +{
> + mutex_lock(&ftrace_lock);
> + if (!ftrace_graph_active)
> + free_ret_stacks();
> + mutex_unlock(&ftrace_lock);
> +}
Hmm, will you scan all tasks every time? Shouldn't we have another global
list of skipped tasks in remove_ret_stack(), like below?
static void remove_ret_stack(struct task_struct *t, struct list_head *freelist,
			     struct list_head *skiplist, int list_index)
{
	struct ret_stack_free_data *free_data;
	struct list_head *head;

	/* If the ret_stack is still in use, skip this */
	if (t->curr_ret_depth >= 0)
		head = skiplist;
	else
		head = freelist;

	free_data = (struct ret_stack_free_data *)(t->ret_stack + list_index);
	list_add(&free_data->list, head);
	free_data->task = t;
}
Then free_ret_stacks() only needs to scan the skiplist when it is called from
fgraph_ret_stack_work_func(). Of course this will need to decouple building
the freelist/skiplist from the function that actually frees the stacks.
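For illustration only, the rescan step this idea implies could look roughly
like the userspace sketch below. It is not the proposed kernel change: the
sk_* names are invented, a singly linked list replaces list_head, and no
locking or RCU synchronization is modelled.

```c
#include <assert.h>
#include <stddef.h>

struct sk_task {
	int curr_ret_depth;	/* -1 means the shadow stack is empty */
	struct sk_task *next;	/* link for whichever list holds the task */
};

/* Push @t on the front of @list. */
static void sk_push(struct sk_task **list, struct sk_task *t)
{
	assert(t != NULL);
	t->next = *list;
	*list = t;
}

/* First pass over all tasks: sort each one onto the matching list. */
static void sk_classify(struct sk_task *t, struct sk_task **skiplist,
			struct sk_task **freelist)
{
	sk_push(t->curr_ret_depth >= 0 ? skiplist : freelist, t);
}

/*
 * Later passes: move every task on @skiplist whose stack is now empty
 * to @freelist. Returns the number of tasks moved.
 */
static int sk_rescan(struct sk_task **skiplist, struct sk_task **freelist)
{
	struct sk_task **pp = skiplist;
	int moved = 0;

	while (*pp) {
		struct sk_task *t = *pp;

		if (t->curr_ret_depth >= 0) {
			pp = &t->next;	/* still busy, keep it on the list */
			continue;
		}
		*pp = t->next;		/* unlink from the skiplist */
		sk_push(freelist, t);
		moved++;
	}
	return moved;
}
```

With this split, the work function walks only the tasks that were busy last
time instead of every task on the system.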
Thank you,
> +
> +static void fgraph_ret_stack_irq_func(struct irq_work *iwork)
> +{
> + if (unlikely(!fgraph_ret_stack_wq))
> + return;
> + queue_work(fgraph_ret_stack_wq, &fgraph_ret_stack_work);
> +}
> +
> static __init int fgraph_init(void)
> {
> int ret;
> @@ -1385,6 +1412,12 @@ static __init int fgraph_init(void)
> pr_warn("fgraph: Error to init cpu hotplug support\n");
> return ret;
> }
> + fgraph_ret_stack_wq = alloc_workqueue("fgraph_ret_stack_wq",
> + WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
> + WARN_ON(!fgraph_ret_stack_wq);
> +
> + INIT_WORK(&fgraph_ret_stack_work, fgraph_ret_stack_work_func);
> + init_irq_work(&fgraph_ret_stack_irq_work, fgraph_ret_stack_irq_func);
> return 0;
> }
> core_initcall(fgraph_init)
> @@ -1434,6 +1467,7 @@ int register_ftrace_graph(struct fgraph_ops *gops)
> ftrace_graph_disable_direct(true);
>
> if (ftrace_graph_active == 1) {
> + static_branch_disable(&fgraph_ret_stack_cleanup);
> ftrace_graph_enable_direct(false, gops);
> register_pm_notifier(&ftrace_suspend_notifier);
> ret = start_graph_tracing();
> @@ -1502,6 +1536,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
> ftrace_graph_entry = ftrace_graph_entry_stub;
> unregister_pm_notifier(&ftrace_suspend_notifier);
> unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
> + static_branch_enable(&fgraph_ret_stack_cleanup);
> free_ret_stacks();
> }
> out:
> --
> 2.45.2
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH 1/2] fgraph: Free ret_stacks when graph tracing is done
2024-10-24 15:00 ` Masami Hiramatsu
@ 2024-10-25 1:00 ` Steven Rostedt
0 siblings, 0 replies; 8+ messages in thread
From: Steven Rostedt @ 2024-10-25 1:00 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Thomas Gleixner, Peter Zijlstra
On Fri, 25 Oct 2024 00:00:44 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > ---
> > kernel/trace/fgraph.c | 113 ++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 99 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> > index 0b7cf2507569..3c7f115217b4 100644
> > --- a/kernel/trace/fgraph.c
> > +++ b/kernel/trace/fgraph.c
> > @@ -1144,6 +1144,7 @@ void ftrace_graph_init_task(struct task_struct *t)
> > t->curr_ret_stack = 0;
> > t->curr_ret_depth = -1;
> >
> > + mutex_lock(&ftrace_lock);
> > if (ftrace_graph_active) {
> > unsigned long *ret_stack;
> >
> > @@ -1155,6 +1156,7 @@ void ftrace_graph_init_task(struct task_struct *t)
> > return;
>
> The above `return;` shows that you miss unlocking ftrace_lock. B^)
Bah, I added this locking after doing most of my tests and then seeing
that this needed protection. The return was here before the mutex, but I
missed it when I added the mutex calls. I'll switch this to use guard().
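For readers unfamiliar with the guard pattern, here is a userspace sketch of
how a scope-based lock avoids this class of bug. It relies on the GCC/Clang
cleanup attribute, which is the mechanism the kernel's guard() helpers are
built on; struct toy_lock, TOY_GUARD() and init_task_sketch() are invented
for the example and are not the kernel API.

```c
#include <assert.h>

/* Toy lock so the sketch has no threading dependency. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

/* Cleanup handler: receives a pointer to the guarded variable. */
static void toy_lock_release(struct toy_lock **lp)
{
	(*lp)->held = 0;
}

/* Take @l now and release it automatically when the enclosing scope ends. */
#define TOY_GUARD(l) \
	struct toy_lock *toy_guard_ptr \
		__attribute__((cleanup(toy_lock_release), unused)) = \
		(toy_lock_acquire(l), (l))

/* Mirrors the shape of the buggy hunk: an early return in the locked region. */
static int init_task_sketch(struct toy_lock *l, int active, int alloc_ok)
{
	TOY_GUARD(l);

	if (active) {
		if (!alloc_ok)
			return -1;	/* the lock is still dropped here */
		/* ... initialise the task ... */
	}
	return 0;
}
```

Every exit path, including the early return, runs the release handler, so
there is no unlock call left to forget.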
>
> > graph_init_task(t, ret_stack);
> > }
> > + mutex_unlock(&ftrace_lock);
> > }
> >
> > void ftrace_graph_exit_task(struct task_struct *t)
> > @@ -1292,19 +1294,106 @@ static void ftrace_graph_disable_direct(bool disable_branch)
> > fgraph_direct_gops = &fgraph_stub;
> > }
> >
> > -/* The cpu_boot init_task->ret_stack will never be freed */
> > -static int fgraph_cpu_init(unsigned int cpu)
> > +static void __fgraph_cpu_init(unsigned int cpu)
> > {
> > if (!idle_task(cpu)->ret_stack)
> > ftrace_graph_init_idle_task(idle_task(cpu), cpu);
> > +}
> > +
> > +static int fgraph_cpu_init(unsigned int cpu)
> > +{
> > + if (ftrace_graph_active)
> > + __fgraph_cpu_init(cpu);
> > return 0;
> > }
> >
> > +struct ret_stack_free_data {
> > + struct list_head list;
> > + struct task_struct *task;
> > +};
> > +
> > +static void remove_ret_stack(struct task_struct *t, struct list_head *head, int list_index)
> > +{
> > + struct ret_stack_free_data *free_data;
> > +
> > + /* If the ret_stack is still in use, skip this */
> > + if (t->curr_ret_depth >= 0)
> > + return;
> > +
> > + free_data = (struct ret_stack_free_data*)(t->ret_stack + list_index);
> > + list_add(&free_data->list, head);
> > + free_data->task = t;
> > +}
> > +
> > +static void free_ret_stacks(void)
> > +{
> > + struct ret_stack_free_data *free_data, *n;
> > + struct task_struct *g, *t;
> > + LIST_HEAD(stacks);
> > + int list_index;
> > + int list_sz;
> > + int cpu;
> > +
> > + /* Calculate the size in longs to hold ret_stack_free_data */
> > + list_sz = DIV_ROUND_UP(sizeof(struct ret_stack_free_data), sizeof(long));
> > +
> > + /*
> > + * We do not want to race with __ftrace_return_to_handler() where this
> > + * CPU can see the update to curr_ret_depth going to zero before it
> > + * actually does. As tracing is disabled, the ret_stack is not going
> > + * to be used anymore and there will be no more callbacks. Use
> > + * the top of the stack as the link list pointer to attach this
>
> > + * ret_stack to @head. Then at the end, run an RCU trace synchronization
> > + * which will guarantee that there are no more uses of the ret_stacks
> > + * and they can all be freed.
>
> Just a comment.
> This part can mislead, the ret_stacks here are the ret_stacks which can be
> used by currently running callbacks on other CPUs. Some other ret_stack are
> still used and the owner tasks are in sleep.
OK, I'll update the comment.
>
> > + */
> > + list_index = SHADOW_STACK_MAX_OFFSET - list_sz;
> > +
> > + read_lock(&tasklist_lock);
> > + for_each_process_thread(g, t) {
> > + if (t->ret_stack)
> > + remove_ret_stack(t, &stacks, list_index);
> > + }
> > + read_unlock(&tasklist_lock);
> > +
> > + cpus_read_lock();
> > + for_each_online_cpu(cpu) {
> > + t = idle_task(cpu);
> > + if (t->ret_stack)
> > + remove_ret_stack(t, &stacks, list_index);
> > + }
> > + cpus_read_unlock();
> > +
> > + /* Make sure nothing is using the ret_stacks anymore */
> > + synchronize_rcu_tasks_trace();
> > +
> > + list_for_each_entry_safe(free_data, n, &stacks, list) {
> > + unsigned long *stack = free_data->task->ret_stack;
> > +
> > + free_data->task->ret_stack = NULL;
> > + kmem_cache_free(fgraph_stack_cachep, stack);
> > + }
> > +}
> > +
> > +static __init int fgraph_init(void)
> > +{
> > + int ret;
> > +
> > + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
>
> nit: Shouldn't we update the name first?
Heh, I guess ;-)
Thanks for the review!
-- Steve
>
>
> Thank you,
>
> > + fgraph_cpu_init, NULL);
> > + if (ret < 0) {
> > + pr_warn("fgraph: Error to init cpu hotplug support\n");
> > + return ret;
> > + }
> > + return 0;
> > +}
> > +core_initcall(fgraph_init)
> > +
> > int register_ftrace_graph(struct fgraph_ops *gops)
> > {
> > - static bool fgraph_initialized;
> > int command = 0;
> > int ret = 0;
> > + int cpu;
> > int i = -1;
> >
> > mutex_lock(&ftrace_lock);
> > @@ -1319,17 +1408,6 @@ int register_ftrace_graph(struct fgraph_ops *gops)
> > }
> > }
> >
> > - if (!fgraph_initialized) {
> > - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
> > - fgraph_cpu_init, NULL);
> > - if (ret < 0) {
> > - pr_warn("fgraph: Error to init cpu hotplug support\n");
> > - return ret;
> > - }
> > - fgraph_initialized = true;
> > - ret = 0;
> > - }
> > -
> > if (!fgraph_array[0]) {
> > /* The array must always have real data on it */
> > for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
> > @@ -1346,6 +1424,12 @@ int register_ftrace_graph(struct fgraph_ops *gops)
> >
> > ftrace_graph_active++;
> >
> > + cpus_read_lock();
> > + for_each_online_cpu(cpu) {
> > + __fgraph_cpu_init(cpu);
> > + }
> > + cpus_read_unlock();
> > +
> > if (ftrace_graph_active == 2)
> > ftrace_graph_disable_direct(true);
> >
> > @@ -1418,6 +1502,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
> > ftrace_graph_entry = ftrace_graph_entry_stub;
> > unregister_pm_notifier(&ftrace_suspend_notifier);
> > unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
> > + free_ret_stacks();
> > }
> > out:
> > gops->saved_func = NULL;
> > --
> > 2.45.2
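
The comment in free_ret_stacks() describes reusing the top of each ret_stack
as the list node that chains the stacks together before they are freed. A
minimal userspace sketch of that idea follows. It is not the kernel code:
struct task, struct free_data, STACK_LONGS, remove_stack() and free_stacks()
are hypothetical stand-ins, a plain singly linked list replaces list_head,
free() replaces kmem_cache_free(), and the RCU grace period is only marked
by a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for SHADOW_STACK_MAX_OFFSET; the size is illustrative. */
#define STACK_LONGS 512

struct task {
	unsigned long *ret_stack;	/* NULL when the task has no shadow stack */
};

/* Bookkeeping overlaid on the last few longs of a stack being retired. */
struct free_data {
	struct free_data *next;
	struct task *task;
};

/* Number of longs needed to hold struct free_data, rounded up. */
static size_t free_data_longs(void)
{
	return (sizeof(struct free_data) + sizeof(long) - 1) / sizeof(long);
}

/* Link t's stack onto *head by writing the node into the stack's tail. */
static void remove_stack(struct task *t, struct free_data **head)
{
	size_t index = STACK_LONGS - free_data_longs();
	struct free_data *fd;

	assert(t->ret_stack != NULL);
	fd = (struct free_data *)(t->ret_stack + index);
	fd->task = t;
	fd->next = *head;
	*head = fd;
}

/* Collect every stack first, then free them; returns the number freed. */
static int free_stacks(struct task *tasks, size_t ntasks)
{
	struct free_data *head = NULL, *fd, *next;
	int freed = 0;

	for (size_t i = 0; i < ntasks; i++)
		if (tasks[i].ret_stack)
			remove_stack(&tasks[i], &head);

	/* The kernel patch waits for an RCU tasks-trace grace period here. */

	for (fd = head; fd; fd = next) {
		unsigned long *stack = fd->task->ret_stack;

		next = fd->next;	/* read the link before its memory is freed */
		fd->task->ret_stack = NULL;
		free(stack);
		freed++;
	}
	return freed;
}
```

Because the node lives inside the stack that is about to be freed, no extra
allocation is needed while collecting, which is why the next pointer has to
be read before the stack is released.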
* Re: [PATCH 2/2] fgraph: Free ret_stack when task is done with it
2024-10-24 15:21 ` Masami Hiramatsu
@ 2024-10-25 1:05 ` Steven Rostedt
2024-10-25 3:31 ` Masami Hiramatsu
0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2024-10-25 1:05 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Thomas Gleixner, Peter Zijlstra
On Fri, 25 Oct 2024 00:21:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > +static void fgraph_ret_stack_work_func(struct work_struct *work)
> > +{
> > + mutex_lock(&ftrace_lock);
> > + if (!ftrace_graph_active)
> > + free_ret_stacks();
> > + mutex_unlock(&ftrace_lock);
> > +}
>
> Hmm, will you scan all tasks every time? Shouldn't we have another global
> list of skipped tasks in remove_ret_stack(), like below?
>
> static void remove_ret_stack(struct task_struct *t, struct list_head *freelist, struct list_head *skiplist, int list_index)
> {
> struct ret_stack_free_data *free_data;
> struct list_head *head;
>
> /* If the ret_stack is still in use, skip this */
> if (t->curr_ret_depth >= 0)
> head = skiplist;
> else
> head = freelist;
>
> free_data = (struct ret_stack_free_data*)(t->ret_stack + list_index);
> list_add(&free_data->list, head);
> free_data->task = t;
> }
>
> Then we can scan only skiplist in free_ret_stacks() in fgraph_ret_stack_work_func().
>
> Of course this will need to decouple preparing freelist/skiplist and
> actual free function.
I thought about doing it this way, but I felt that it made the code
more complex with little benefit. Yeah, we scan all tasks, but it only
happens in a work queue that is grabbing the ftrace_lock mutex. If
anything, I'd rather keep it this way and if it ends up being an issue we
can change it later.
One thing Thomas always says is "correctness first, optimize later".
This is much easier to get correct. Adding a skip list will add
complexity. Like I said, nothing prevents us from adding that feature
later, and if it ends up buggy, we can know which change caused the bug.
-- Steve
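
For comparison, the skip-list variant discussed above could look roughly
like the userspace sketch below. It is only an illustration of the trade-off,
not code from either patch: struct task, its skip_next link and free_idle()
are hypothetical, the link is kept in the task rather than in the ret_stack,
and free() stands in for kmem_cache_free(). Each pass frees the stacks of
tasks that are done and returns the tasks that are still in use, so a later
pass only walks that shorter chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct task {
	int curr_ret_depth;		/* >= 0 while the ret_stack is still in use */
	unsigned long *ret_stack;
	struct task *skip_next;		/* hypothetical link used to chain tasks */
};

/*
 * Walk the chain once. Stacks of tasks that are done are freed and
 * counted in *freed; tasks still in use are chained onto a new skip
 * list, which is returned for the next pass.
 */
static struct task *free_idle(struct task *list, int *freed)
{
	struct task *skip = NULL, *t, *next;

	assert(freed != NULL);

	for (t = list; t; t = next) {
		next = t->skip_next;
		if (t->curr_ret_depth >= 0) {
			/* Still in use: defer to a later pass. */
			t->skip_next = skip;
			skip = t;
			continue;
		}
		free(t->ret_stack);
		t->ret_stack = NULL;
		t->skip_next = NULL;
		(*freed)++;
	}
	return skip;
}
```

The extra state (a second list that has to stay consistent across passes and
across task exit) is the added complexity weighed against the cost of
rescanning all tasks.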
* Re: [PATCH 2/2] fgraph: Free ret_stack when task is done with it
2024-10-25 1:05 ` Steven Rostedt
@ 2024-10-25 3:31 ` Masami Hiramatsu
0 siblings, 0 replies; 8+ messages in thread
From: Masami Hiramatsu @ 2024-10-25 3:31 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Thomas Gleixner, Peter Zijlstra
On Thu, 24 Oct 2024 21:05:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 25 Oct 2024 00:21:21 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > +static void fgraph_ret_stack_work_func(struct work_struct *work)
> > > +{
> > > + mutex_lock(&ftrace_lock);
> > > + if (!ftrace_graph_active)
> > > + free_ret_stacks();
> > > + mutex_unlock(&ftrace_lock);
> > > +}
> >
> > > Hmm, will you scan all tasks every time? Shouldn't we have another global
> > list of skipped tasks in remove_ret_stack(), like below?
> >
> > static void remove_ret_stack(struct task_struct *t, struct list_head *freelist, struct list_head *skiplist, int list_index)
> > {
> > struct ret_stack_free_data *free_data;
> > struct list_head *head;
> >
> > /* If the ret_stack is still in use, skip this */
> > if (t->curr_ret_depth >= 0)
> > head = skiplist;
> > else
> > head = freelist;
> >
> > free_data = (struct ret_stack_free_data*)(t->ret_stack + list_index);
> > list_add(&free_data->list, head);
> > free_data->task = t;
> > }
> >
> > Then we can scan only skiplist in free_ret_stacks() in fgraph_ret_stack_work_func().
> >
> > Of course this will need to decouple preparing freelist/skiplist and
> > actual free function.
>
> I thought about doing it this way, but I felt that it made the code
> more complex with little benefit. Yeah, we scan all tasks, but it only
> happens in a work queue that is grabbing the ftrace_lock mutex. If
> anything, I'd rather keep it this way and if it ends up being an issue we
> can change it later.
OK, then let's go with this in this version.
>
> One thing Thomas always says is "correctness first, optimize later".
> This is much easier to get correct. Adding a skip list will add
> complexity. Like I said, nothing prevents us from adding that feature
> later, and if it ends up buggy, we can know which change caused the bug.
It is not buggy as far as I can tell from my review; I was just concerned
about the performance overhead. So,
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thank you,
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>