* Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()
[not found] <20260609084953.901576-1-wutengda@huaweicloud.com>
@ 2026-06-09 9:43 ` Petr Mladek
2026-06-09 11:12 ` Tengda Wu
0 siblings, 1 reply; 3+ messages in thread
From: Petr Mladek @ 2026-06-09 9:43 UTC (permalink / raw)
To: Tengda Wu
Cc: Masami Hiramatsu, Peter Zijlstra, Steven Rostedt,
Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
linux-kernel, live-patching
Added live-patching mailing list.
On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
> The current check in rethook_find_ret_addr() prevents obtaining a return
> address when the target task is marked as running. However, this condition
> is both insufficient for correctness and unnecessary for its intended
> purpose.
>
> The check is inherently racy: a task can begin running on another CPU
> immediately after task_is_running() returns false, potentially leading to
> concurrent modification of rethook data structures while the iteration is
> in progress.
>
> Rather than trying to fix this unreliable check deep in the unwinding
> path, simply remove it. The iteration is already safe from crashes because
> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
> even if the iteration goes off the rails and returns invalid information,
> it will not crash. Callers that require consistency must provide a safe
> context themselves.
>
> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> ---
> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/
>
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> if (WARN_ON_ONCE(!cur))
> return 0;
>
> - if (tsk != current && task_is_running(tsk))
> - return 0;
> -
The description of the function should be updated as well. It still
mentions:
* The @tsk must be 'current' or a task which is not running.
Instead it should explain that it safe to call the function even
on another running tasks but the returned address is not reliable
then.
> do {
> ret = __rethook_find_ret_addr(tsk, cur);
> if (!ret)
I am still a bit concerned about the motivation.
Tengda mentioned at
https://lore.kernel.org/all/679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@huaweicloud.com/
that they tried to verify livepatching:
<paste>
Background: We are verifying the support of live patches for functions that
have a kretprobe. The specific verification method is as follows:
We construct a function foo() that calls bar():
void bar(void)
{
for (;;) {
schedule();
}
}
void foo(void)
{
bar();
}
A kretprobe is attached to bar():
echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
Then foo() is triggered. The expected behavior is that bar() will call
schedule() and yield the CPU.
After that, the live patch is activated to attempt replacing the implementation
of foo(). The expectation is that this should succeed.
However, in reality, because the task that called schedule() is still in the
RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
is not satisfied, causing the function to return early. This, in turn,
prevents stack_trace_save_tsk_reliable() from determining the stack as
reliable, leading to a failure in activating the live patch.
**Not sure if this is correct:**
We believe that after a task voluntarily calls schedule(), when the stack
is expected to be reliable, it is a safe time to activate a live patch.
Additionally, a similar tsk->on_cpu check can be found elsewhere in the
kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
Therefore, we propose changing the task_is_running(tsk) condition to
tsk->on_cpu.
</paste>
More background:
----------------
The test is artificial because it keeps the RUNNING state before
calling schedule, see
https://lore.kernel.org/all/20260608093449.GH4149641@noisy.programming.kicks-ass.net/
My questions:
Does this patch allows to livepatch the above mentioned test code?
Is the livepatching safe?
Does it help in another scenarios?
My opinion:
The livepatching might be safe only when the process is migrating
itself. I mean that it might be safe even when it is RUNNING as long
at it is _current_.
I agree that we do not need to enforce this in rethook_find_ret_addr()
if the function is used also in other scenarios, for example, by
ftrace/BTF for taking snapshots of other processes.
But we need to make sure that the backtrace is reliable when
livepatching (migrating) the task.
Best Regards,
Petr
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()
2026-06-09 9:43 ` [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr() Petr Mladek
@ 2026-06-09 11:12 ` Tengda Wu
2026-06-10 8:29 ` Masami Hiramatsu
0 siblings, 1 reply; 3+ messages in thread
From: Tengda Wu @ 2026-06-09 11:12 UTC (permalink / raw)
To: Petr Mladek
Cc: Masami Hiramatsu, Peter Zijlstra, Steven Rostedt,
Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
linux-kernel, live-patching
On 2026/6/9 17:43, Petr Mladek wrote:
> Added live-patching mailing list.
>
> On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
>> The current check in rethook_find_ret_addr() prevents obtaining a return
>> address when the target task is marked as running. However, this condition
>> is both insufficient for correctness and unnecessary for its intended
>> purpose.
>>
>> The check is inherently racy: a task can begin running on another CPU
>> immediately after task_is_running() returns false, potentially leading to
>> concurrent modification of rethook data structures while the iteration is
>> in progress.
>>
>> Rather than trying to fix this unreliable check deep in the unwinding
>> path, simply remove it. The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
>> even if the iteration goes off the rails and returns invalid information,
>> it will not crash. Callers that require consistency must provide a safe
>> context themselves.
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>> ---
>> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
>> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
>> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/
>>
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> - return 0;
>> -
>
> The description of the function should be updated as well. It still
> mentions:
>
> * The @tsk must be 'current' or a task which is not running.
>
> Instead it should explain that it safe to call the function even
> on another running tasks but the returned address is not reliable
> then.
>
Oh, I forgot that. Thanks for pointing it out.
>> do {
>> ret = __rethook_find_ret_addr(tsk, cur);
>> if (!ret)
>
> I am still a bit concerned about the motivation.
>
> Tengda mentioned at
> https://lore.kernel.org/all/679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@huaweicloud.com/
> that they tried to verify livepatching:
>
> <paste>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
>
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
> </paste>
>
> More background:
> ----------------
>
> The test is artificial because it keeps the RUNNING state before
> calling schedule, see
> https://lore.kernel.org/all/20260608093449.GH4149641@noisy.programming.kicks-ass.net/
>
>
>
> My questions:
>
> Does this patch allows to livepatch the above mentioned test code?
At least it will no longer be blocked by the rethook check.
> Is the livepatching safe?
> Does it help in another scenarios?
>
>
> My opinion:
>
> The livepatching might be safe only when the process is migrating
> itself. I mean that it might be safe even when it is RUNNING as long
> at it is _current_.
>
> I agree that we do not need to enforce this in rethook_find_ret_addr()
> if the function is used also in other scenarios, for example, by
> ftrace/BTF for taking snapshots of other processes.
>
> But we need to make sure that the backtrace is reliable when
> livepatching (migrating) the task.
>
> Best Regards,
> Petr
Peter actually touched on this point earlier:
<paste>
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
</paste>
From my understanding, removing this check should not introduce
stack reliability issues for livepatch.
Thanks,
Tengda
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()
2026-06-09 11:12 ` Tengda Wu
@ 2026-06-10 8:29 ` Masami Hiramatsu
0 siblings, 0 replies; 3+ messages in thread
From: Masami Hiramatsu @ 2026-06-10 8:29 UTC (permalink / raw)
To: Tengda Wu
Cc: Petr Mladek, Masami Hiramatsu, Peter Zijlstra, Steven Rostedt,
Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
linux-kernel, live-patching
On Tue, 9 Jun 2026 19:12:41 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:
>
>
> On 2026/6/9 17:43, Petr Mladek wrote:
> > Added live-patching mailing list.
> >
> > On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
> >> The current check in rethook_find_ret_addr() prevents obtaining a return
> >> address when the target task is marked as running. However, this condition
> >> is both insufficient for correctness and unnecessary for its intended
> >> purpose.
> >>
> >> The check is inherently racy: a task can begin running on another CPU
> >> immediately after task_is_running() returns false, potentially leading to
> >> concurrent modification of rethook data structures while the iteration is
> >> in progress.
> >>
> >> Rather than trying to fix this unreliable check deep in the unwinding
> >> path, simply remove it. The iteration is already safe from crashes because
> >> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
> >> even if the iteration goes off the rails and returns invalid information,
> >> it will not crash. Callers that require consistency must provide a safe
> >> context themselves.
> >>
> >> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> >> ---
> >> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
> >> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
> >> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/
> >>
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> >> if (WARN_ON_ONCE(!cur))
> >> return 0;
> >>
> >> - if (tsk != current && task_is_running(tsk))
> >> - return 0;
> >> -
> >
> > The description of the function should be updated as well. It still
> > mentions:
> >
> > * The @tsk must be 'current' or a task which is not running.
> >
> > Instead it should explain that it safe to call the function even
> > on another running tasks but the returned address is not reliable
> > then.
> >
>
> Oh, I forgot that. Thanks for pointing it out.
Yeah, but it should be updated to explain what you need to do.
For example call it should hold RCU, or use for current.
Thanks,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-06-10 8:29 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260609084953.901576-1-wutengda@huaweicloud.com>
2026-06-09 9:43 ` [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr() Petr Mladek
2026-06-09 11:12 ` Tengda Wu
2026-06-10 8:29 ` Masami Hiramatsu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox