Linux Trace Kernel
 help / color / mirror / Atom feed
* [PATCH] rethook: Use tsk->on_cpu to check task execution state
@ 2026-05-25 13:22 Tengda Wu
  2026-05-26  3:37 ` Masami Hiramatsu
  0 siblings, 1 reply; 10+ messages in thread
From: Tengda Wu @ 2026-05-25 13:22 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Alexei Starovoitov, linux-trace-kernel,
	linux-kernel, Tengda Wu

When a task calls schedule() to yield the CPU, its state remains
TASK_RUNNING, but its stack is frozen and safe to walk.

Replace task_is_running(tsk) with tsk->on_cpu to avoid overly
conservative rejections.

Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
---
 kernel/trace/rethook.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5a8bdf88999a..bd5e5f455e85 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;
 
-	if (tsk != current && task_is_running(tsk))
+	if (tsk != current && tsk->on_cpu)
 		return 0;
 
 	do {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-05-25 13:22 [PATCH] rethook: Use tsk->on_cpu to check task execution state Tengda Wu
@ 2026-05-26  3:37 ` Masami Hiramatsu
  2026-05-29  3:39   ` Tengda Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Masami Hiramatsu @ 2026-05-26  3:37 UTC (permalink / raw)
  To: Tengda Wu
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel

On Mon, 25 May 2026 21:22:53 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> When a task calls schedule() to yield the CPU, its state remains
> TASK_RUNNING, but its stack is frozen and safe to walk.
> 
> Replace task_is_running(tsk) with tsk->on_cpu to avoid overly
> conservative rejections.

Please see the Sashiko's comment.

https://sashiko.dev/#/patchset/20260525132253.1889726-1-wutengda%40huaweicloud.com

When calling Unwind on a task other than the current, IMHO, it is
the responsibility of the caller of this function to ensure that the
stack trace of that task is safe.
We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).

BTW, should task_on_cpu() use READ_ONCE() etc?
wait_task_inactive() seems a bit fragile.

Thanks,

> 
> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index 5a8bdf88999a..bd5e5f455e85 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>  	if (WARN_ON_ONCE(!cur))
>  		return 0;
>  
> -	if (tsk != current && task_is_running(tsk))
> +	if (tsk != current && tsk->on_cpu)
>  		return 0;
>  
>  	do {
> -- 
> 2.34.1
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-05-26  3:37 ` Masami Hiramatsu
@ 2026-05-29  3:39   ` Tengda Wu
  2026-05-31 23:40     ` Masami Hiramatsu
  0 siblings, 1 reply; 10+ messages in thread
From: Tengda Wu @ 2026-05-29  3:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel

Hi Masami,

thanks for the review and feedback.

On 2026/5/26 11:37, Masami Hiramatsu wrote:
> On Mon, 25 May 2026 21:22:53 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
> 
>> When a task calls schedule() to yield the CPU, its state remains
>> TASK_RUNNING, but its stack is frozen and safe to walk.
>>
>> Replace task_is_running(tsk) with tsk->on_cpu to avoid overly
>> conservative rejections.
> 
> Please see the Sashiko's comment.
> 
> https://sashiko.dev/#/patchset/20260525132253.1889726-1-wutengda%40huaweicloud.com
> 
> When calling Unwind on a task other than the current, IMHO, it is
> the responsibility of the caller of this function to ensure that the
> stack trace of that task is safe.

Agree.

> We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
> 
> BTW, should task_on_cpu() use READ_ONCE() etc?
> wait_task_inactive() seems a bit fragile.
> 
> Thanks,
> 

It seems that using task_on_cpu() is not necessary here because:

1. It requires an additional 'rq' parameter not available in the rethook context.
2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.

/* file: kernel/sched/sched.h */
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
	return p->on_cpu;
}

Given these constraints, staying with tsk->on_cpu seems more straightforward
for the rethook context.

Thanks,
Tengda

>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>> ---
>>  kernel/trace/rethook.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..bd5e5f455e85 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>  	if (WARN_ON_ONCE(!cur))
>>  		return 0;
>>  
>> -	if (tsk != current && task_is_running(tsk))
>> +	if (tsk != current && tsk->on_cpu)
>>  		return 0;
>>  
>>  	do {
>> -- 
>> 2.34.1
>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-05-29  3:39   ` Tengda Wu
@ 2026-05-31 23:40     ` Masami Hiramatsu
  2026-06-01  0:58       ` Tengda Wu
  2026-06-04  9:34       ` Peter Zijlstra
  0 siblings, 2 replies; 10+ messages in thread
From: Masami Hiramatsu @ 2026-05-31 23:40 UTC (permalink / raw)
  To: Tengda Wu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel

On Fri, 29 May 2026 11:39:49 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> > We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
> > 
> > BTW, should task_on_cpu() use READ_ONCE() etc?
> > wait_task_inactive() seems a bit fragile.
> > 
> > Thanks,
> > 
> 
> It seems that using task_on_cpu() is not necessary here because:
> 
> 1. It requires an additional 'rq' parameter not available in the rethook context.
> 2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.

OK. We may need to cleanup task_on_cpu() to remove unused rq, which
was historically introduced by commit 029632fbb7b7 ("sched: Make
separate sched*.c translation units") as a parameter for "task_running()"
because it called task_current(), but was cleaned by commit 0b9d46fc5ef7
("sched: Rename task_running() to task_on_cpu()") and now it is not
used.

> 
> /* file: kernel/sched/sched.h */
> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
> {
> 	return p->on_cpu;
> }
> 
> Given these constraints, staying with tsk->on_cpu seems more straightforward
> for the rethook context.

The reason I asked you to use a wrapper function instead of directly
accessing that on_cpu is to ensure that this part isn't overlooked when
changing the scheduler's behavior in the future.

They may be equivalent at this point in time, but that doesn't necessarily
mean they will remain so in the future.

IMHO, an API interface should represent the behavior and resources
required for that operation, and when accessing it from outside the
subsystem, we should use that API whenever possible. Otherwise, even
if we want to update the internal implementation of one subsystem,
we will have to make changes to other subsystems as well.

Peter, is it OK to drop @rq from task_on_cpu()? Then we can use
it from rethook.

Thank you,

> 
> Thanks,
> Tengda
> 
> >>
> >> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
> >> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
> >> ---
> >>  kernel/trace/rethook.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> >> index 5a8bdf88999a..bd5e5f455e85 100644
> >> --- a/kernel/trace/rethook.c
> >> +++ b/kernel/trace/rethook.c
> >> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
> >>  	if (WARN_ON_ONCE(!cur))
> >>  		return 0;
> >>  
> >> -	if (tsk != current && task_is_running(tsk))
> >> +	if (tsk != current && tsk->on_cpu)
> >>  		return 0;
> >>  
> >>  	do {
> >> -- 
> >> 2.34.1
> >>
> >>
> > 
> > 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-05-31 23:40     ` Masami Hiramatsu
@ 2026-06-01  0:58       ` Tengda Wu
  2026-06-04  9:34       ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: Tengda Wu @ 2026-06-01  0:58 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel



On 2026/6/1 7:40, Masami Hiramatsu wrote:
> On Fri, 29 May 2026 11:39:49 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
> 
>>> We also should not use tsk->on_cpu, but should use task_on_cpu(tsk).
>>>
>>> BTW, should task_on_cpu() use READ_ONCE() etc?
>>> wait_task_inactive() seems a bit fragile.
>>>
>>> Thanks,
>>>
>>
>> It seems that using task_on_cpu() is not necessary here because:
>>
>> 1. It requires an additional 'rq' parameter not available in the rethook context.
>> 2. It just returns p->on_cpu, which is identical to our current use of tsk->on_cpu.
> 
> OK. We may need to cleanup task_on_cpu() to remove unused rq, which
> was historically introduced by commit 029632fbb7b7 ("sched: Make
> separate sched*.c translation units") as a parameter for "task_running()"
> because it called task_current(), but was cleaned by commit 0b9d46fc5ef7
> ("sched: Rename task_running() to task_on_cpu()") and now it is not
> used.
> 
>>
>> /* file: kernel/sched/sched.h */
>> static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
>> {
>> 	return p->on_cpu;
>> }
>>
>> Given these constraints, staying with tsk->on_cpu seems more straightforward
>> for the rethook context.
> 
> The reason I asked you to use a wrapper function instead of directly
> accessing that on_cpu is to ensure that this part isn't overlooked when
> changing the scheduler's behavior in the future.
> 
> They may be equivalent at this point in time, but that doesn't necessarily
> mean they will remain so in the future.
> 
> IMHO, an API interface should represent the behavior and resources
> required for that operation, and when accessing it from outside the
> subsystem, we should use that API whenever possible. Otherwise, even
> if we want to update the internal implementation of one subsystem,
> we will have to make changes to other subsystems as well.
> 
> Peter, is it OK to drop @rq from task_on_cpu()? Then we can use
> it from rethook.
> 
> Thank you,
> 

Thank you for the detailed explanation -- that makes perfect sense.

I looked into task_on_cpu() and saw it has many callers.
Once it's confirmed that the parameter can be safely dropped, I'll
follow up with my patch to use the wrapper instead of directly
accessing on_cpu.

Best,
Tengda

>>>>
>>>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>>>> Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
>>>> ---
>>>>  kernel/trace/rethook.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>>>> index 5a8bdf88999a..bd5e5f455e85 100644
>>>> --- a/kernel/trace/rethook.c
>>>> +++ b/kernel/trace/rethook.c
>>>> @@ -250,7 +250,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>>>  	if (WARN_ON_ONCE(!cur))
>>>>  		return 0;
>>>>  
>>>> -	if (tsk != current && task_is_running(tsk))
>>>> +	if (tsk != current && tsk->on_cpu)
>>>>  		return 0;
>>>>  
>>>>  	do {
>>>> -- 
>>>> 2.34.1
>>>>
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-05-31 23:40     ` Masami Hiramatsu
  2026-06-01  0:58       ` Tengda Wu
@ 2026-06-04  9:34       ` Peter Zijlstra
  2026-06-05 13:43         ` Masami Hiramatsu
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2026-06-04  9:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Tengda Wu, Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel

On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:

> Peter, is it OK to drop @rq from task_on_cpu()? 

Sure.

> Then we can use it from rethook.

Well, it is in sched/sched.h, which is an internal header, and no you
cannot use that header in rethook.

But lets step back first, what is the actual problem here, why are we
looking at ->on_cpu at all?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-06-04  9:34       ` Peter Zijlstra
@ 2026-06-05 13:43         ` Masami Hiramatsu
  2026-06-08  1:52           ` Tengda Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Masami Hiramatsu @ 2026-06-05 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tengda Wu, Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel

On Thu, 4 Jun 2026 11:34:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
> 
> > Peter, is it OK to drop @rq from task_on_cpu()? 
> 
> Sure.
> 
> > Then we can use it from rethook.
> 
> Well, it is in sched/sched.h, which is an internal header, and no you
> cannot use that header in rethook.

Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
state?

> 
> But lets step back first, what is the actual problem here, why are we
> looking at ->on_cpu at all?

Tengda, can you explain it?
I think you want to take a stacktrace on !current process, and
rethook_find_ret_addr() is rejected i the task is running state.

But if you can share actual situation what you need, it is
helpful for us to understand.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-06-05 13:43         ` Masami Hiramatsu
@ 2026-06-08  1:52           ` Tengda Wu
  2026-06-08  2:56             ` Masami Hiramatsu
  0 siblings, 1 reply; 10+ messages in thread
From: Tengda Wu @ 2026-06-08  1:52 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel



On 2026/6/5 21:43, Masami Hiramatsu wrote:
> On Thu, 4 Jun 2026 11:34:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
>>
>>> Peter, is it OK to drop @rq from task_on_cpu()? 
>>
>> Sure.
>>
>>> Then we can use it from rethook.
>>
>> Well, it is in sched/sched.h, which is an internal header, and no you
>> cannot use that header in rethook.
> 
> Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
> state?
> 
>>
>> But lets step back first, what is the actual problem here, why are we
>> looking at ->on_cpu at all?
> 
> Tengda, can you explain it?
> I think you want to take a stacktrace on !current process, and
> rethook_find_ret_addr() is rejected i the task is running state.
> 
> But if you can share actual situation what you need, it is
> helpful for us to understand.
> 
> Thank you,
> 
> 


Sure.

Background: We are verifying the support of live patches for functions that
have a kretprobe. The specific verification method is as follows:

We construct a function foo() that calls bar():

void bar(void)
{
    for (;;) {
        schedule();
    }
}

void foo(void)
{
    bar();
}

A kretprobe is attached to bar():

echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable

Then foo() is triggered. The expected behavior is that bar() will call
schedule() and yield the CPU.

After that, the live patch is activated to attempt replacing the implementation
of foo(). The expectation is that this should succeed.

However, in reality, because the task that called schedule() is still in the
RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
is not satisfied, causing the function to return early. This, in turn,
prevents stack_trace_save_tsk_reliable() from determining the stack as
reliable, leading to a failure in activating the live patch.

**Not sure if this is correct:**

We believe that after a task voluntarily calls schedule(), when the stack
is expected to be reliable, it is a safe time to activate a live patch.
Additionally, a similar tsk->on_cpu check can be found elsewhere in the
kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
Therefore, we propose changing the task_is_running(tsk) condition to
tsk->on_cpu.

Thanks,
Tengda


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-06-08  1:52           ` Tengda Wu
@ 2026-06-08  2:56             ` Masami Hiramatsu
  2026-06-08  8:31               ` Tengda Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Masami Hiramatsu @ 2026-06-08  2:56 UTC (permalink / raw)
  To: Tengda Wu, Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Mathieu Desnoyers,
	Alexei Starovoitov, linux-trace-kernel, linux-kernel

On Mon, 8 Jun 2026 09:52:37 +0800
Tengda Wu <wutengda@huaweicloud.com> wrote:

> 
> 
> On 2026/6/5 21:43, Masami Hiramatsu wrote:
> > On Thu, 4 Jun 2026 11:34:45 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
> >>
> >>> Peter, is it OK to drop @rq from task_on_cpu()? 
> >>
> >> Sure.
> >>
> >>> Then we can use it from rethook.
> >>
> >> Well, it is in sched/sched.h, which is an internal header, and no you
> >> cannot use that header in rethook.
> > 
> > Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
> > state?
> > 
> >>
> >> But lets step back first, what is the actual problem here, why are we
> >> looking at ->on_cpu at all?
> > 
> > Tengda, can you explain it?
> > I think you want to take a stacktrace on !current process, and
> > rethook_find_ret_addr() is rejected i the task is running state.
> > 
> > But if you can share actual situation what you need, it is
> > helpful for us to understand.
> > 
> > Thank you,
> > 
> > 
> 
> 
> Sure.
> 
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
> 
> We construct a function foo() that calls bar():
> 
> void bar(void)
> {
>     for (;;) {
>         schedule();
>     }
> }
> 
> void foo(void)
> {
>     bar();
> }
> 
> A kretprobe is attached to bar():
> 
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
> 
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
> 
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
> 
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.

Hmm is the bar() doing infinite loop, or limited loop but take a long time
so just yield a while? Anyway, it seems like a non-good design pattern.
Is it possible to avoid busy loops and instead use Workers, or wait for
something to complete or for input within a loop?

> 
> **Not sure if this is correct:**
> 
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.

In this case, I don't know how to block the loop inside the bar.
Even if !tsk->on_cpu, the tsk can restart running right after checking
the flag.

> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.

Yes, but the caller said there is another check to ensure the race.

        /*                  
         * Refuse to unwind the stack of a task while it's executing on another
         * CPU.  This check is racy, but that's ok: the unwinder has other
         * checks to prevent it from going off the rails.
         */
        if (task_on_another_cpu(task))
                goto err;

Josh, do you know how this avoid the race case?

Thank you,

> 
> Thanks,
> Tengda
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
  2026-06-08  2:56             ` Masami Hiramatsu
@ 2026-06-08  8:31               ` Tengda Wu
  0 siblings, 0 replies; 10+ messages in thread
From: Tengda Wu @ 2026-06-08  8:31 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Josh Poimboeuf, Steven Rostedt, Mathieu Desnoyers,
	Alexei Starovoitov, linux-trace-kernel, linux-kernel



On 2026/6/8 10:56, Masami Hiramatsu wrote:
> On Mon, 8 Jun 2026 09:52:37 +0800
> Tengda Wu <wutengda@huaweicloud.com> wrote:
> 
>>
>>
>> On 2026/6/5 21:43, Masami Hiramatsu wrote:
>>> On Thu, 4 Jun 2026 11:34:45 +0200
>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Mon, Jun 01, 2026 at 08:40:01AM +0900, Masami Hiramatsu wrote:
>>>>
>>>>> Peter, is it OK to drop @rq from task_on_cpu()? 
>>>>
>>>> Sure.
>>>>
>>>>> Then we can use it from rethook.
>>>>
>>>> Well, it is in sched/sched.h, which is an internal header, and no you
>>>> cannot use that header in rethook.
>>>
>>> Ah, OK. Hmm, then we should not use it. Maybe ->on_cpu is also internal
>>> state?
>>>
>>>>
>>>> But lets step back first, what is the actual problem here, why are we
>>>> looking at ->on_cpu at all?
>>>
>>> Tengda, can you explain it?
>>> I think you want to take a stacktrace on !current process, and
>>> rethook_find_ret_addr() is rejected i the task is running state.
>>>
>>> But if you can share actual situation what you need, it is
>>> helpful for us to understand.
>>>
>>> Thank you,
>>>
>>>
>>
>>
>> Sure.
>>
>> Background: We are verifying the support of live patches for functions that
>> have a kretprobe. The specific verification method is as follows:
>>
>> We construct a function foo() that calls bar():
>>
>> void bar(void)
>> {
>>     for (;;) {
>>         schedule();
>>     }
>> }
>>
>> void foo(void)
>> {
>>     bar();
>> }
>>
>> A kretprobe is attached to bar():
>>
>> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
>> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>>
>> Then foo() is triggered. The expected behavior is that bar() will call
>> schedule() and yield the CPU.
>>
>> After that, the live patch is activated to attempt replacing the implementation
>> of foo(). The expectation is that this should succeed.
>>
>> However, in reality, because the task that called schedule() is still in the
>> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
>> is not satisfied, causing the function to return early. This, in turn,
>> prevents stack_trace_save_tsk_reliable() from determining the stack as
>> reliable, leading to a failure in activating the live patch.
> 
> Hmm is the bar() doing infinite loop, or limited loop but take a long time
> so just yield a while? Anyway, it seems like a non-good design pattern.
> Is it possible to avoid busy loops and instead use Workers, or wait for
> something to complete or for input within a loop?
> 
>>
>> **Not sure if this is correct:**
>>
>> We believe that after a task voluntarily calls schedule(), when the stack
>> is expected to be reliable, it is a safe time to activate a live patch.
> 
> In this case, I don't know how to block the loop inside the bar.
> Even if !tsk->on_cpu, the tsk can restart running right after checking
> the flag.
> 

The infinite loop in bar() is indeed a poor design pattern. This test
case is only artificial, not from real-world code. It is merely
intended to verify live patch support for various cases.

However, the point you raised has indeed made me think. I realize that
checking only tsk->on_cpu is not sufficient -- there is also a race
condition where the task could be scheduled back onto a CPU right after
the check. I need to re-examine the validity of this test case and
whether it represents a safe live patch activation scenario.

Thank you again for your patience and for pointing out these
fundamental issues. Your guidance is much appreciated.

Best regards,
Tengda


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-06-08  8:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25 13:22 [PATCH] rethook: Use tsk->on_cpu to check task execution state Tengda Wu
2026-05-26  3:37 ` Masami Hiramatsu
2026-05-29  3:39   ` Tengda Wu
2026-05-31 23:40     ` Masami Hiramatsu
2026-06-01  0:58       ` Tengda Wu
2026-06-04  9:34       ` Peter Zijlstra
2026-06-05 13:43         ` Masami Hiramatsu
2026-06-08  1:52           ` Tengda Wu
2026-06-08  2:56             ` Masami Hiramatsu
2026-06-08  8:31               ` Tengda Wu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox