[PATCH v4] rethook: Remove the running task check in rethook_find_ret

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
@ 2026-06-10  1:36 Tengda Wu
  2026-06-10  1:50 ` sashiko-bot
  0 siblings, 1 reply; 5+ messages in thread
From: Tengda Wu @ 2026-06-10  1:36 UTC (permalink / raw)
  To: Masami Hiramatsu, Peter Zijlstra, Petr Mladek
  Cc: Steven Rostedt, Mathieu Desnoyers, Alexei Starovoitov,
	linux-trace-kernel, linux-kernel, live-patching, Tengda Wu

The current check in rethook_find_ret_addr() prevents obtaining a return
address when the target task is marked as running. However, this condition
is both insufficient for correctness and unnecessary for its intended
purpose.

The check is inherently racy: a task can begin running on another CPU
immediately after task_is_running() returns false, potentially leading to
concurrent modification of rethook data structures while the iteration is
in progress.

Rather than trying to fix this unreliable check deep in the unwinding
path, simply remove it. The iteration is already safe from crashes because
unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
even if the iteration goes off the rails and returns invalid information,
it will not crash. Callers that require consistency must provide a safe
context themselves.

Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
---
v4: Also update the function description in the comment.
v3: https://lore.kernel.org/all/20260609084953.901576-1-wutengda@huaweicloud.com/
v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@huaweicloud.com/
v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@huaweicloud.com/

 kernel/trace/rethook.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 5a8bdf88999a..1e7fdebe3cd5 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -233,9 +233,10 @@ NOKPROBE_SYMBOL(__rethook_find_ret_addr);
  *
  * Find the correct return address modified by a rethook on @tsk in unsigned
  * long type.
- * The @tsk must be 'current' or a task which is not running. @frame is a hint
- * to get the currect return address - which is compared with the
- * rethook::frame field. The @cur is a loop cursor for searching the
+ * @tsk can be any task (any state). If not 'current', the result may be
+ * unreliable. Callers requiring reliability must ensure a safe context.
+ * @frame is a hint to get the correct return address - which is compared with
+ * the rethook::frame field. The @cur is a loop cursor for searching the
  * kretprobe return addresses on the @tsk. The '*@cur' should be NULL at the
  * first call, but '@cur' itself must NOT NULL.
  *
@@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;
 
-	if (tsk != current && task_is_running(tsk))
-		return 0;
-
 	do {
 		ret = __rethook_find_ret_addr(tsk, cur);
 		if (!ret)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
  2026-06-10  1:36 [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr() Tengda Wu
@ 2026-06-10  1:50 ` sashiko-bot
  2026-06-11 15:53   ` XIAO WU
  0 siblings, 1 reply; 5+ messages in thread
From: sashiko-bot @ 2026-06-10  1:50 UTC (permalink / raw)
  To: Masami Hiramatsu, Petr Mladek, Peter Zijlstra
  Cc: Tengda Wu, Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Removing `task_is_running()` from `rethook_find_ret_addr()` exposes stack unwinders to use-after-free, wild pointer dereferences, and list corruption when unwinding running tasks, because `rethook_node` recycling and unwinding lack adequate RCU synchronization.
--

commit 83e1fac9c64377191cc2b41ef9da781c46f1e95a
Author: Tengda Wu <wutengda@huaweicloud.com>

rethook: Remove the running task check in rethook_find_ret_addr()

This commit removes the task_is_running() check from rethook_find_ret_addr().
The commit message states the check is unnecessary because unwinders hold RCU
and rethook_node structures are RCU-freed. Callers are expected to provide
their own safe context.

> The iteration is already safe from crashes because
> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
> even if the iteration goes off the rails and returns invalid information,
> it will not crash. Callers that require consistency must provide a safe
> context themselves.

[ ... ]

> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index 5a8bdf88999a9..1e7fdebe3cd5d 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
[ ... ]
> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>  	if (WARN_ON_ONCE(!cur))
>  		return 0;
>  
> -	if (tsk != current && task_is_running(tsk))
> -		return 0;
> -
>  	do {
>  		ret = __rethook_find_ret_addr(tsk, cur);

[Severity: High]
This is a pre-existing issue, but does removing this check entirely expose
the stack unwinders to a use-after-free and list corruption?

The commit message mentions that rethook_node structures are RCU-freed, but
looking at rethook_recycle(), the nodes appear to be immediately pushed back
to the lockless objpool without an RCU grace period:

kernel/trace/rethook.c:rethook_recycle() {
    ...
    objpool_push(node, &node->rethook->pool);
}

Additionally, while unwind_next_frame() does hold RCU for a single frame
iteration, stack unwinders like x86 ORC drop the read lock between frames:

arch/x86/kernel/unwind_orc.c:unwind_next_frame() {
    ...
    guard(rcu)();
    ...
}

Because the unwinder retains the list cursor (*cur) across frame iterations
while RCU is dropped, could unwinding a running task allow the target task
to recycle the node?

If the node is recycled and instantly reallocated to another task, the next
call to __rethook_find_ret_addr() would execute:

kernel/trace/rethook.c:__rethook_find_ret_addr() {
    ...
    node = node->next;

on a stale pointer. Could removing this check turn a theoretical race into
an easily triggered use-after-free when reading /proc/<pid>/stack for tasks
executing kretprobes?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda@huaweicloud.com?part=1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
  2026-06-10  1:50 ` sashiko-bot
@ 2026-06-11 15:53   ` XIAO WU
  2026-06-12  1:54     ` Tengda Wu
  2026-06-12 12:20     ` Masami Hiramatsu
  0 siblings, 2 replies; 5+ messages in thread
From: XIAO WU @ 2026-06-11 15:53 UTC (permalink / raw)
  To: sashiko-reviews, Masami Hiramatsu, Petr Mladek, Peter Zijlstra
  Cc: Tengda Wu, Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching

Hi Tengda,

Sashiko [1] reviewed this patch and found that removing the
task_is_running() check exposes stack unwinders to real crashes — not
just "invalid information."  A PoC confirms this: a KASAN panic triggers
within seconds when /proc/<pid>/stack reads the stack of a task that is
concurrently running a kretprobe.

[1] 
https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com

 > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
 > index 5a8bdf88999a..1e7fdebe3cd5 100644
 > --- a/kernel/trace/rethook.c
 > +++ b/kernel/trace/rethook.c
 > @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct 
task_struct *tsk, unsigned long frame
 >      if (WARN_ON_ONCE(!cur))
 >          return 0;
 >
 > -    if (tsk != current && task_is_running(tsk))
 > -        return 0;
 > -
 >      do {
 >          ret = __rethook_find_ret_addr(tsk, cur);
 >          if (!ret)

The commit message states:

 > The iteration is already safe from crashes because
 > unwind_next_frame() holds RCU and rethook_node structures are
 > RCU-freed; even if the iteration goes off the rails and returns
 > invalid information, it will not crash.

There are two problems with this claim, both reproducible.

**Problem 1: stack-out-of-bounds in unwind_next_frame itself**

The PoC below reliably triggers the following KASAN panic — not in the
rethook list traversal, but inside unwind_next_frame():

[ 1833.494623] BUG: KASAN: stack-out-of-bounds in 
unwind_next_frame+0x861/0x2080
[ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
[ 1833.494707] Call Trace:
[ 1833.494719]  dump_stack_lvl+0x116/0x1f0
[ 1833.494743]  print_report+0xf4/0x600
[ 1833.494788]  kasan_report+0xe0/0x110
[ 1833.494836]  unwind_next_frame+0x861/0x2080
[ 1833.494948]  arch_stack_walk+0x99/0x100
[ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
[ 1833.495054]  proc_pid_stack+0x173/0x2b0
[ 1833.495103]  seq_read_iter+0x519/0x12d0
[ 1833.495166]  seq_read+0x3b7/0x590
[ 1833.495297]  vfs_read+0x1f5/0xd20
[ 1833.495497]  ksys_read+0x135/0x250
[ 1833.495549]  do_syscall_64+0x129/0x850
[ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...

page last free pid 9737 tgid 9737 stack trace:
  do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
  __x64_sys_openat+0x179/0x210

This crash has nothing to do with rethook_node lifetimes or RCU.  It
happens because the ORC unwinder reads stack memory while the target
task concurrently executes a kretprobe trampoline that modifies return
addresses.  The unwinder follows corrupted frame data past valid stack
boundaries.  RCU protection of rethook_node structures is irrelevant —
this crash occurs at the stack frame interpretation level, before any
rethook list traversal.

The old task_is_running() check prevented the unwinder from attempting
to unwind a running task's stack in the first place.

**Problem 2: use-after-free via rethook_node recycling**

Even if the stack-out-of-bounds above were addressed, a second crash
path exists in the rethook list traversal itself.

rethook_recycle() immediately pushes nodes back to the objpool without
an RCU grace period:

   kernel/trace/rethook.c:
   void rethook_recycle(struct rethook_node *node)
   {
           ...
           objpool_push(node, &node->rethook->pool);
   }

Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
RCU between frames while the cursor (*cur) persists across iterations:

   arch/x86/kernel/unwind_orc.c:
   bool unwind_next_frame(...)
   {
           ...
           guard(rcu)();    // RCU held for one frame
           ...
   }                        // RCU dropped here

When the unwinder calls __rethook_find_ret_addr() in the next frame
iteration, it does:

   struct llist_node *first = tsk->rethooks.first;
   ...
   *cur = first;
   ...
   node = node->next;       // node may have been recycled

If the target task returns from a probed function between frames, its
rethook_node is recycled and can be instantly reallocated to another
task.  The unwinder's stale cursor then dereferences a freed pointer,
leading to use-after-free.

## Reproducer

The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
calling open(), and concurrently reads /proc/<tid>/stack.  The race
triggers within seconds (Problem 1 above; Problem 2 may reproduce on
kernels without KASAN or with different timing).

Build:  gcc -static -pthread -o poc poc.c
Run:    ./poc [runtime_seconds]
Needs:  root, CONFIG_KASAN=y

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sched.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <pthread.h>
#include <dirent.h>

#define TRACE "/sys/kernel/tracing"

volatile int stop = 0;

static int tfs(const char *f, const char *b)
{
     char p[256]; int fd, r;
     snprintf(p, 256, "%s/%s", TRACE, f);
     fd = open(p, O_WRONLY | O_TRUNC);
     if (fd < 0) {
         system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
         usleep(50000);
         fd = open(p, O_WRONLY | O_TRUNC);
     }
     if (fd < 0) return -1;
     r = write(fd, b, strlen(b));
     close(fd);
     return r < 0 ? -1 : 0;
}

void *hot_thread(void *arg)
{
     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
         int fd = open("/dev/null", O_RDONLY);
         if (fd >= 0) close(fd);
     }
     return NULL;
}

void *reader_thread(void *arg)
{
     pid_t target = *(pid_t *)arg;
     char path[64], buf[8192];
     snprintf(path, 64, "/proc/%d/stack", target);
     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
         int fd = open(path, O_RDONLY);
         if (fd >= 0) { read(fd, buf, 8191); close(fd); }
     }
     return NULL;
}

void sigh(int s) { stop = 1; }

int main(int argc, char *argv[])
{
     int runtime = 120;
     if (argc > 1) runtime = atoi(argv[1]);

     printf("rethook race PoC\n");
     if (geteuid()) { printf("root needed\n"); return 1; }
     signal(SIGINT, sigh);

     pthread_t hot[4], rdr[4];
     pid_t hot_tids[4];
     int pairs = 4;

     for (int c = 0; c < runtime / 5 && !stop; c++) {
         tfs("events/kprobes/myretprobe/enable", "0");
         tfs("kprobe_events", "-:myretprobe");
         usleep(100);
         tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
         tfs("events/kprobes/myretprobe/enable", "1");

         pid_t main_tid = syscall(SYS_gettid);

         for (int i = 0; i < pairs; i++)
             pthread_create(&hot[i], NULL, hot_thread, NULL);

         usleep(300000);

         {
             DIR *d = opendir("/proc/self/task");
             int cnt = 0;
             if (d) {
                 struct dirent *de;
                 while ((de = readdir(d)) != NULL && cnt < pairs) {
                     pid_t t = atoi(de->d_name);
                     if (t > 0 && t != main_tid)
                         hot_tids[cnt++] = t;
                 }
                 closedir(d);
             }
             for (int i = 0; i < cnt; i++)
                 pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
         }

         printf("round %d\n", c);
         sleep(5);

         stop = 1;
         usleep(100000);

         for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
         for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);

         stop = 0;
         usleep(1000);
     }

     tfs("events/kprobes/myretprobe/enable", "0");
     tfs("kprobe_events", "-:myretprobe");
     printf("Done\n");
     return 0;
}

## Summary

The v4 commit message claims the iteration "will not crash," but the PoC
demonstrates a reproducible KASAN panic:

1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
    concurrently-modified stack frames of a running task)

2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
    recycled without RCU grace period, cursor persists across RCU drops)

The old task_is_running() check was racy but served as a practical
safety net.  Removing it without adding equivalent protection in the
callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
panics via /proc/<pid>/stack on any task running a kretprobe.

Thanks,
Xiao


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
  2026-06-11 15:53   ` XIAO WU
@ 2026-06-12  1:54     ` Tengda Wu
  2026-06-12 12:20     ` Masami Hiramatsu
  1 sibling, 0 replies; 5+ messages in thread
From: Tengda Wu @ 2026-06-12  1:54 UTC (permalink / raw)
  To: XIAO WU, sashiko-reviews, Masami Hiramatsu, Petr Mladek,
	Peter Zijlstra
  Cc: Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching

Hi Xiao,

Thank you very much for your detailed analysis and verification.

On 2026/6/11 23:53, XIAO WU wrote:
> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.
> 
> [1] https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..1e7fdebe3cd5 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>      if (WARN_ON_ONCE(!cur))
>>          return 0;
>>
>> -    if (tsk != current && task_is_running(tsk))
>> -        return 0;
>> -
>>      do {
>>          ret = __rethook_find_ret_addr(tsk, cur);
>>          if (!ret)
> 
> The commit message states:
> 
>> The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are
>> RCU-freed; even if the iteration goes off the rails and returns
>> invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>  do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>  __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.
> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 

Yes, I was able to reproduce the issue locally using your PoC. The problem
does exist as you described. I need to take a deeper look and figure out how
to properly fix it.

> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>   kernel/trace/rethook.c:
>   void rethook_recycle(struct rethook_node *node)
>   {
>           ...
>           objpool_push(node, &node->rethook->pool);
>   }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>   arch/x86/kernel/unwind_orc.c:
>   bool unwind_next_frame(...)
>   {
>           ...
>           guard(rcu)();    // RCU held for one frame
>           ...
>   }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>   struct llist_node *first = tsk->rethooks.first;
>   ...
>   *cur = first;
>   ...
>   node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.
> 

Yes, Sashiko also pointed this out. You have opened this issue for further
analysis to clarify its fault model. It appears to be a pre-existing issue
that may require a separate patch to resolve.

> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>     char p[256]; int fd, r;
>     snprintf(p, 256, "%s/%s", TRACE, f);
>     fd = open(p, O_WRONLY | O_TRUNC);
>     if (fd < 0) {
>         system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>         usleep(50000);
>         fd = open(p, O_WRONLY | O_TRUNC);
>     }
>     if (fd < 0) return -1;
>     r = write(fd, b, strlen(b));
>     close(fd);
>     return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open("/dev/null", O_RDONLY);
>         if (fd >= 0) close(fd);
>     }
>     return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>     pid_t target = *(pid_t *)arg;
>     char path[64], buf[8192];
>     snprintf(path, 64, "/proc/%d/stack", target);
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open(path, O_RDONLY);
>         if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>     }
>     return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>     int runtime = 120;
>     if (argc > 1) runtime = atoi(argv[1]);
> 
>     printf("rethook race PoC\n");
>     if (geteuid()) { printf("root needed\n"); return 1; }
>     signal(SIGINT, sigh);
> 
>     pthread_t hot[4], rdr[4];
>     pid_t hot_tids[4];
>     int pairs = 4;
> 
>     for (int c = 0; c < runtime / 5 && !stop; c++) {
>         tfs("events/kprobes/myretprobe/enable", "0");
>         tfs("kprobe_events", "-:myretprobe");
>         usleep(100);
>         tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>         tfs("events/kprobes/myretprobe/enable", "1");
> 
>         pid_t main_tid = syscall(SYS_gettid);
> 
>         for (int i = 0; i < pairs; i++)
>             pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>         usleep(300000);
> 
>         {
>             DIR *d = opendir("/proc/self/task");
>             int cnt = 0;
>             if (d) {
>                 struct dirent *de;
>                 while ((de = readdir(d)) != NULL && cnt < pairs) {
>                     pid_t t = atoi(de->d_name);
>                     if (t > 0 && t != main_tid)
>                         hot_tids[cnt++] = t;
>                 }
>                 closedir(d);
>             }
>             for (int i = 0; i < cnt; i++)
>                 pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>         }
> 
>         printf("round %d\n", c);
>         sleep(5);
> 
>         stop = 1;
>         usleep(100000);
> 
>         for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>         for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>         stop = 0;
>         usleep(1000);
>     }
> 
>     tfs("events/kprobes/myretprobe/enable", "0");
>     tfs("kprobe_events", "-:myretprobe");
>     printf("Done\n");
>     return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>    concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>    recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.

Once again, I truly appreciate your thorough review and testing.
I'm not sure if I can fully resolve these issues, but if I succeed, I will
send out a v5.

Best regards,
Tengda


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
  2026-06-11 15:53   ` XIAO WU
  2026-06-12  1:54     ` Tengda Wu
@ 2026-06-12 12:20     ` Masami Hiramatsu
  1 sibling, 0 replies; 5+ messages in thread
From: Masami Hiramatsu @ 2026-06-12 12:20 UTC (permalink / raw)
  To: XIAO WU
  Cc: sashiko-reviews, Petr Mladek, Peter Zijlstra, Tengda Wu,
	Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching

On Thu, 11 Jun 2026 23:53:36 +0800
XIAO WU <xiaowu.417@qq.com> wrote:

> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.

Hmm, why /proc/<pid>/stack unwind stack so unreliable way...
That should stop the target process, because it is exposed to
userspace. Thus it should work as safe as possible.

Anyway, thanks for reporting with the test program.

> 
> [1] 
> https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>  > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>  > index 5a8bdf88999a..1e7fdebe3cd5 100644
>  > --- a/kernel/trace/rethook.c
>  > +++ b/kernel/trace/rethook.c
>  > @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct 
> task_struct *tsk, unsigned long frame
>  >      if (WARN_ON_ONCE(!cur))
>  >          return 0;
>  >
>  > -    if (tsk != current && task_is_running(tsk))
>  > -        return 0;
>  > -
>  >      do {
>  >          ret = __rethook_find_ret_addr(tsk, cur);
>  >          if (!ret)
> 
> The commit message states:
> 
>  > The iteration is already safe from crashes because
>  > unwind_next_frame() holds RCU and rethook_node structures are
>  > RCU-freed; even if the iteration goes off the rails and returns
>  > invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in 
> unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>   do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>   __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.

OK, in that case, I think we should not allow list traversal.
I think without freezing the target task, accessing /proc/<pid>/stack
is potentially dangerous. Shouldn't we fix this at first?

> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 
> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>    kernel/trace/rethook.c:
>    void rethook_recycle(struct rethook_node *node)
>    {
>            ...
>            objpool_push(node, &node->rethook->pool);
>    }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>    arch/x86/kernel/unwind_orc.c:
>    bool unwind_next_frame(...)
>    {
>            ...
>            guard(rcu)();    // RCU held for one frame
>            ...
>    }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>    struct llist_node *first = tsk->rethooks.first;
>    ...
>    *cur = first;
>    ...
>    node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.


OK, this is still real problem. We should use call_rcu() to return
the object back to objpool.

Thanks!

> 
> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>      char p[256]; int fd, r;
>      snprintf(p, 256, "%s/%s", TRACE, f);
>      fd = open(p, O_WRONLY | O_TRUNC);
>      if (fd < 0) {
>          system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>          usleep(50000);
>          fd = open(p, O_WRONLY | O_TRUNC);
>      }
>      if (fd < 0) return -1;
>      r = write(fd, b, strlen(b));
>      close(fd);
>      return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open("/dev/null", O_RDONLY);
>          if (fd >= 0) close(fd);
>      }
>      return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>      pid_t target = *(pid_t *)arg;
>      char path[64], buf[8192];
>      snprintf(path, 64, "/proc/%d/stack", target);
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open(path, O_RDONLY);
>          if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>      }
>      return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>      int runtime = 120;
>      if (argc > 1) runtime = atoi(argv[1]);
> 
>      printf("rethook race PoC\n");
>      if (geteuid()) { printf("root needed\n"); return 1; }
>      signal(SIGINT, sigh);
> 
>      pthread_t hot[4], rdr[4];
>      pid_t hot_tids[4];
>      int pairs = 4;
> 
>      for (int c = 0; c < runtime / 5 && !stop; c++) {
>          tfs("events/kprobes/myretprobe/enable", "0");
>          tfs("kprobe_events", "-:myretprobe");
>          usleep(100);
>          tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>          tfs("events/kprobes/myretprobe/enable", "1");
> 
>          pid_t main_tid = syscall(SYS_gettid);
> 
>          for (int i = 0; i < pairs; i++)
>              pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>          usleep(300000);
> 
>          {
>              DIR *d = opendir("/proc/self/task");
>              int cnt = 0;
>              if (d) {
>                  struct dirent *de;
>                  while ((de = readdir(d)) != NULL && cnt < pairs) {
>                      pid_t t = atoi(de->d_name);
>                      if (t > 0 && t != main_tid)
>                          hot_tids[cnt++] = t;
>                  }
>                  closedir(d);
>              }
>              for (int i = 0; i < cnt; i++)
>                  pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>          }
> 
>          printf("round %d\n", c);
>          sleep(5);
> 
>          stop = 1;
>          usleep(100000);
> 
>          for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>          for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>          stop = 0;
>          usleep(1000);
>      }
> 
>      tfs("events/kprobes/myretprobe/enable", "0");
>      tfs("kprobe_events", "-:myretprobe");
>      printf("Done\n");
>      return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>     concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>     recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.
> 
> Thanks,
> Xiao
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-12 12:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10  1:36 [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr() Tengda Wu
2026-06-10  1:50 ` sashiko-bot
2026-06-11 15:53   ` XIAO WU
2026-06-12  1:54     ` Tengda Wu
2026-06-12 12:20     ` Masami Hiramatsu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox