linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Crash when attaching uretprobes to processes running in Docker
@ 2025-01-10 15:12 Eyal Birger
  2025-01-10 15:25 ` Aleksa Sarai
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-10 15:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: olsajiri, mhiramat, oleg, linux-kernel, linux-trace-kernel,
	BPF-dev-list, Song Liu, Yonghong Song, John Fastabend, peterz,
	tglx, bp, x86, linux-api, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov, Andrii Nakryiko, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

Hi,

When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe. The offending commit
is:

ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")

To my understanding, the reason is that now that uretprobe is a system call,
the default seccomp filters in docker block it as they only allow a specific
set of known syscalls.

This behavior can be reproduced by the below bash script, which works before
this commit.

Reported-by: Rafael Buchbinder <rafi@rbk.io>

Eyal.

--- CODE ---
#!/bin/bash

cat > /tmp/x.c << EOF
#include <stdio.h>
#include <seccomp.h>

char *syscalls[] = {
"write",
"exit_group",
};

__attribute__((noinline)) int probed(void)
{
printf("Probed\n");
return 1;
}

void apply_seccomp_filter(char **syscalls, int num_syscalls)
{
scmp_filter_ctx ctx;

ctx = seccomp_init(SCMP_ACT_ERRNO(1));
for (int i = 0; i < num_syscalls; i++) {
seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
seccomp_syscall_resolve_name(syscalls[i]), 0);
}
seccomp_load(ctx);
seccomp_release(ctx);
}

int main(int argc, char *argv[])
{
int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

apply_seccomp_filter(syscalls, num_syscalls);

probed();

return 0;
}
EOF

cat > /tmp/trace.bt << EOF
uretprobe:/tmp/x:probed
{
    printf("ret=%d\n", retval);
}
EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach
/tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-10 15:12 Crash when attaching uretprobes to processes running in Docker Eyal Birger
@ 2025-01-10 15:25 ` Aleksa Sarai
  2025-01-11 18:40   ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Aleksa Sarai @ 2025-01-10 15:25 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Jiri Olsa, olsajiri, mhiramat, oleg, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

[-- Attachment #1: Type: text/plain, Size: 2167 bytes --]

On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> Hi,
> 
> When attaching uretprobes to processes running inside docker, the attached
> process is segfaulted when encountering the retprobe. The offending commit
> is:
> 
> ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> 
> To my understanding, the reason is that now that uretprobe is a system call,
> the default seccomp filters in docker block it as they only allow a specific
> set of known syscalls.

FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
uretprobe (runc has a bunch of ugly logic to try to guarantee this if
Docker hasn't updated their profile to include it). Though I guess that
isn't sufficient for the magic that uretprobe(2) does...

> This behavior can be reproduced by the below bash script, which works before
> this commit.
> 
> Reported-by: Rafael Buchbinder <rafi@rbk.io>
> 
> Eyal.
> 
> --- CODE ---
> #!/bin/bash
> 
> cat > /tmp/x.c << EOF
> #include <stdio.h>
> #include <seccomp.h>
> 
> char *syscalls[] = {
> "write",
> "exit_group",
> };
> 
> __attribute__((noinline)) int probed(void)
> {
> printf("Probed\n");
> return 1;
> }
> 
> void apply_seccomp_filter(char **syscalls, int num_syscalls)
> {
> scmp_filter_ctx ctx;
> 
> ctx = seccomp_init(SCMP_ACT_ERRNO(1));
> for (int i = 0; i < num_syscalls; i++) {
> seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> seccomp_syscall_resolve_name(syscalls[i]), 0);
> }
> seccomp_load(ctx);
> seccomp_release(ctx);
> }
> 
> int main(int argc, char *argv[])
> {
> int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> 
> apply_seccomp_filter(syscalls, num_syscalls);
> 
> probed();
> 
> return 0;
> }
> EOF
> 
> cat > /tmp/trace.bt << EOF
> uretprobe:/tmp/x:probed
> {
>     printf("ret=%d\n", retval);
> }
> EOF
> 
> gcc -o /tmp/x /tmp/x.c -lseccomp
> 
> /usr/bin/bpftrace /tmp/trace.bt &
> 
> sleep 5 # wait for uretprobe attach
> /tmp/x
> 
> pkill bpftrace
> 
> rm /tmp/x /tmp/x.c /tmp/trace.bt
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-10 15:25 ` Aleksa Sarai
@ 2025-01-11 18:40   ` Jiri Olsa
  2025-01-14  9:22     ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2025-01-11 18:40 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Eyal Birger, olsajiri, mhiramat, oleg, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Sat, Jan 11, 2025 at 02:25:37AM +1100, Aleksa Sarai wrote:
> On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> > Hi,
> > 
> > When attaching uretprobes to processes running inside docker, the attached
> > process is segfaulted when encountering the retprobe. The offending commit
> > is:
> > 
> > ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > 
> > To my understanding, the reason is that now that uretprobe is a system call,
> > the default seccomp filters in docker block it as they only allow a specific
> > set of known syscalls.
> 
> FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
> uretprobe (runc has a bunch of ugly logic to try to guarantee this if
> Docker hasn't updated their profile to include it). Though I guess that
> isn't sufficient for the magic that uretprobe(2) does...
> 
> > This behavior can be reproduced by the below bash script, which works before
> > this commit.
> > 
> > Reported-by: Rafael Buchbinder <rafi@rbk.io>

hi,
nice ;-) thanks for the report, the problem seems to be that uretprobe syscall
is blocked and uretprobe trampoline does not expect that

I think we could add code to the uretprobe trampoline to detect this and
execute standard int3 as fallback to process uretprobe, I'm checking on that

jirka


> > 
> > Eyal.
> > 
> > --- CODE ---
> > #!/bin/bash
> > 
> > cat > /tmp/x.c << EOF
> > #include <stdio.h>
> > #include <seccomp.h>
> > 
> > char *syscalls[] = {
> > "write",
> > "exit_group",
> > };
> > 
> > __attribute__((noinline)) int probed(void)
> > {
> > printf("Probed\n");
> > return 1;
> > }
> > 
> > void apply_seccomp_filter(char **syscalls, int num_syscalls)
> > {
> > scmp_filter_ctx ctx;
> > 
> > ctx = seccomp_init(SCMP_ACT_ERRNO(1));
> > for (int i = 0; i < num_syscalls; i++) {
> > seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
> > seccomp_syscall_resolve_name(syscalls[i]), 0);
> > }
> > seccomp_load(ctx);
> > seccomp_release(ctx);
> > }
> > 
> > int main(int argc, char *argv[])
> > {
> > int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
> > 
> > apply_seccomp_filter(syscalls, num_syscalls);
> > 
> > probed();
> > 
> > return 0;
> > }
> > EOF
> > 
> > cat > /tmp/trace.bt << EOF
> > uretprobe:/tmp/x:probed
> > {
> >     printf("ret=%d\n", retval);
> > }
> > EOF
> > 
> > gcc -o /tmp/x /tmp/x.c -lseccomp
> > 
> > /usr/bin/bpftrace /tmp/trace.bt &
> > 
> > sleep 5 # wait for uretprobe attach
> > /tmp/x
> > 
> > pkill bpftrace
> > 
> > rm /tmp/x /tmp/x.c /tmp/trace.bt
> > 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> https://www.cyphar.com/



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-11 18:40   ` Jiri Olsa
@ 2025-01-14  9:22     ` Jiri Olsa
  2025-01-14 10:05       ` Masami Hiramatsu
                         ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Jiri Olsa @ 2025-01-14  9:22 UTC (permalink / raw)
  To: Jiri Olsa, oleg
  Cc: Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Sat, Jan 11, 2025 at 07:40:15PM +0100, Jiri Olsa wrote:
> On Sat, Jan 11, 2025 at 02:25:37AM +1100, Aleksa Sarai wrote:
> > On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> > > Hi,
> > > 
> > > When attaching uretprobes to processes running inside docker, the attached
> > > process is segfaulted when encountering the retprobe. The offending commit
> > > is:
> > > 
> > > ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > 
> > > To my understanding, the reason is that now that uretprobe is a system call,
> > > the default seccomp filters in docker block it as they only allow a specific
> > > set of known syscalls.
> > 
> > FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
> > uretprobe (runc has a bunch of ugly logic to try to guarantee this if
> > Docker hasn't updated their profile to include it). Though I guess that
> > isn't sufficient for the magic that uretprobe(2) does...
> > 
> > > This behavior can be reproduced by the below bash script, which works before
> > > this commit.
> > > 
> > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> 
> hi,
> nice ;-) thanks for the report, the problem seems to be that uretprobe syscall
> is blocked and uretprobe trampoline does not expect that
> 
> I think we could add code to the uretprobe trampoline to detect this and
> execute standard int3 as fallback to process uretprobe, I'm checking on that

hack below seems to fix the issue, it's using rbx to signal that uretprobe
syscall got executed, if not, trampoline does int3 and executes uretprobe
handler in the old way

unfortunately now the uretprobe trampoline size crosses the xol slot limit so
will need to come up with some generic/arch code solution for that, code below
is neglecting that for now

jirka


---
 arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
 include/linux/uprobes.h   |  1 +
 kernel/events/uprobes.c   | 10 ++++++++--
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 5a952c5ea66b..b54863f6fa25 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -315,14 +315,25 @@ asm (
 	".global uretprobe_trampoline_entry\n"
 	"uretprobe_trampoline_entry:\n"
 	"pushq %rax\n"
+	"pushq %rbx\n"
 	"pushq %rcx\n"
 	"pushq %r11\n"
+	"movq $1, %rbx\n"
 	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
 	"syscall\n"
 	".global uretprobe_syscall_check\n"
 	"uretprobe_syscall_check:\n"
+	"or %rbx,%rbx\n"
+	"jz uretprobe_syscall_return\n"
 	"popq %r11\n"
 	"popq %rcx\n"
+	"popq %rbx\n"
+	"popq %rax\n"
+	"int3\n"
+	"uretprobe_syscall_return:\n"
+	"popq %r11\n"
+	"popq %rcx\n"
+	"popq %rbx\n"
 
 	/* The uretprobe syscall replaces stored %rax value with final
 	 * return address, so we don't restore %rax in here and just
@@ -338,6 +349,16 @@ extern u8 uretprobe_trampoline_entry[];
 extern u8 uretprobe_trampoline_end[];
 extern u8 uretprobe_syscall_check[];
 
+#define UINSNS_PER_PAGE                 (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
+
+bool arch_is_uretprobe_trampoline(unsigned long vaddr)
+{
+	unsigned long start = uprobe_get_trampoline_vaddr();
+	unsigned long end = start + 2*UINSNS_PER_PAGE;
+
+	return vaddr >= start && vaddr < end;
+}
+
 void *arch_uprobe_trampoline(unsigned long *psize)
 {
 	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
@@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
 	regs->r11 = regs->flags;
 	regs->cx  = regs->ip;
 
+	/* zero rbx to signal trampoline that uretprobe syscall was executed */
+	regs->bx  = 0;
+
 	return regs->ax;
 
 sigill:
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e0a4c2082245..dbde57a68a1b 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
+bool arch_is_uretprobe_trampoline(unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fa04b14a7d72..73df64109f38 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1703,6 +1703,11 @@ void * __weak arch_uprobe_trampoline(unsigned long *psize)
 	return &insn;
 }
 
+bool __weak arch_is_uretprobe_trampoline(unsigned long vaddr)
+{
+	return vaddr == uprobe_get_trampoline_vaddr();
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
@@ -1725,8 +1730,9 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 
 	area->vaddr = vaddr;
 	init_waitqueue_head(&area->wq);
-	/* Reserve the 1st slot for get_trampoline_vaddr() */
+	/* Reserve the first two slots for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
+	set_bit(1, area->bitmap);
 	insns = arch_uprobe_trampoline(&insns_size);
 	arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
 
@@ -2536,7 +2542,7 @@ static void handle_swbp(struct pt_regs *regs)
 	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
-	if (bp_vaddr == uprobe_get_trampoline_vaddr())
+	if (arch_is_uretprobe_trampoline(bp_vaddr))
 		return uprobe_handle_trampoline(regs);
 
 	rcu_read_lock_trace();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14  9:22     ` Jiri Olsa
@ 2025-01-14 10:05       ` Masami Hiramatsu
  2025-01-14 11:21         ` Oleg Nesterov
  2025-01-14 10:42       ` Peter Zijlstra
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Masami Hiramatsu @ 2025-01-14 10:05 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: oleg, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, 14 Jan 2025 10:22:20 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> On Sat, Jan 11, 2025 at 07:40:15PM +0100, Jiri Olsa wrote:
> > On Sat, Jan 11, 2025 at 02:25:37AM +1100, Aleksa Sarai wrote:
> > > On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> > > > Hi,
> > > > 
> > > > When attaching uretprobes to processes running inside docker, the attached
> > > > process is segfaulted when encountering the retprobe. The offending commit
> > > > is:
> > > > 
> > > > ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > > 
> > > > To my understanding, the reason is that now that uretprobe is a system call,
> > > > the default seccomp filters in docker block it as they only allow a specific
> > > > set of known syscalls.
> > > 
> > > FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
> > > uretprobe (runc has a bunch of ugly logic to try to guarantee this if
> > > Docker hasn't updated their profile to include it). Though I guess that
> > > isn't sufficient for the magic that uretprobe(2) does...
> > > 
> > > > This behavior can be reproduced by the below bash script, which works before
> > > > this commit.
> > > > 
> > > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > 
> > hi,
> > nice ;-) thanks for the report, the problem seems to be that uretprobe syscall
> > is blocked and uretprobe trampoline does not expect that
> > 
> > I think we could add code to the uretprobe trampoline to detect this and
> > execute standard int3 as fallback to process uretprobe, I'm checking on that
> 
> hack below seems to fix the issue, it's using rbx to signal that uretprobe
> syscall got executed, if not, trampoline does int3 and executes uretprobe
> handler in the old way
> 
> unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> will need to come up with some generic/arch code solution for that, code below
> is neglecting that for now
> 
> jirka
> 
> 
> ---
>  arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
>  include/linux/uprobes.h   |  1 +
>  kernel/events/uprobes.c   | 10 ++++++++--
>  3 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 5a952c5ea66b..b54863f6fa25 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -315,14 +315,25 @@ asm (
>  	".global uretprobe_trampoline_entry\n"
>  	"uretprobe_trampoline_entry:\n"
>  	"pushq %rax\n"
> +	"pushq %rbx\n"
>  	"pushq %rcx\n"
>  	"pushq %r11\n"
> +	"movq $1, %rbx\n"
>  	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
>  	"syscall\n"
>  	".global uretprobe_syscall_check\n"
>  	"uretprobe_syscall_check:\n"
> +	"or %rbx,%rbx\n"
> +	"jz uretprobe_syscall_return\n"
>  	"popq %r11\n"
>  	"popq %rcx\n"
> +	"popq %rbx\n"
> +	"popq %rax\n"
> +	"int3\n"
> +	"uretprobe_syscall_return:\n"
> +	"popq %r11\n"
> +	"popq %rcx\n"
> +	"popq %rbx\n"
>  
>  	/* The uretprobe syscall replaces stored %rax value with final
>  	 * return address, so we don't restore %rax in here and just
> @@ -338,6 +349,16 @@ extern u8 uretprobe_trampoline_entry[];
>  extern u8 uretprobe_trampoline_end[];
>  extern u8 uretprobe_syscall_check[];
>  
> +#define UINSNS_PER_PAGE                 (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
> +
> +bool arch_is_uretprobe_trampoline(unsigned long vaddr)
> +{
> +	unsigned long start = uprobe_get_trampoline_vaddr();
> +	unsigned long end = start + 2*UINSNS_PER_PAGE;
> +
> +	return vaddr >= start && vaddr < end;
> +}
> +
>  void *arch_uprobe_trampoline(unsigned long *psize)
>  {
>  	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> @@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
>  	regs->r11 = regs->flags;
>  	regs->cx  = regs->ip;
>  
> +	/* zero rbx to signal trampoline that uretprobe syscall was executed */
> +	regs->bx  = 0;

Can we just return -ENOSYS as like as other syscall instead of
using rbx as a side channel?
We can carefully check the return address is not -ERRNO when set up
and reserve the -ENOSYS for this use case.

Thank you,

> +
>  	return regs->ax;
>  
>  sigill:
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index e0a4c2082245..dbde57a68a1b 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
> +bool arch_is_uretprobe_trampoline(unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index fa04b14a7d72..73df64109f38 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1703,6 +1703,11 @@ void * __weak arch_uprobe_trampoline(unsigned long *psize)
>  	return &insn;
>  }
>  
> +bool __weak arch_is_uretprobe_trampoline(unsigned long vaddr)
> +{
> +	return vaddr == uprobe_get_trampoline_vaddr();
> +}
> +
>  static struct xol_area *__create_xol_area(unsigned long vaddr)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -1725,8 +1730,9 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
>  
>  	area->vaddr = vaddr;
>  	init_waitqueue_head(&area->wq);
> -	/* Reserve the 1st slot for get_trampoline_vaddr() */
> +	/* Reserve the first two slots for get_trampoline_vaddr() */
>  	set_bit(0, area->bitmap);
> +	set_bit(1, area->bitmap);
>  	insns = arch_uprobe_trampoline(&insns_size);
>  	arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
>  
> @@ -2536,7 +2542,7 @@ static void handle_swbp(struct pt_regs *regs)
>  	int is_swbp;
>  
>  	bp_vaddr = uprobe_get_swbp_addr(regs);
> -	if (bp_vaddr == uprobe_get_trampoline_vaddr())
> +	if (arch_is_uretprobe_trampoline(bp_vaddr))
>  		return uprobe_handle_trampoline(regs);
>  
>  	rcu_read_lock_trace();
> -- 
> 2.47.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14  9:22     ` Jiri Olsa
  2025-01-14 10:05       ` Masami Hiramatsu
@ 2025-01-14 10:42       ` Peter Zijlstra
  2025-01-14 11:01         ` Oleg Nesterov
  2025-01-14 10:58       ` Oleg Nesterov
  2025-01-14 14:08       ` Eyal Birger
  3 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-01-14 10:42 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: oleg, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 10:22:20AM +0100, Jiri Olsa wrote:
> On Sat, Jan 11, 2025 at 07:40:15PM +0100, Jiri Olsa wrote:
> > On Sat, Jan 11, 2025 at 02:25:37AM +1100, Aleksa Sarai wrote:
> > > On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> > > > Hi,
> > > > 
> > > > When attaching uretprobes to processes running inside docker, the attached
> > > > process is segfaulted when encountering the retprobe. The offending commit
> > > > is:
> > > > 
> > > > ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > > 
> > > > To my understanding, the reason is that now that uretprobe is a system call,
> > > > the default seccomp filters in docker block it as they only allow a specific
> > > > set of known syscalls.
> > > 
> > > FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
> > > uretprobe (runc has a bunch of ugly logic to try to guarantee this if
> > > Docker hasn't updated their profile to include it). Though I guess that
> > > isn't sufficient for the magic that uretprobe(2) does...
> > > 
> > > > This behavior can be reproduced by the below bash script, which works before
> > > > this commit.
> > > > 
> > > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> > 
> > hi,
> > nice ;-) thanks for the report, the problem seems to be that uretprobe syscall
> > is blocked and uretprobe trampoline does not expect that
> > 
> > I think we could add code to the uretprobe trampoline to detect this and
> > execute standard int3 as fallback to process uretprobe, I'm checking on that
> 
> hack below seems to fix the issue, it's using rbx to signal that uretprobe
> syscall got executed, if not, trampoline does int3 and executes uretprobe
> handler in the old way
> 
> unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> will need to come up with some generic/arch code solution for that, code below
> is neglecting that for now

Can't you detect the filter earlier and simply not install the
trampoline?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14  9:22     ` Jiri Olsa
  2025-01-14 10:05       ` Masami Hiramatsu
  2025-01-14 10:42       ` Peter Zijlstra
@ 2025-01-14 10:58       ` Oleg Nesterov
  2025-01-14 14:19         ` Jiri Olsa
  2025-01-14 14:08       ` Eyal Birger
  3 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 10:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Jiri Olsa wrote:
>
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -315,14 +315,25 @@ asm (
>  	".global uretprobe_trampoline_entry\n"
>  	"uretprobe_trampoline_entry:\n"
>  	"pushq %rax\n"
> +	"pushq %rbx\n"
>  	"pushq %rcx\n"
>  	"pushq %r11\n"
> +	"movq $1, %rbx\n"
>  	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
>  	"syscall\n"
>  	".global uretprobe_syscall_check\n"
>  	"uretprobe_syscall_check:\n"
> +	"or %rbx,%rbx\n"
> +	"jz uretprobe_syscall_return\n"
>  	"popq %r11\n"
>  	"popq %rcx\n"
> +	"popq %rbx\n"
> +	"popq %rax\n"
> +	"int3\n"
> +	"uretprobe_syscall_return:\n"
> +	"popq %r11\n"
> +	"popq %rcx\n"
> +	"popq %rbx\n"

But why do we need to abuse %rbx? Can't uretprobe_trampoline_entry do

	syscall

// int3_section, in case sys_uretprobe() doesn't work
	popq %r11
	popq %rcx
	popq %rax
	int3

uretprobe_syscall_return:
	popq %r11
	popq %rcx
	popq %rbx
	retq

and change sys_uretprobe() to do

	- regs->ip = ip;
	+ regs->ip = ip + sizeof(int3_section);

?

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 10:42       ` Peter Zijlstra
@ 2025-01-14 11:01         ` Oleg Nesterov
  2025-01-14 12:02           ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 11:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Peter Zijlstra wrote:
>
> On Tue, Jan 14, 2025 at 10:22:20AM +0100, Jiri Olsa wrote:
> >
> > hack below seems to fix the issue, it's using rbx to signal that uretprobe
> > syscall got executed, if not, trampoline does int3 and executes uretprobe
> > handler in the old way
> >
> > unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> > will need to come up with some generic/arch code solution for that, code below
> > is neglecting that for now
>
> Can't you detect the filter earlier and simply not install the
> trampoline?

Did you mean detect the filter in prepare_uretprobe() ?

The probed function can install the filter before return...

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 10:05       ` Masami Hiramatsu
@ 2025-01-14 11:21         ` Oleg Nesterov
  2025-01-14 14:21           ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 11:21 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Masami Hiramatsu wrote:
>
> On Tue, 14 Jan 2025 10:22:20 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > @@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
> >  	regs->r11 = regs->flags;
> >  	regs->cx  = regs->ip;
> >
> > +	/* zero rbx to signal trampoline that uretprobe syscall was executed */
> > +	regs->bx  = 0;
>
> Can we just return -ENOSYS as like as other syscall instead of
> using rbx as a side channel?
> We can carefully check the return address is not -ERRNO when set up
> and reserve the -ENOSYS for this use case.

Not sure I understand...

But please not that the uretprobed function can return any value
including -ENOSYS, and this is what sys_uretprobe() has to return.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 11:01         ` Oleg Nesterov
@ 2025-01-14 12:02           ` Peter Zijlstra
  2025-01-14 12:32             ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-01-14 12:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 12:01:50PM +0100, Oleg Nesterov wrote:
> On 01/14, Peter Zijlstra wrote:
> >
> > On Tue, Jan 14, 2025 at 10:22:20AM +0100, Jiri Olsa wrote:
> > >
> > > hack below seems to fix the issue, it's using rbx to signal that uretprobe
> > > syscall got executed, if not, trampoline does int3 and executes uretprobe
> > > handler in the old way
> > >
> > > unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> > > will need to come up with some generic/arch code solution for that, code below
> > > is neglecting that for now
> >
> > Can't you detect the filter earlier and simply not install the
> > trampoline?
> 
> Did you mean detect the filter in prepare_uretprobe() ?

Yep. Aren't syscall filters static for the duration of the task?

> The probed function can install the filter before return...

If you're running a task with dynamic syscall filtering, you get to keep
the pieces no?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 12:02           ` Peter Zijlstra
@ 2025-01-14 12:32             ` Oleg Nesterov
  2025-01-14 14:07               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Peter Zijlstra wrote:
>
> On Tue, Jan 14, 2025 at 12:01:50PM +0100, Oleg Nesterov wrote:
> > On 01/14, Peter Zijlstra wrote:
> > >
> > > On Tue, Jan 14, 2025 at 10:22:20AM +0100, Jiri Olsa wrote:
> > > >
> > > > hack below seems to fix the issue, it's using rbx to signal that uretprobe
> > > > syscall got executed, if not, trampoline does int3 and executes uretprobe
> > > > handler in the old way
> > > >
> > > > unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> > > > will need to come up with some generic/arch code solution for that, code below
> > > > is neglecting that for now
> > >
> > > Can't you detect the filter earlier and simply not install the
> > > trampoline?
> >
> > Did you mean detect the filter in prepare_uretprobe() ?
>
> Yep. Aren't syscall filters static for the duration of the task?
>
> > The probed function can install the filter before return...
>
> If you're running a task with dynamic syscall filtering, you get to keep
> the pieces no?

Sorry, I don't understand... Perhaps because I am enjoying my state after
dentist appointment ;)

OK, suppose we have

	void start_SECCOMP_MODE_STRICT(void)
	{
		// in particular nacks __NR_uretprobe
		seccomp(SECCOMP_MODE_STRICT, ...);
	}

and we want to add uretprobe to this function.

In this case prepare_uretprobe() can't know that sys_uretprobe() won't
work when this function returns?

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 12:32             ` Oleg Nesterov
@ 2025-01-14 14:07               ` Peter Zijlstra
  2025-01-14 17:43                 ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-01-14 14:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 01:32:58PM +0100, Oleg Nesterov wrote:

> Sorry, I don't understand... Perhaps because I am enjoying my state after
> dentist appointment ;)

For some reason I thought to remember that parent thread would spawn
restricted child, however:

> OK, suppose we have
> 
> 	void start_SECCOMP_MODE_STRICT(void)
> 	{
> 		// in particular nacks __NR_uretprobe
> 		seccomp(SECCOMP_MODE_STRICT, ...);
> 	}
> 
> and we want to add uretprobe to this function.
> 
> In this case prepare_uretprobe() can't know that sys_uretprobe() won't
> work when this function returns?

Indeed. But any further probes placed after seccomp() would be able to,
and installing trampolines for them would be a waste, no?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14  9:22     ` Jiri Olsa
                         ` (2 preceding siblings ...)
  2025-01-14 10:58       ` Oleg Nesterov
@ 2025-01-14 14:08       ` Eyal Birger
  2025-01-14 14:33         ` Oleg Nesterov
  3 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-14 14:08 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: oleg, Aleksa Sarai, mhiramat, linux-kernel, linux-trace-kernel,
	BPF-dev-list, Song Liu, Yonghong Song, John Fastabend, peterz,
	tglx, bp, x86, linux-api, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov, Andrii Nakryiko, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

Hi Jiri,

On Tue, Jan 14, 2025 at 1:22 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Sat, Jan 11, 2025 at 07:40:15PM +0100, Jiri Olsa wrote:
> > On Sat, Jan 11, 2025 at 02:25:37AM +1100, Aleksa Sarai wrote:
> > > On 2025-01-10, Eyal Birger <eyal.birger@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > When attaching uretprobes to processes running inside docker, the attached
> > > > process is segfaulted when encountering the retprobe. The offending commit
> > > > is:
> > > >
> > > > ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
> > > >
> > > > To my understanding, the reason is that now that uretprobe is a system call,
> > > > the default seccomp filters in docker block it as they only allow a specific
> > > > set of known syscalls.
> > >
> > > FWIW, the default seccomp profile of Docker _should_ return -ENOSYS for
> > > uretprobe (runc has a bunch of ugly logic to try to guarantee this if
> > > Docker hasn't updated their profile to include it). Though I guess that
> > > isn't sufficient for the magic that uretprobe(2) does...
> > >
> > > > This behavior can be reproduced by the below bash script, which works before
> > > > this commit.
> > > >
> > > > Reported-by: Rafael Buchbinder <rafi@rbk.io>
> >
> > hi,
> > nice ;-) thanks for the report, the problem seems to be that uretprobe syscall
> > is blocked and uretprobe trampoline does not expect that
> >
> > I think we could add code to the uretprobe trampoline to detect this and
> > execute standard int3 as fallback to process uretprobe, I'm checking on that
>
> hack below seems to fix the issue, it's using rbx to signal that uretprobe
> syscall got executed, if not, trampoline does int3 and executes uretprobe
> handler in the old way

FWIW If I change the seccomp policy to SCMP_ACT_KILL this still fails.

Eyal.

>
> unfortunately now the uretprobe trampoline size crosses the xol slot limit so
> will need to come up with some generic/arch code solution for that, code below
> is neglecting that for now


>
> jirka
>
>
> ---
>  arch/x86/kernel/uprobes.c | 24 ++++++++++++++++++++++++
>  include/linux/uprobes.h   |  1 +
>  kernel/events/uprobes.c   | 10 ++++++++--
>  3 files changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 5a952c5ea66b..b54863f6fa25 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -315,14 +315,25 @@ asm (
>         ".global uretprobe_trampoline_entry\n"
>         "uretprobe_trampoline_entry:\n"
>         "pushq %rax\n"
> +       "pushq %rbx\n"
>         "pushq %rcx\n"
>         "pushq %r11\n"
> +       "movq $1, %rbx\n"
>         "movq $" __stringify(__NR_uretprobe) ", %rax\n"
>         "syscall\n"
>         ".global uretprobe_syscall_check\n"
>         "uretprobe_syscall_check:\n"
> +       "or %rbx,%rbx\n"
> +       "jz uretprobe_syscall_return\n"
>         "popq %r11\n"
>         "popq %rcx\n"
> +       "popq %rbx\n"
> +       "popq %rax\n"
> +       "int3\n"
> +       "uretprobe_syscall_return:\n"
> +       "popq %r11\n"
> +       "popq %rcx\n"
> +       "popq %rbx\n"
>
>         /* The uretprobe syscall replaces stored %rax value with final
>          * return address, so we don't restore %rax in here and just
> @@ -338,6 +349,16 @@ extern u8 uretprobe_trampoline_entry[];
>  extern u8 uretprobe_trampoline_end[];
>  extern u8 uretprobe_syscall_check[];
>
> +#define UINSNS_PER_PAGE                 (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES)
> +
> +bool arch_is_uretprobe_trampoline(unsigned long vaddr)
> +{
> +       unsigned long start = uprobe_get_trampoline_vaddr();
> +       unsigned long end = start + 2*UINSNS_PER_PAGE;
> +
> +       return vaddr >= start && vaddr < end;
> +}
> +
>  void *arch_uprobe_trampoline(unsigned long *psize)
>  {
>         static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> @@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
>         regs->r11 = regs->flags;
>         regs->cx  = regs->ip;
>
> +       /* zero rbx to signal trampoline that uretprobe syscall was executed */
> +       regs->bx  = 0;
> +
>         return regs->ax;
>
>  sigill:
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index e0a4c2082245..dbde57a68a1b 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
> +bool arch_is_uretprobe_trampoline(unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index fa04b14a7d72..73df64109f38 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1703,6 +1703,11 @@ void * __weak arch_uprobe_trampoline(unsigned long *psize)
>         return &insn;
>  }
>
> +bool __weak arch_is_uretprobe_trampoline(unsigned long vaddr)
> +{
> +       return vaddr == uprobe_get_trampoline_vaddr();
> +}
> +
>  static struct xol_area *__create_xol_area(unsigned long vaddr)
>  {
>         struct mm_struct *mm = current->mm;
> @@ -1725,8 +1730,9 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
>
>         area->vaddr = vaddr;
>         init_waitqueue_head(&area->wq);
> -       /* Reserve the 1st slot for get_trampoline_vaddr() */
> +       /* Reserve the first two slots for get_trampoline_vaddr() */
>         set_bit(0, area->bitmap);
> +       set_bit(1, area->bitmap);
>         insns = arch_uprobe_trampoline(&insns_size);
>         arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
>
> @@ -2536,7 +2542,7 @@ static void handle_swbp(struct pt_regs *regs)
>         int is_swbp;
>
>         bp_vaddr = uprobe_get_swbp_addr(regs);
> -       if (bp_vaddr == uprobe_get_trampoline_vaddr())
> +       if (arch_is_uretprobe_trampoline(bp_vaddr))
>                 return uprobe_handle_trampoline(regs);
>
>         rcu_read_lock_trace();
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 10:58       ` Oleg Nesterov
@ 2025-01-14 14:19         ` Jiri Olsa
  2025-01-14 19:21           ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2025-01-14 14:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 11:58:03AM +0100, Oleg Nesterov wrote:
> On 01/14, Jiri Olsa wrote:
> >
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -315,14 +315,25 @@ asm (
> >  	".global uretprobe_trampoline_entry\n"
> >  	"uretprobe_trampoline_entry:\n"
> >  	"pushq %rax\n"
> > +	"pushq %rbx\n"
> >  	"pushq %rcx\n"
> >  	"pushq %r11\n"
> > +	"movq $1, %rbx\n"
> >  	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
> >  	"syscall\n"
> >  	".global uretprobe_syscall_check\n"
> >  	"uretprobe_syscall_check:\n"
> > +	"or %rbx,%rbx\n"
> > +	"jz uretprobe_syscall_return\n"
> >  	"popq %r11\n"
> >  	"popq %rcx\n"
> > +	"popq %rbx\n"
> > +	"popq %rax\n"
> > +	"int3\n"
> > +	"uretprobe_syscall_return:\n"
> > +	"popq %r11\n"
> > +	"popq %rcx\n"
> > +	"popq %rbx\n"
> 
> But why do we need to abuse %rbx? Can't uretprobe_trampoline_entry do
> 
> 	syscall
> 
> // int3_section, in case sys_uretprobe() doesn't work
> 	popq %r11
> 	popq %rcx
> 	popq %rax
> 	int3
> 
> uretprobe_syscall_return:
> 	popq %r11
> 	popq %rcx
> 	popq %rbx
> 	retq
> 
> and change sys_uretprobe() to do
> 
> 	- regs->ip = ip;
> 	+ regs->ip = ip + sizeof(int3_section);

nice idea, I wonder we get the trampoline size under one xol slot with that

thanks,
jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 11:21         ` Oleg Nesterov
@ 2025-01-14 14:21           ` Jiri Olsa
  2025-01-17  1:23             ` Masami Hiramatsu
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2025-01-14 14:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Masami Hiramatsu, Jiri Olsa, Aleksa Sarai, Eyal Birger,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, peterz, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 12:21:07PM +0100, Oleg Nesterov wrote:
> On 01/14, Masami Hiramatsu wrote:
> >
> > On Tue, 14 Jan 2025 10:22:20 +0100
> > Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > > @@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
> > >  	regs->r11 = regs->flags;
> > >  	regs->cx  = regs->ip;
> > >
> > > +	/* zero rbx to signal trampoline that uretprobe syscall was executed */
> > > +	regs->bx  = 0;
> >
> > Can we just return -ENOSYS as like as other syscall instead of
> > using rbx as a side channel?
> > We can carefully check the return address is not -ERRNO when set up
> > and reserve the -ENOSYS for this use case.
> 
> Not sure I understand...
> 
> But please not that the uretprobed function can return any value
> including -ENOSYS, and this is what sys_uretprobe() has to return.

right, uretprobe syscall returns value of the uretprobed function,
so we can't use any reserved value

jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:08       ` Eyal Birger
@ 2025-01-14 14:33         ` Oleg Nesterov
  2025-01-14 14:56           ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 14:33 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Jiri Olsa, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Eyal Birger wrote:
>
> FWIW If I change the seccomp policy to SCMP_ACT_KILL this still fails.

Ah... I don't know what SCMP_ACT_KILL is, but indeed, in general it is
not safe to even try to call sys_uretprobe() if it is filtered.

Say, __secure_computing(SECCOMP_MODE_STRICT)->__secure_computing_strict()
does do_exit(SIGKILL) :/

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:33         ` Oleg Nesterov
@ 2025-01-14 14:56           ` Jiri Olsa
  2025-01-14 17:25             ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Jiri Olsa @ 2025-01-14 14:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eyal Birger, Jiri Olsa, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 03:33:13PM +0100, Oleg Nesterov wrote:
> On 01/14, Eyal Birger wrote:
> >
> > FWIW If I change the seccomp policy to SCMP_ACT_KILL this still fails.
> 
> Ah... I don't know what SCMP_ACT_KILL is, but indeed, in general it is
> not safe to even try to call sys_uretprobe() if it is filtered.
> 
> Say, __secure_computing(SECCOMP_MODE_STRICT)->__secure_computing_strict()
> does do_exit(SIGKILL) :/

ugh.. could we just 'disable' uretprobe trampoline when seccomp gets enabled?
overwrite first byte with int3.. and similarly check on seccomp when installing
uretprobe and switch to int3

jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:56           ` Jiri Olsa
@ 2025-01-14 17:25             ` Oleg Nesterov
  2025-01-15  9:36               ` Jiri Olsa
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 17:25 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Eyal Birger, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Jiri Olsa wrote:
>
> ugh.. could we just 'disable' uretprobe trampoline when seccomp gets enabled?
> overwrite first byte with int3.. and similarly check on seccomp when installing
> uretprobe and switch to int3

Sorry, I don't understand... What exactly we can do? Aside from checking
IS_ENABLED(CONFIG_SECCOMP) in arch_uprobe_trampoline() ?

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:07               ` Peter Zijlstra
@ 2025-01-14 17:43                 ` Oleg Nesterov
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 17:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/14, Peter Zijlstra wrote:
>
> On Tue, Jan 14, 2025 at 01:32:58PM +0100, Oleg Nesterov wrote:
>
> > OK, suppose we have
> >
> > 	void start_SECCOMP_MODE_STRICT(void)
> > 	{
> > 		// in particular nacks __NR_uretprobe
> > 		seccomp(SECCOMP_MODE_STRICT, ...);
> > 	}
> >
> > and we want to add uretprobe to this function.
> >
> > In this case prepare_uretprobe() can't know that sys_uretprobe() won't
> > work when this function returns?
>
> Indeed. But any further probes placed after seccomp() would be able to,
> and installing trampolines for them would be a waste, no?

But the probed task will crash when it returns from
start_SECCOMP_MODE_STRICT() above.

Even if, due to seccomp filtering, sys_uretprobe() doesn't kill the task
(I missed the fact it can) but just returns ENOSYS/whatever.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:19         ` Jiri Olsa
@ 2025-01-14 19:21           ` Andrii Nakryiko
  2025-01-14 20:39             ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2025-01-14 19:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

On Tue, Jan 14, 2025 at 6:19 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 11:58:03AM +0100, Oleg Nesterov wrote:
> > On 01/14, Jiri Olsa wrote:
> > >
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -315,14 +315,25 @@ asm (
> > >     ".global uretprobe_trampoline_entry\n"
> > >     "uretprobe_trampoline_entry:\n"
> > >     "pushq %rax\n"
> > > +   "pushq %rbx\n"
> > >     "pushq %rcx\n"
> > >     "pushq %r11\n"
> > > +   "movq $1, %rbx\n"
> > >     "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > >     "syscall\n"
> > >     ".global uretprobe_syscall_check\n"
> > >     "uretprobe_syscall_check:\n"
> > > +   "or %rbx,%rbx\n"
> > > +   "jz uretprobe_syscall_return\n"
> > >     "popq %r11\n"
> > >     "popq %rcx\n"
> > > +   "popq %rbx\n"
> > > +   "popq %rax\n"
> > > +   "int3\n"
> > > +   "uretprobe_syscall_return:\n"
> > > +   "popq %r11\n"
> > > +   "popq %rcx\n"
> > > +   "popq %rbx\n"
> >
> > But why do we need to abuse %rbx? Can't uretprobe_trampoline_entry do
> >
> >       syscall
> >
> > // int3_section, in case sys_uretprobe() doesn't work
> >       popq %r11
> >       popq %rcx
> >       popq %rax
> >       int3
> >
> > uretprobe_syscall_return:
> >       popq %r11
> >       popq %rcx
> >       popq %rbx
> >       retq
> >
> > and change sys_uretprobe() to do
> >
> >       - regs->ip = ip;
> >       + regs->ip = ip + sizeof(int3_section);
>
> nice idea, I wonder we get the trampoline size under one xol slot with that
>

Should we just fix whoever is blocking kernel-internal special syscall
(sys_uretprobe)? What would happen if someone blocked that other
special kernel-internal syscall for signal handling (can't remember
the name, but the one that was an inspiration/justification for
sys_uretprobe)?


> thanks,
> jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 19:21           ` Andrii Nakryiko
@ 2025-01-14 20:39             ` Oleg Nesterov
  2025-01-14 21:45               ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 20:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

On 01/14, Andrii Nakryiko wrote:
>
> Should we just fix whoever is blocking kernel-internal special syscall
> (sys_uretprobe)?

Well, we can add __NR_uretprobe to mode1_syscalls[] but this won't
really help.

We can't "fix" the existing user-space setups which can nack any
"unnecessary/unknown" syscall.

> What would happen if someone blocked that other
> special kernel-internal syscall for signal handling (can't remember
> the name,

sys_rt_sigreturn().

Yes, the task will crash after return from the signal handler if this
syscall is filtered out.

But, unlike sys_uretprobe(), sys_rt_sigreturn() is old, so the existing
setups must know that sigreturn() should be respected...

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 20:39             ` Oleg Nesterov
@ 2025-01-14 21:45               ` Andrii Nakryiko
  2025-01-14 22:10                 ` Oleg Nesterov
  2025-01-17 11:41                 ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Andrii Nakryiko @ 2025-01-14 21:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

On Tue, Jan 14, 2025 at 12:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/14, Andrii Nakryiko wrote:
> >
> > Should we just fix whoever is blocking kernel-internal special syscall
> > (sys_uretprobe)?
>
> Well, we can add __NR_uretprobe to mode1_syscalls[] but this won't
> really help.
>
> We can't "fix" the existing user-space setups which can nack any
> "unnecessary/unknown" syscall.
>
> > What would happen if someone blocked that other
> > special kernel-internal syscall for signal handling (can't remember
> > the name,
>
> sys_rt_sigreturn().
>
> Yes, the task will crash after return from the signal handler if this
> syscall is filtered out.
>
> But, unlike sys_uretprobe(), sys_rt_sigreturn() is old, so the existing
> setups must know that sigreturn() should be respected...

someday sys_uretprobe will be old as well ;) FWIW, systemd allowlisted
sys_uretprobe, see [0]

  [0] https://github.com/systemd/systemd/issues/34615#issuecomment-2406761451

>
> Oleg.
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 21:45               ` Andrii Nakryiko
@ 2025-01-14 22:10                 ` Oleg Nesterov
  2025-01-14 23:52                   ` Andrii Nakryiko
  2025-01-17 11:41                 ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-14 22:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

On 01/14, Andrii Nakryiko wrote:
>
> On Tue, Jan 14, 2025 at 12:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But, unlike sys_uretprobe(), sys_rt_sigreturn() is old, so the existing
> > setups must know that sigreturn() should be respected...
>
> someday sys_uretprobe will be old as well ;) FWIW, systemd allowlisted
> sys_uretprobe, see [0]

And I agree! ;)

I mean, I'd personally prefer to do nothing and wait until userspace figures
out that we have another "special" syscall.

But can we do it? I simply do not know. Can we ignore this (valid) bug report?

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 22:10                 ` Oleg Nesterov
@ 2025-01-14 23:52                   ` Andrii Nakryiko
  2025-01-15  0:09                     ` Eyal Birger
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2025-01-14 23:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt@goodmis.org, rafi,
	Shmulik Ladkani

On Tue, Jan 14, 2025 at 2:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/14, Andrii Nakryiko wrote:
> >
> > On Tue, Jan 14, 2025 at 12:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But, unlike sys_uretprobe(), sys_rt_sigreturn() is old, so the existing
> > > setups must know that sigreturn() should be respected...
> >
> > someday sys_uretprobe will be old as well ;) FWIW, systemd allowlisted
> > sys_uretprobe, see [0]
>
> And I agree! ;)
>
> I mean, I'd personally prefer to do nothing and wait until userspace figures
> out that we have another "special" syscall.
>
> But can we do it? I simply do not know. Can we ignore this (valid) bug report?
>

Seems wrong for kernel to try to guess whether some syscall is
filtered by some policy or not (though maybe I'm misunderstanding the
details and it's kernel-originated problem?). Seems like a recipe for
more problems.

Nothing is really fundamentally broken. Some piece of software needs
an upgraded library to not disable the kernel's special syscall (just
like sys_rt_sigreturn, nothing "new" here, really). Users can't do
uprobing in such broken setups (but not in general), seems like a good
incentive for everyone to push for the right thing here: fixed up to
date software.

> Oleg.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 23:52                   ` Andrii Nakryiko
@ 2025-01-15  0:09                     ` Eyal Birger
  2025-01-15  0:50                       ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-15  0:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Jiri Olsa, Sarai Aleksa, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt, rafi,
	Shmulik Ladkani

Hi,

> On Jan 14, 2025, at 15:52, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Jan 14, 2025 at 2:11 PM Oleg Nesterov <oleg@redhat.com> wrote:
>> 
>>> On 01/14, Andrii Nakryiko wrote:
>>> 
>>> On Tue, Jan 14, 2025 at 12:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
>>>> 
>>>> But, unlike sys_uretprobe(), sys_rt_sigreturn() is old, so the existing
>>>> setups must know that sigreturn() should be respected...
>>> 
>>> someday sys_uretprobe will be old as well ;) FWIW, systemd allowlisted
>>> sys_uretprobe, see [0]
>> 
>> And I agree! ;)
>> 
>> I mean, I'd personally prefer to do nothing and wait until userspace figures
>> out that we have another "special" syscall.
>> 
>> But can we do it? I simply do not know. Can we ignore this (valid) bug report?
>> 
> 
> Seems wrong for kernel to try to guess whether some syscall is
> filtered by some policy or not (though maybe I'm misunderstanding the
> details and it's kernel-originated problem?). Seems like a recipe for
> more problems.
> 
> Nothing is really fundamentally broken. Some piece of software needs
> an upgraded library to not disable the kernel's special syscall (just
> like sys_rt_sigreturn, nothing "new" here, really). Users can't do
> uprobing in such broken setups (but not in general), seems like a good
> incentive for everyone to push for the right thing here: fixed up to
> date software.

It’s not “users” doing the uprobing in this case.
Its software, that’s working fine in previous kernel versions and upon upgrade starts creating crashes in other processes.

IMHO demanding that other software (e.g docker) be upgraded in order to run on a newer kernel is not what Linux formerly guaranteed.

Eyal
> 
>> Oleg.
>> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  0:09                     ` Eyal Birger
@ 2025-01-15  0:50                       ` Oleg Nesterov
  2025-01-15  5:45                         ` Shmulik Ladkani
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-15  0:50 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Andrii Nakryiko, Jiri Olsa, Sarai Aleksa, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, rostedt, rafi,
	Shmulik Ladkani

On 01/14, Eyal Birger wrote:
>
> Its software, that’s working fine in previous kernel versions and upon
> upgrade starts creating crashes in other processes.
>
> IMHO demanding that other software (e.g docker) be upgraded in order to
> run on a newer kernel is not what Linux formerly guaranteed.

Agreed.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  0:50                       ` Oleg Nesterov
@ 2025-01-15  5:45                         ` Shmulik Ladkani
  2025-01-15 15:51                           ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Shmulik Ladkani @ 2025-01-15  5:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eyal Birger, Andrii Nakryiko, Jiri Olsa, Sarai Aleksa, mhiramat,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, peterz, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, rostedt,
	rafi

On Wed, 15 Jan 2025 01:50:13 +0100 Oleg Nesterov <oleg@redhat.com>
wrote:

> On 01/14, Eyal Birger wrote:
> >
> > Its software, that’s working fine in previous kernel versions and
> > upon upgrade starts creating crashes in other processes.
> >
> > IMHO demanding that other software (e.g docker) be upgraded in
> > order to run on a newer kernel is not what Linux formerly
> > guaranteed.  
> 
> Agreed.

IMO There are 2 problematic aspects with ff474a78cef5
("uprobe: Add uretprobe syscall to speed up return probe").

The first, as Eyal mentioned, is the kernel regression: There are
endless systems out there (iaas and paas) that have both
telementry/instrumentation/tracing software (utilizing uprobes) and
container environments (duch as docker) that enforce syscall
restrictions on their workloads.
These systems worked so far, and with kernels having ff474a78cef5 the
workloads processes fault.

The second, is the fact that ff474a78cef5 (which adds a new syscall
invocation to the uretprobe trampoline) *exposes an internal kernel
implementation* to the userspace system:

There are millions of binaries/libraries out there that *never issue*
the new syscall: they simply do not have that call in their
instructions. Take for example hello-world.

However, once hello-world is traced (with software utilizing
uprobes) hello-world *unknowingly* DO issue the new syscall, just
because the kernel decided to implement its uretprobe trampoline using
a new syscall - a mechanism that should be completely transparent and
seamless to the user program.

This is totally unexpected, and to ask a system admin to "guess" whether
hello-world is "going to issue the syscall despite the fact that
such invocation does not exist in its own code at all" (and set seccomp
permissions accordingly) is asking for the admin to know the exact
*internal mechanisms* that the kernel use for implemeting the
trampolines.

Just like we won't add a div-by-zero fault to the trampoline, we
shoudn't add any instruction (such as a syscall) that isn't *completely
transparent* to the userspace program.

Best,
    Shmulik

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 17:25             ` Oleg Nesterov
@ 2025-01-15  9:36               ` Jiri Olsa
  2025-01-15 13:24                 ` Eyal Birger
                                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Jiri Olsa @ 2025-01-15  9:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Eyal Birger, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 06:25:20PM +0100, Oleg Nesterov wrote:
> On 01/14, Jiri Olsa wrote:
> >
> > ugh.. could we just 'disable' uretprobe trampoline when seccomp gets enabled?
> > overwrite first byte with int3.. and similarly check on seccomp when installing
> > uretprobe and switch to int3
> 
> Sorry, I don't understand... What exactly we can do? Aside from checking
> IS_ENABLED(CONFIG_SECCOMP) in arch_uprobe_trampoline() ?

I need to check more on seccomp, but I imagine we could do following:
  - when seccomp filter is installed we could check uprobe trampoline
    and if it's already installed we change it to int3 trampoline
  - when uprobe trampoline is getting installed we check if there's
    seccomp filter installed for task and we use int3 trampoline

other than that I guess we will have to add sysctl to enable uretprobe
trampoline..

jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  9:36               ` Jiri Olsa
@ 2025-01-15 13:24                 ` Eyal Birger
  2025-01-15 13:25                 ` Jiri Olsa
  2025-01-15 15:06                 ` Oleg Nesterov
  2 siblings, 0 replies; 47+ messages in thread
From: Eyal Birger @ 2025-01-15 13:24 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

Hi,

On Wed, Jan 15, 2025 at 1:36 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 06:25:20PM +0100, Oleg Nesterov wrote:
> > On 01/14, Jiri Olsa wrote:
> > >
> > > ugh.. could we just 'disable' uretprobe trampoline when seccomp gets enabled?
> > > overwrite first byte with int3.. and similarly check on seccomp when installing
> > > uretprobe and switch to int3
> >
> > Sorry, I don't understand... What exactly we can do? Aside from checking
> > IS_ENABLED(CONFIG_SECCOMP) in arch_uprobe_trampoline() ?
>
> I need to check more on seccomp, but I imagine we could do following:
>   - when seccomp filter is installed we could check uprobe trampoline
>     and if it's already installed we change it to int3 trampoline
>   - when uprobe trampoline is getting installed we check if there's
>     seccomp filter installed for task and we use int3 trampoline

Sounds reasonable to me.
I'm wondering how hard it is to figure out the seccomp installation
given that from what I understand it's inherited.

>
> other than that I guess we will have to add sysctl to enable uretprobe
> trampoline..

I'm wondering when one would enable/disable such sysctl.
"Give me speed but potentially crash processes I don't control"
is a curious semantic...

Eyal

>
> jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  9:36               ` Jiri Olsa
  2025-01-15 13:24                 ` Eyal Birger
@ 2025-01-15 13:25                 ` Jiri Olsa
  2025-01-15 15:06                 ` Oleg Nesterov
  2 siblings, 0 replies; 47+ messages in thread
From: Jiri Olsa @ 2025-01-15 13:25 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Eyal Birger, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Wed, Jan 15, 2025 at 10:36:51AM +0100, Jiri Olsa wrote:
> On Tue, Jan 14, 2025 at 06:25:20PM +0100, Oleg Nesterov wrote:
> > On 01/14, Jiri Olsa wrote:
> > >
> > > ugh.. could we just 'disable' uretprobe trampoline when seccomp gets enabled?
> > > overwrite first byte with int3.. and similarly check on seccomp when installing
> > > uretprobe and switch to int3
> > 
> > Sorry, I don't understand... What exactly we can do? Aside from checking
> > IS_ENABLED(CONFIG_SECCOMP) in arch_uprobe_trampoline() ?
> 
> I need to check more on seccomp, but I imagine we could do following:
>   - when seccomp filter is installed we could check uprobe trampoline
>     and if it's already installed we change it to int3 trampoline

nah, other threads could be in there already.. :-\

jirka

>   - when uprobe trampoline is getting installed we check if there's
>     seccomp filter installed for task and we use int3 trampoline
> 
> other than that I guess we will have to add sysctl to enable uretprobe
> trampoline..
> 
> jirka

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  9:36               ` Jiri Olsa
  2025-01-15 13:24                 ` Eyal Birger
  2025-01-15 13:25                 ` Jiri Olsa
@ 2025-01-15 15:06                 ` Oleg Nesterov
  2025-01-15 17:56                   ` Alexei Starovoitov
  2 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-15 15:06 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Eyal Birger, Aleksa Sarai, mhiramat, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/15, Jiri Olsa wrote:
>
> On Tue, Jan 14, 2025 at 06:25:20PM +0100, Oleg Nesterov wrote:
> >
> > Sorry, I don't understand... What exactly we can do? Aside from checking
> > IS_ENABLED(CONFIG_SECCOMP) in arch_uprobe_trampoline() ?
>
> I need to check more on seccomp, but I imagine we could do following:
>   - when seccomp filter is installed we could check uprobe trampoline
>     and if it's already installed we change it to int3 trampoline
>   - when uprobe trampoline is getting installed we check if there's
>     seccomp filter installed for task and we use int3 trampoline

I still don't understand... But whatever you meant, I doubt it can work.

> other than that I guess we will have to add sysctl to enable uretprobe
> trampoline..

Or we can change __secure_computing() to do nothing if
this_syscall == __NR_uretprobe. Or even change syscall_trace_enter/exit
to do this check.

But I don't really like this idea, I don't feel this is the right solution...

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15  5:45                         ` Shmulik Ladkani
@ 2025-01-15 15:51                           ` Oleg Nesterov
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-15 15:51 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Eyal Birger, Andrii Nakryiko, Jiri Olsa, Sarai Aleksa, mhiramat,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, peterz, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, rostedt,
	rafi

On 01/15, Shmulik Ladkani wrote:
>
> IMO There are 2 problematic aspects with ff474a78cef5
> ("uprobe: Add uretprobe syscall to speed up return probe").
>
> The first, as Eyal mentioned, is the kernel regression: There are
> endless systems out there (iaas and paas) that have both
> telementry/instrumentation/tracing software (utilizing uprobes) and
> container environments (duch as docker) that enforce syscall
> restrictions on their workloads.
> These systems worked so far, and with kernels having ff474a78cef5 the
> workloads processes fault.

Again, I have to agree. The kernel should not break userspace.

But,

> The second, is the fact that ff474a78cef5 (which adds a new syscall
> invocation to the uretprobe trampoline) *exposes an internal kernel
> implementation* to the userspace system:

I disagree...

> There are millions of binaries/libraries out there that *never issue*
> the new syscall: they simply do not have that call in their
> instructions. Take for example hello-world.

And they should never use this syscall,

> However, once hello-world is traced (with software utilizing
> uprobes) hello-world *unknowingly* DO issue the new syscall, just
> because the kernel decided to implement its uretprobe trampoline using
> a new syscall - a mechanism that should be completely transparent and
> seamless to the user program.

IMO, sys_uretprobe() doesn't really differ from sys_sigreturn() in this
respect.

> This is totally unexpected, and to ask a system admin to "guess" whether
> hello-world is "going to issue the syscall despite the fact that
> such invocation does not exist in its own code at all" (and set seccomp
> permissions accordingly) is asking for the admin to know the exact
> *internal mechanisms* that the kernel use for implemeting the
> trampolines.

Well, man 2 uretprobe can help ;)

> we
> shoudn't add any instruction (such as a syscall) that isn't *completely
> transparent* to the userspace program.

We can't make it *completely transparent*, but it is easy to hide this
syscall from seccomp (and/or ptrace).

And this will fix the problem. But I don't feel this is the right solution.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 15:06                 ` Oleg Nesterov
@ 2025-01-15 17:56                   ` Alexei Starovoitov
  2025-01-15 18:20                     ` Andrii Nakryiko
  2025-01-15 18:40                     ` Oleg Nesterov
  0 siblings, 2 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2025-01-15 17:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Eyal Birger, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Wed, Jan 15, 2025 at 7:06 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Or we can change __secure_computing() to do nothing if
> this_syscall == __NR_uretprobe.

I think that's the best way forward.
seccomp already allowlists sigreturn syscall.
uretprobe syscall is in the same category.
See __secure_computing_strict.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 17:56                   ` Alexei Starovoitov
@ 2025-01-15 18:20                     ` Andrii Nakryiko
  2025-01-15 18:40                     ` Oleg Nesterov
  1 sibling, 0 replies; 47+ messages in thread
From: Andrii Nakryiko @ 2025-01-15 18:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Oleg Nesterov, Jiri Olsa, Eyal Birger, Aleksa Sarai,
	Masami Hiramatsu, linux-kernel, linux-trace-kernel, BPF-dev-list,
	Song Liu, Yonghong Song, John Fastabend, Peter Zijlstra,
	Thomas Gleixner, Borislav Petkov, X86 ML, Linux API,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Wed, Jan 15, 2025 at 9:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 7:06 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Or we can change __secure_computing() to do nothing if
> > this_syscall == __NR_uretprobe.
>
> I think that's the best way forward.
> seccomp already allowlists sigreturn syscall.
> uretprobe syscall is in the same category.

+1, we will have a similar problem with sys_uprobe (when it's added).
Just like rt_sigreturn, these are special kernel-only mechanisms, and
the kernel already protects itself from any user abuse. So I think we
should have a way to ensure those special syscalls can go through
regardless of seccomp.

> See __secure_computing_strict.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 17:56                   ` Alexei Starovoitov
  2025-01-15 18:20                     ` Andrii Nakryiko
@ 2025-01-15 18:40                     ` Oleg Nesterov
  2025-01-15 18:48                       ` Eyal Birger
  1 sibling, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-15 18:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Eyal Birger, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/15, Alexei Starovoitov wrote:
>
> On Wed, Jan 15, 2025 at 7:06 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Or we can change __secure_computing() to do nothing if
> > this_syscall == __NR_uretprobe.
>
> I think that's the best way forward.
> seccomp already allowlists sigreturn syscall.

Only if SECCOMP_MODE_STRICT. But it won't help if we add __NR_uretprobe
into into mode1_syscalls/mode1_syscalls_32.

SECCOMP_MODE_FILTER can do anything. Just I guess nobody tries to offend
sigreturn for obvious reasons.

But yes, perhaps we do not have a better solution.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 18:40                     ` Oleg Nesterov
@ 2025-01-15 18:48                       ` Eyal Birger
  2025-01-15 19:03                         ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-15 18:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Wed, Jan 15, 2025 at 10:40 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/15, Alexei Starovoitov wrote:
> >
> > On Wed, Jan 15, 2025 at 7:06 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > Or we can change __secure_computing() to do nothing if
> > > this_syscall == __NR_uretprobe.
> >
> > I think that's the best way forward.
> > seccomp already allowlists sigreturn syscall.
>
> Only if SECCOMP_MODE_STRICT. But it won't help if we add __NR_uretprobe
> into into mode1_syscalls/mode1_syscalls_32.
>
> SECCOMP_MODE_FILTER can do anything. Just I guess nobody tries to offend
> sigreturn for obvious reasons.
>
> But yes, perhaps we do not have a better solution.
>

Indeed - doing the check in __secure_computing_strict() doesn't seem to be
enough.

In __secure_computing(), i.e. the below hack it works.

Eyal.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 385d48293a5f..5739482036ce 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1359,6 +1359,9 @@ int __secure_computing(const struct seccomp_data *sd)
        this_syscall = sd ? sd->nr :
                syscall_get_nr(current, current_pt_regs());

+       if (this_syscall == __NR_uretprobe)
+               return 0;
+
        switch (mode) {
        case SECCOMP_MODE_STRICT:
                __secure_computing_strict(this_syscall);  /* may call do_exit */

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 18:48                       ` Eyal Birger
@ 2025-01-15 19:03                         ` Oleg Nesterov
  2025-01-15 21:14                           ` Eyal Birger
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-15 19:03 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/15, Eyal Birger wrote:
>
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1359,6 +1359,9 @@ int __secure_computing(const struct seccomp_data *sd)
>         this_syscall = sd ? sd->nr :
>                 syscall_get_nr(current, current_pt_regs());
>
> +       if (this_syscall == __NR_uretprobe)
> +               return 0;
> +

Yes, this is what I meant. But we need the new arch-dependent helper.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 19:03                         ` Oleg Nesterov
@ 2025-01-15 21:14                           ` Eyal Birger
  2025-01-16 14:39                             ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-15 21:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Wed, Jan 15, 2025 at 11:03 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/15, Eyal Birger wrote:
> >
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1359,6 +1359,9 @@ int __secure_computing(const struct seccomp_data *sd)
> >         this_syscall = sd ? sd->nr :
> >                 syscall_get_nr(current, current_pt_regs());
> >
> > +       if (this_syscall == __NR_uretprobe)
> > +               return 0;
> > +
>
> Yes, this is what I meant. But we need the new arch-dependent helper.

Do you mean because __NR_uretprobe is not defined for other architectures?
Is there an existing helper? I wasn't able to find one...

If not, would it just make sense to just wrap this check in
#ifdef __NR_uretprobe ?

Eyal.

>
> Oleg.
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-15 21:14                           ` Eyal Birger
@ 2025-01-16 14:39                             ` Oleg Nesterov
  2025-01-16 14:47                               ` Eyal Birger
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-16 14:39 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/15, Eyal Birger wrote:
>
> On Wed, Jan 15, 2025 at 11:03 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 01/15, Eyal Birger wrote:
> > >
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -1359,6 +1359,9 @@ int __secure_computing(const struct seccomp_data *sd)
> > >         this_syscall = sd ? sd->nr :
> > >                 syscall_get_nr(current, current_pt_regs());
> > >
> > > +       if (this_syscall == __NR_uretprobe)
> > > +               return 0;
> > > +
> >
> > Yes, this is what I meant. But we need the new arch-dependent helper.
>
> Do you mean because __NR_uretprobe is not defined for other architectures?

Yes, and see below,

> Is there an existing helper? I wasn't able to find one...

No,

> If not, would it just make sense to just wrap this check in
> #ifdef __NR_uretprobe ?

Given that we need a simple fix for -stable, I won't argue.
Up to seccomp maintainers.

But please note that this_syscall == __NR_uretprobe can be false
positive if is_compat_task().

__NR_uretprobe == __NR_ia32_rt_tgsigqueueinfo, so I guess we need

	#ifdef CONFIG_X86_64
		if (this_syscall == __NR_uretprobe && !in_ia32_syscall())
			return 0;
	#endif

I don't think we need to worry about the X86_X32 tasks...

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-16 14:39                             ` Oleg Nesterov
@ 2025-01-16 14:47                               ` Eyal Birger
  2025-01-16 15:31                                 ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-16 14:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Thu, Jan 16, 2025 at 6:40 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/15, Eyal Birger wrote:
> >
> > On Wed, Jan 15, 2025 at 11:03 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > On 01/15, Eyal Birger wrote:
> > > >
> > > > --- a/kernel/seccomp.c
> > > > +++ b/kernel/seccomp.c
> > > > @@ -1359,6 +1359,9 @@ int __secure_computing(const struct seccomp_data *sd)
> > > >         this_syscall = sd ? sd->nr :
> > > >                 syscall_get_nr(current, current_pt_regs());
> > > >
> > > > +       if (this_syscall == __NR_uretprobe)
> > > > +               return 0;
> > > > +
> > >
> > > Yes, this is what I meant. But we need the new arch-dependent helper.
> >
> > Do you mean because __NR_uretprobe is not defined for other architectures?
>
> Yes, and see below,
>
> > Is there an existing helper? I wasn't able to find one...
>
> No,
>
> > If not, would it just make sense to just wrap this check in
> > #ifdef __NR_uretprobe ?
>
> Given that we need a simple fix for -stable, I won't argue.
> Up to seccomp maintainers.
>
> But please note that this_syscall == __NR_uretprobe can be false
> positive if is_compat_task().
>
> __NR_uretprobe == __NR_ia32_rt_tgsigqueueinfo, so I guess we need
>
>         #ifdef CONFIG_X86_64
>                 if (this_syscall == __NR_uretprobe && !in_ia32_syscall())
>                         return 0;
>         #endif
>
> I don't think we need to worry about the X86_X32 tasks...
Ack. I agree.

Do you want to send a formal patch, or should I?

Eyal.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-16 14:47                               ` Eyal Birger
@ 2025-01-16 15:31                                 ` Oleg Nesterov
  2025-01-16 17:11                                   ` Eyal Birger
  0 siblings, 1 reply; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-16 15:31 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/16, Eyal Birger wrote:
>
> Ack. I agree.
>
> Do you want to send a formal patch, or should I?

Please send the patch ;)

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-16 15:31                                 ` Oleg Nesterov
@ 2025-01-16 17:11                                   ` Eyal Birger
  2025-01-17  0:48                                     ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Eyal Birger @ 2025-01-16 17:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Thu, Jan 16, 2025 at 7:31 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/16, Eyal Birger wrote:
> >
> > Ack. I agree.
> >
> > Do you want to send a formal patch, or should I?
>
> Please send the patch ;)

Will do. If it's ok I'll put you in a "Suggested-by" tag (or
co-developed - as you see fit).

Eyal.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-16 17:11                                   ` Eyal Birger
@ 2025-01-17  0:48                                     ` Oleg Nesterov
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-17  0:48 UTC (permalink / raw)
  To: Eyal Birger
  Cc: Alexei Starovoitov, Jiri Olsa, Aleksa Sarai, Masami Hiramatsu,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, X86 ML, Linux API, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/16, Eyal Birger wrote:
>
> Will do. If it's ok I'll put you in a "Suggested-by" tag (or
> co-developed - as you see fit).

Heh ;) thanks Eyal but I don't care at all. This doesn't matter.
What does matter is that you have found/investigated the problem.

So if you agree with this change, just send the patch and I'll
send my acked-by in reply.

Thanks ;)

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 14:21           ` Jiri Olsa
@ 2025-01-17  1:23             ` Masami Hiramatsu
  2025-01-17  1:57               ` Oleg Nesterov
  0 siblings, 1 reply; 47+ messages in thread
From: Masami Hiramatsu @ 2025-01-17  1:23 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Masami Hiramatsu, Aleksa Sarai, Eyal Birger,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, peterz, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, 14 Jan 2025 15:21:29 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> On Tue, Jan 14, 2025 at 12:21:07PM +0100, Oleg Nesterov wrote:
> > On 01/14, Masami Hiramatsu wrote:
> > >
> > > On Tue, 14 Jan 2025 10:22:20 +0100
> > > Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > > @@ -418,6 +439,9 @@ SYSCALL_DEFINE0(uretprobe)
> > > >  	regs->r11 = regs->flags;
> > > >  	regs->cx  = regs->ip;
> > > >
> > > > +	/* zero rbx to signal trampoline that uretprobe syscall was executed */
> > > > +	regs->bx  = 0;
> > >
> > > Can we just return -ENOSYS as like as other syscall instead of
> > > using rbx as a side channel?
> > > We can carefully check the return address is not -ERRNO when set up
> > > and reserve the -ENOSYS for this use case.
> > 
> > Not sure I understand...
> > 
> > But please not that the uretprobed function can return any value
> > including -ENOSYS, and this is what sys_uretprobe() has to return.
> 
> right, uretprobe syscall returns value of the uretprobed function,
> so we can't use any reserved value

We can make uretprobe (entry) fail if the return value is one of
errno or NULL, because it *knows* what the return value here.

Thank you,

> 
> jirka


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-17  1:23             ` Masami Hiramatsu
@ 2025-01-17  1:57               ` Oleg Nesterov
  0 siblings, 0 replies; 47+ messages in thread
From: Oleg Nesterov @ 2025-01-17  1:57 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jiri Olsa, Aleksa Sarai, Eyal Birger, linux-kernel,
	linux-trace-kernel, BPF-dev-list, Song Liu, Yonghong Song,
	John Fastabend, peterz, tglx, bp, x86, linux-api, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On 01/17, Masami Hiramatsu wrote:
>
> On Tue, 14 Jan 2025 15:21:29 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > On Tue, Jan 14, 2025 at 12:21:07PM +0100, Oleg Nesterov wrote:
> > >
> > > But please not that the uretprobed function can return any value
> > > including -ENOSYS, and this is what sys_uretprobe() has to return.
> >
> > right, uretprobe syscall returns value of the uretprobed function,
> > so we can't use any reserved value
>
> We can make uretprobe (entry) fail if the return value is one of
> errno or NULL, because it *knows* what the return value here.

I fail to understand... Could you spell please?

But whatever you meant, I don't think it is right, sorry.. please
correct me.

"it *knows*". Who it? How can it know??? I'd say "it" can't know,
but probably I missed your point.

Not to mention it doesn't really matter. It is not safe to even try
to do

	movq $__NR_uretprobe, %rax
	syscall

if the probed task has seccomp filters.

Oleg.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-14 21:45               ` Andrii Nakryiko
  2025-01-14 22:10                 ` Oleg Nesterov
@ 2025-01-17 11:41                 ` Peter Zijlstra
  2025-01-17 17:53                   ` Andrii Nakryiko
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-01-17 11:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Tue, Jan 14, 2025 at 01:45:11PM -0800, Andrii Nakryiko wrote:

> someday sys_uretprobe will be old as well

And then we'll delete it because everybody that cares about performance
will have FRED on ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: Crash when attaching uretprobes to processes running in Docker
  2025-01-17 11:41                 ` Peter Zijlstra
@ 2025-01-17 17:53                   ` Andrii Nakryiko
  0 siblings, 0 replies; 47+ messages in thread
From: Andrii Nakryiko @ 2025-01-17 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Jiri Olsa, Aleksa Sarai, Eyal Birger, mhiramat,
	linux-kernel, linux-trace-kernel, BPF-dev-list, Song Liu,
	Yonghong Song, John Fastabend, tglx, bp, x86, linux-api,
	Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	rostedt@goodmis.org, rafi, Shmulik Ladkani

On Fri, Jan 17, 2025 at 3:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jan 14, 2025 at 01:45:11PM -0800, Andrii Nakryiko wrote:
>
> > someday sys_uretprobe will be old as well
>
> And then we'll delete it because everybody that cares about performance
> will have FRED on ;-)

with "then" sufficiently far into the future, sure :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2025-01-17 17:53 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-10 15:12 Crash when attaching uretprobes to processes running in Docker Eyal Birger
2025-01-10 15:25 ` Aleksa Sarai
2025-01-11 18:40   ` Jiri Olsa
2025-01-14  9:22     ` Jiri Olsa
2025-01-14 10:05       ` Masami Hiramatsu
2025-01-14 11:21         ` Oleg Nesterov
2025-01-14 14:21           ` Jiri Olsa
2025-01-17  1:23             ` Masami Hiramatsu
2025-01-17  1:57               ` Oleg Nesterov
2025-01-14 10:42       ` Peter Zijlstra
2025-01-14 11:01         ` Oleg Nesterov
2025-01-14 12:02           ` Peter Zijlstra
2025-01-14 12:32             ` Oleg Nesterov
2025-01-14 14:07               ` Peter Zijlstra
2025-01-14 17:43                 ` Oleg Nesterov
2025-01-14 10:58       ` Oleg Nesterov
2025-01-14 14:19         ` Jiri Olsa
2025-01-14 19:21           ` Andrii Nakryiko
2025-01-14 20:39             ` Oleg Nesterov
2025-01-14 21:45               ` Andrii Nakryiko
2025-01-14 22:10                 ` Oleg Nesterov
2025-01-14 23:52                   ` Andrii Nakryiko
2025-01-15  0:09                     ` Eyal Birger
2025-01-15  0:50                       ` Oleg Nesterov
2025-01-15  5:45                         ` Shmulik Ladkani
2025-01-15 15:51                           ` Oleg Nesterov
2025-01-17 11:41                 ` Peter Zijlstra
2025-01-17 17:53                   ` Andrii Nakryiko
2025-01-14 14:08       ` Eyal Birger
2025-01-14 14:33         ` Oleg Nesterov
2025-01-14 14:56           ` Jiri Olsa
2025-01-14 17:25             ` Oleg Nesterov
2025-01-15  9:36               ` Jiri Olsa
2025-01-15 13:24                 ` Eyal Birger
2025-01-15 13:25                 ` Jiri Olsa
2025-01-15 15:06                 ` Oleg Nesterov
2025-01-15 17:56                   ` Alexei Starovoitov
2025-01-15 18:20                     ` Andrii Nakryiko
2025-01-15 18:40                     ` Oleg Nesterov
2025-01-15 18:48                       ` Eyal Birger
2025-01-15 19:03                         ` Oleg Nesterov
2025-01-15 21:14                           ` Eyal Birger
2025-01-16 14:39                             ` Oleg Nesterov
2025-01-16 14:47                               ` Eyal Birger
2025-01-16 15:31                                 ` Oleg Nesterov
2025-01-16 17:11                                   ` Eyal Birger
2025-01-17  0:48                                     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).