* [LSF/MM/BPF TOPIC] faster uprobes
@ 2024-02-29 14:39 Jiri Olsa
2024-03-01 0:25 ` Andrii Nakryiko
2024-03-01 19:39 ` Kui-Feng Lee
0 siblings, 2 replies; 31+ messages in thread
From: Jiri Olsa @ 2024-02-29 14:39 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song,
Oleg Nesterov, Daniel Borkmann
One of the uprobe pain points is slow execution: the worst case involves
two traps, or a single trap if the original instruction can be emulated.
Return uprobes add one extra trap on top of that.

My current idea on how to make this faster is to follow the optimized
kprobes approach and replace the normal uprobe trap instruction with a
jump to a user space trampoline that:

- executes a syscall to call the uprobe consumer callbacks
- executes the original instructions
- jumps back to continue with the original code

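For illustration, a rough pseudo-assembly sketch of what such a per-probe
trampoline could look like on x86-64 — the syscall number, its argument,
and the overall layout are hypothetical, not an existing ABI:

```
uprobe_trampoline:                 # reached via jmp patched over the probed insn
    <save registers>               # preserve application state
    mov  rdi, <uprobe cookie>      # hypothetical argument identifying the uprobe
    mov  rax, <NR_uprobe_exec>     # hypothetical syscall running consumer callbacks
    syscall
    <restore registers>
    <original instruction(s)>      # out-of-line copy, fixed up if rip-relative
    jmp  <probe address + length>  # continue with the original code
```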
There are of course corner cases where the above will have trouble or
won't work at all, like:

- executing the original instructions in the trampoline is tricky wrt
  rip-relative addressing

- some instructions we can't move to the trampoline at all

- the uprobe address is on a page boundary, so the jump instruction to
  the trampoline would span 2 pages, hence the page replacement won't
  be atomic, which might cause issues

- ... ? many others I'm sure

Still, with all the limitations, I think we should be able to speed up a
fair number of uprobes, which seems worth doing.

I'd like to have a discussion on the topic and reach some agreement or
direction on how this should be done.
^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-01 0:25 UTC (permalink / raw)
To: Jiri Olsa
Cc: bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov,
	Daniel Borkmann

On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> One of uprobe pain points is having slow execution that involves
> two traps in worst case scenario or single trap if the original
> instruction can be emulated. For return uprobes there's one extra
> trap on top of that.
>
> My current idea on how to make this faster is to follow the optimized
> kprobes and replace the normal uprobe trap instruction with jump to
> user space trampoline that:
>
> - executes syscall to call uprobe consumers callbacks

Did you get a chance to measure relative performance of syscall vs
int3 interrupt handling? If not, do you think you'll be able to get
some numbers by the time the conference starts? This should inform the
decision whether it even makes sense to go through all the trouble.

> - executes original instructions
> - jumps back to continue with the original code
>
> [...]
>
> Still with all the limitations I think we could be able to speed up
> some amount of the uprobes, which seems worth doing.
>
> I'd like to have the discussion on the topic and get some agreement
> or directions on how this should be done.
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Jiri Olsa @ 2024-03-01 8:18 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song,
	Oleg Nesterov, Daniel Borkmann

On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > [...]
> >
> > - executes syscall to call uprobe consumers callbacks
>
> Did you get a chance to measure relative performance of syscall vs
> int3 interrupt handling? If not, do you think you'll be able to get
> some numbers by the time the conference starts? This should inform the
> decision whether it even makes sense to go through all the trouble.

right, will do that

jirka

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Alexei Starovoitov @ 2024-03-01 17:01 UTC (permalink / raw)
To: Jiri Olsa, yunwei356
Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song,
	Oleg Nesterov, Daniel Borkmann

On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > [...]
> > Did you get a chance to measure relative performance of syscall vs
> > int3 interrupt handling? [...]
>
> right, will do that

I believe Yusheng measured syscall vs uprobe performance
difference during LPC. iirc it was something like 3x.

Certainly necessary to have a benchmark.
selftests/bpf/bench has one for uprobe.
Probably should extend it with sys_bpf.

Regarding:
> replace the normal uprobe trap instruction with jump to
> user space trampoline

it should probably be a call to the trampoline instead of a jump,
unless you plan to generate a different trampoline for every location?

Also, how would you pick space for a trampoline in the target process?
Analyze /proc/pid/maps and look for gaps in executable sections?

We can start simple with a USDT that uses nop5 instead of nop1
and an explicit single trampoline for all USDT locations
that saves all (callee and caller saved) registers and
then does sys_bpf with a new cmd.

To replace nop5 with a call to the trampoline we can use the
text_poke_bp approach: replace the 1st byte with int3, replace bytes
2-5 with the target addr, then replace the 1st byte to make an actual
call insn.

Once patched there will be no simulation of insns or kernel traps.
Just normal user code that calls into the trampoline, which calls
sys_bpf, and returns back.
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-01 17:26 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Jiri Olsa, yunwei356, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
> I believe Yusheng measured syscall vs uprobe performance
> difference during LPC. iirc it was something like 3x.

Do you have a link to slides? Was it actual uprobe vs just some fast
syscall (not doing BPF program execution) comparison? Or comparing the
performance of int3 handling vs equivalent syscall handling?

I suspect it's the former, and so probably not that representative.
I'm curious about the performance of going
userspace->kernel->userspace through int3 vs syscall (all other things
being equal).

> [...]
> Also how would you pick a space for a trampoline in the target process ?
> Analyze /proc/pid/maps and look for gaps in executable sections?

The kernel already does that for uretprobes: it adds a new "[uprobes]"
memory mapping, so this part is already implemented.

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Yunwei 123 @ 2024-03-01 18:08 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

Hi!

I did some basic experiments on bpftime, combining the user space
trampoline in bpftime with a bpf_prog_test_run syscall to run eBPF
code in the kernel. On my laptop, it was about 2-3x faster than the
original trap-based uprobe.

The experiment code is in
https://github.com/eunomia-bpf/bpftime/blob/71f13ae80e93e8ff45e1b0320c25ff14cb25b4ba/runtime/src/bpftime_prog.cpp#L113
(that's just a PoC, not kernel patches).

On Fri, Mar 1, 2024 at 5:27 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Jiri Olsa @ 2024-03-03 10:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Jiri Olsa, yunwei356, bpf, Alexei Starovoitov,
	lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> [...]
> I suspect it's the former, and so probably not that representative.
> I'm curious about the performance of going
> userspace->kernel->userspace through int3 vs syscall (all other things
> being equal).

I have a simple test [1] comparing:
- uprobe with 2 traps
- uprobe with 1 trap
- syscall executing uprobe

the syscall takes the uprobe address as argument, finds the uprobe and
executes its consumers, which should be comparable to what the
trampoline will do

the test does the same amount of loops triggering each uprobe type and
measures the time it took

  # ./test_progs -t uprobe_syscall_bench -v
  bpf_testmod.ko is already unloaded.
  Loading bpf_testmod.ko...
  Successfully loaded bpf_testmod.ko.
  test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
  test_bench_1:PASS:uprobe_bench__attach 0 nsec
  test_bench_1:PASS:uprobe1_cnt 0 nsec
  test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
  test_bench_1:PASS:uprobe2_cnt 0 nsec
  test_bench_1: uprobes (1 trap) in 36.439s
  test_bench_1: uprobes (2 trap) in 91.960s
  test_bench_1: syscalls in 17.872s
  #395/1 uprobe_syscall_bench/bench_1:OK
  #395 uprobe_syscall_bench:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

syscall uprobe execution seems to be ~2x faster than the 1 trap uprobe
and ~5x faster than the 2 traps uprobe

jirka

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-05 0:55 UTC (permalink / raw)
To: Jiri Olsa
Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> [...]
> test_bench_1: uprobes (1 trap) in 36.439s
> test_bench_1: uprobes (2 trap) in 91.960s
> test_bench_1: syscalls in 17.872s
> [...]
>
> syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> and ~5x faster than 2 traps uprobe

Thanks for running benchmarks! I quickly looked at the selftest and
noticed this:

+/*
+ * Assuming following prolog:
+ *
+ *   6984ac: 55             push %rbp
+ *   6984ad: 48 89 e5       mov  %rsp,%rbp
+ */
+noinline void uprobe2_bench_trigger(void)
+{
+       asm volatile ("");
+}

This actually will be optimized out to just ret in -O2 mode (make
RELEASE=1 for selftests):

00000000005a0ce0 <uprobe2_bench_trigger>:
  5a0ce0: c3                    retq
  5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00      nopw %cs:(%rax,%rax)
  5a0cec: 0f 1f 40 00           nopl (%rax)

So be careful with that.

Also, I just updated our existing set of uprobe benchmarks (see [0]),
do you mind adding your syscall-based one as another one there and
running all of them and sharing the numbers with us? Very curious to
see both absolute and relative numbers from that benchmark. (And
please do build with RELEASE=1.)

You should be able to just run benchs/run_bench_uprobes.sh (also don't
forget to add your syscall-based benchmark to the list of benchmarks
in that shell script).

Thank you!

BTW, while I think patching multiple instructions for syscall-based
uprobe is going to be extremely tricky, I think at least u*ret*probe's
int3 can be pretty easily optimized away with syscall, given that the
kernel controls code generation there. If anything, it will get the
uretprobe case a bit closer to the performance of uprobe. Give it some
thought.

[0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 0:55 ` Andrii Nakryiko @ 2024-03-05 8:24 ` Jiri Olsa 2024-03-05 15:30 ` Jiri Olsa 2024-03-11 10:59 ` Jiri Olsa 0 siblings, 2 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 8:24 UTC (permalink / raw) To: Andrii Nakryiko Cc: Jiri Olsa, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > trap on top of that. > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > user space trampoline that: > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > decision whether it even makes sense to go through all the trouble. 
> > > > > > > > > > right, will do that > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > difference during LPC. iirc it was something like 3x. > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > performance of int3 handling vs equivalent syscall handling. > > > > > > I suspect it's the former, and so probably not that representative. > > > I'm curious about the performance of going > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > being equal). > > > > I have a simple test [1] comparing: > > - uprobe with 2 traps > > - uprobe with 1 trap > > - syscall executing uprobe > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > its consumers, which should be comparable to what the trampoline will do > > > > test does same amount of loops triggering each uprobe type and measures > > the time it took > > > > # ./test_progs -t uprobe_syscall_bench -v > > bpf_testmod.ko is already unloaded. > > Loading bpf_testmod.ko... > > Successfully loaded bpf_testmod.ko. > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > test_bench_1: uprobes (1 trap) in 36.439s > > test_bench_1: uprobes (2 trap) in 91.960s > > test_bench_1: syscalls in 17.872s > > #395/1 uprobe_syscall_bench/bench_1:OK > > #395 uprobe_syscall_bench:OK > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > and ~5x faster than 2 traps uprobe > > > > Thanks for running benchmarks! 
I quickly looked at the selftest and > noticed this: > > +/* > + * Assuming following prolog: > + * > + * 6984ac: 55 push %rbp > + * 6984ad: 48 89 e5 mov %rsp,%rbp > + */ > +noinline void uprobe2_bench_trigger(void) > +{ > + asm volatile (""); > +} > > This actually will be optimized out to just ret in -O2 mode (make > RELEASE=1 for selftests): > > 00000000005a0ce0 <uprobe2_bench_trigger>: > 5a0ce0: c3 retq > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > 5a0cec: 0f 1f 40 00 nopl (%rax) > > So be careful with that. right, I did not mean for this to be checked in, just wanted to get the numbers quickly > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > do you mind adding your syscall-based one as another one there and > running all of them and sharing the numbers with us? Very curious to > see both absolute and relative numbers from that benchmark. (and > please do build with RELEASE=1) > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > forget to add your syscall-based benchmark to the list of benchmarks > in that shell script). yes, saw it and was going to run/compare it.. it's good idea to add the syscall one and get all numbers together, will do that > > Thank you! > > > BTW, while I think patching multiple instructions for syscall-based > uprobe is going to be extremely tricky, I think at least u*ret*probe's > int3 can be pretty easily optimized away with syscall, given that the > kernel controls code generation there. If anything, it will get the > uretprobe case a bit closer to the performance of uprobe. Give it some > thought. hm, right.. the trampoline is there already, but at the moment is global and used by all uretprobes.. and int3 code moves userspace (changes rip) to the original return address.. 
maybe we can do that through the syscall as well or we could add a jump back to the uretprobe's original return address to the trampoline, but then we need a special trampoline for each uretprobe, I'll check thanks, jirka > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > jirka > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > Certainly necessary to have a benchmark. > > > > selftests/bpf/bench has one for uprobe. > > > > Probably should extend with sys_bpf. > > > > > > > > Regarding: > > > > > replace the normal uprobe trap instruction with jump to > > > > > user space trampoline > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > memory mapping, so this part is already implemented > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > and explicit single trampoline for all USDT locations > > > > that saves all (callee and caller saved) registers and > > > > then does sys_bpf with a new cmd. > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > replace 1st byte to make an actual call insn. > > > > > > > > Once patched there will be no simulation of insns or kernel traps. > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 8:24 ` Jiri Olsa @ 2024-03-05 15:30 ` Jiri Olsa 2024-03-05 17:30 ` Andrii Nakryiko 2024-03-11 10:59 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 15:30 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > right, will do that > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > I'm curious about the performance of going > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > being equal). > > > > > > I have a simple test [1] comparing: > > > - uprobe with 2 traps > > > - uprobe with 1 trap > > > - syscall executing uprobe > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > its consumers, which should be comparable to what the trampoline will do > > > > > > test does same amount of loops triggering each uprobe type and measures > > > the time it took > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. 
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > test_bench_1: uprobes (1 trap) in 36.439s > > > test_bench_1: uprobes (2 trap) in 91.960s > > > test_bench_1: syscalls in 17.872s > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > #395 uprobe_syscall_bench:OK > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > and ~5x faster than 2 traps uprobe > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > noticed this: > > > > +/* > > + * Assuming following prolog: > > + * > > + * 6984ac: 55 push %rbp > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > + */ > > +noinline void uprobe2_bench_trigger(void) > > +{ > > + asm volatile (""); > > +} > > > > This actually will be optimized out to just ret in -O2 mode (make > > RELEASE=1 for selftests): > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > 5a0ce0: c3 retq > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > So be careful with that. > > right, I did not mean for this to be checked in, just wanted to get the > numbers quickly > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > do you mind adding your syscall-based one as another one there and > > running all of them and sharing the numbers with us? Very curious to > > see both absolute and relative numbers from that benchmark. (and > > please do build with RELEASE=1) > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > forget to add your syscall-based benchmark to the list of benchmarks > > in that shell script). > > yes, saw it and was going to run/compare it.. 
it's good idea to add > the syscall one and get all numbers together, will do that seems to be consistent with my previous test: base : 15.854 ± 0.007M/s uprobe-nop : 2.859 ± 0.007M/s uprobe-push : 2.697 ± 0.002M/s uprobe-ret : 1.081 ± 0.000M/s uprobe-syscall : 5.520 ± 0.006M/s uretprobe-nop : 1.422 ± 0.002M/s uretprobe-push : 1.396 ± 0.002M/s uretprobe-ret : 0.787 ± 0.000M/s uretprobe-syscall: 1.888 ± 0.002M/s syscall uprobe is ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobe uretprobe is bit more tricky to compare, the speed up is there for the initial uprobe hit, then there's again the trap from the uretprobe trampoline I have the bench changes in here [1], I'll send it out together with rfc post jirka [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 > > > > > Thank you! > > > > > > BTW, while I think patching multiple instructions for syscall-based > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > int3 can be pretty easily optimized away with syscall, given that the > > kernel controls code generation there. If anything, it will get the > > uretprobe case a bit closer to the performance of uprobe. Give it some > > thought. > > hm, right.. the trampoline is there already, but at the moment is global > and used by all uretprobes.. and int3 code moves userspace (changes rip) > to the original return address.. maybe we can do that through syscall > as well > > or we could add jump back to uretprobe's original return addrress to the > trampoline, but then we need special trampoline for each uretprobe, > I'll check > > thanks, > jirka > > > > > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > > > jirka > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > > > Certainly necessary to have a benchmark. 
> > > > > selftests/bpf/bench has one for uprobe. > > > > > Probably should extend with sys_bpf. > > > > > > > > > > Regarding: > > > > > > replace the normal uprobe trap instruction with jump to > > > > > user space trampoline > > > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > > memory mapping, so this part is already implemented > > > > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > > and explicit single trampoline for all USDT locations > > > > > that saves all (callee and caller saved) registers and > > > > > then does sys_bpf with a new cmd. > > > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > > replace 1st byte to make an actual call insn. > > > > > > > > > > Once patched there will be no simulation of insns or kernel traps. > > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 15:30 ` Jiri Olsa @ 2024-03-05 17:30 ` Andrii Nakryiko 0 siblings, 0 replies; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-05 17:30 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 5, 2024 at 7:30 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > I'm curious about the performance of going > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > being equal). > > > > > > > > I have a simple test [1] comparing: > > > > - uprobe with 2 traps > > > > - uprobe with 1 trap > > > > - syscall executing uprobe > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > the time it took > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. 
> > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > test_bench_1: syscalls in 17.872s > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > #395 uprobe_syscall_bench:OK > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > noticed this: > > > > > > +/* > > > + * Assuming following prolog: > > > + * > > > + * 6984ac: 55 push %rbp > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > + */ > > > +noinline void uprobe2_bench_trigger(void) > > > +{ > > > + asm volatile (""); > > > +} > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > RELEASE=1 for selftests): > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > 5a0ce0: c3 retq > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > So be careful with that. > > > > right, I did not mean for this to be checked in, just wanted to get the > > numbers quickly > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > do you mind adding your syscall-based one as another one there and > > > running all of them and sharing the numbers with us? Very curious to > > > see both absolute and relative numbers from that benchmark. (and > > > please do build with RELEASE=1) > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > forget to add your syscall-based benchmark to the list of benchmarks > > > in that shell script). 
> > > > yes, saw it and was going to run/compare it.. it's good idea to add > > the syscall one and get all numbers together, will do that > > seems to be consistent with my previous test: > > base : 15.854 ± 0.007M/s > uprobe-nop : 2.859 ± 0.007M/s > uprobe-push : 2.697 ± 0.002M/s > uprobe-ret : 1.081 ± 0.000M/s > uprobe-syscall : 5.520 ± 0.006M/s > uretprobe-nop : 1.422 ± 0.002M/s > uretprobe-push : 1.396 ± 0.002M/s > uretprobe-ret : 0.787 ± 0.000M/s > uretprobe-syscall: 1.888 ± 0.002M/s > > syscall uprobe is ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobe > great, thanks a lot for the numbers! It's good that we have comparable benchmark numbers now. > uretprobe is bit more tricky to compare, the speed up is there for the initial > uprobe hit, then there's again the trap from the uretprobe trampoline yep, makes sense > > I have the bench changes in here [1], I'll send it out together with rfc post > > jirka > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 > > > > > > > > > Thank you! > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > int3 can be pretty easily optimized away with syscall, given that the > > > kernel controls code generation there. If anything, it will get the > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > thought. > > > > hm, right.. the trampoline is there already, but at the moment is global > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > to the original return address.. 
maybe we can do that through syscall > > as well > > > > or we could add jump back to uretprobe's original return addrress to the > > trampoline, but then we need special trampoline for each uretprobe, > > I'll check > > > > thanks, > > jirka > > > > > > > > > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > > > > > jirka > > > > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > > > > > > Certainly necessary to have a benchmark. > > > > > > selftests/bpf/bench has one for uprobe. > > > > > > Probably should extend with sys_bpf. > > > > > > > > > > > > Regarding: > > > > > > > replace the normal uprobe trap instruction with jump to > > > > > > user space trampoline > > > > > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > > > memory mapping, so this part is already implemented > > > > > > > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > > > and explicit single trampoline for all USDT locations > > > > > > that saves all (callee and caller saved) registers and > > > > > > then does sys_bpf with a new cmd. > > > > > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > > > replace 1st byte to make an actual call insn. > > > > > > > > > > > > Once patched there will be no simulation of insns or kernel traps. 
> > > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 8:24 ` Jiri Olsa 2024-03-05 15:30 ` Jiri Olsa @ 2024-03-11 10:59 ` Jiri Olsa 2024-03-11 15:06 ` Oleg Nesterov 2024-03-11 17:32 ` Andrii Nakryiko 1 sibling, 2 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 10:59 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > right, will do that > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > I'm curious about the performance of going > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > being equal). > > > > > > I have a simple test [1] comparing: > > > - uprobe with 2 traps > > > - uprobe with 1 trap > > > - syscall executing uprobe > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > its consumers, which should be comparable to what the trampoline will do > > > > > > test does same amount of loops triggering each uprobe type and measures > > > the time it took > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. 
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > test_bench_1: uprobes (1 trap) in 36.439s > > > test_bench_1: uprobes (2 trap) in 91.960s > > > test_bench_1: syscalls in 17.872s > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > #395 uprobe_syscall_bench:OK > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > and ~5x faster than 2 traps uprobe > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > noticed this: > > > > +/* > > + * Assuming following prolog: > > + * > > + * 6984ac: 55 push %rbp > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > + */ > > +noinline void uprobe2_bench_trigger(void) > > +{ > > + asm volatile (""); > > +} > > > > This actually will be optimized out to just ret in -O2 mode (make > > RELEASE=1 for selftests): > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > 5a0ce0: c3 retq > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > So be careful with that. > > right, I did not mean for this to be checked in, just wanted to get the > numbers quickly > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > do you mind adding your syscall-based one as another one there and > > running all of them and sharing the numbers with us? Very curious to > > see both absolute and relative numbers from that benchmark. (and > > please do build with RELEASE=1) > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > forget to add your syscall-based benchmark to the list of benchmarks > > in that shell script). > > yes, saw it and was going to run/compare it.. 
it's good idea to add > the syscall one and get all numbers together, will do that > > > > > Thank you! > > > > > > BTW, while I think patching multiple instructions for syscall-based > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > int3 can be pretty easily optimized away with syscall, given that the > > kernel controls code generation there. If anything, it will get the > > uretprobe case a bit closer to the performance of uprobe. Give it some > > thought. > > hm, right.. the trampoline is there already, but at the moment is global > and used by all uretprobes.. and int3 code moves userspace (changes rip) > to the original return address.. maybe we can do that through syscall > as well

it seems like a good idea, I tried the change below (use syscall on the
return trampoline) and got some speedup:

current:

  base           :   15.817 ± 0.009M/s
  uprobe-nop     :    2.901 ± 0.000M/s
  uprobe-push    :    2.743 ± 0.002M/s
  uprobe-ret     :    1.089 ± 0.001M/s
  uretprobe-nop  :    1.448 ± 0.001M/s
  uretprobe-push :    1.407 ± 0.001M/s
  uretprobe-ret  :    0.792 ± 0.001M/s

with syscall:

  base           :   15.831 ± 0.026M/s
  uprobe-nop     :    2.904 ± 0.001M/s
  uprobe-push    :    2.764 ± 0.002M/s
  uprobe-ret     :    1.082 ± 0.001M/s
  uretprobe-nop  :    1.785 ± 0.000M/s
  uretprobe-push :    1.733 ± 0.001M/s
  uretprobe-ret  :    0.885 ± 0.004M/s

~23% for nop/push (emulated) cases, ~11% for ret (sstep) case

jirka

---
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uprobe			sys_uprobe

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..fceef2b4e243 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,11 +12,13 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 #include <linux/kdebug.h>

 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/syscalls.h>

 /* Post-execution fixups. */
@@ -308,6 +310,53 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }

 #ifdef CONFIG_X86_64
+
+asm (
+	".pushsection .rodata\n"
+	".global uretprobe_syscall_entry\n"
+	"uretprobe_syscall_entry:\n"
+	"push %rax\n"
+	"mov $462, %rax\n"
+	"syscall\n"
+	".global uretprobe_syscall_end\n"
+	"uretprobe_syscall_end:\n"
+	".popsection\n"
+);
+
+extern __visible u8 uretprobe_syscall_entry[];
+extern __visible u8 uretprobe_syscall_end[];
+
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize)
+{
+	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+	return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE1(uprobe, unsigned long, cmd)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long ax, err;
+
+	/*
+	 * We get invoked from the trampoline that pushed rax
+	 * value on stack, read and restore the value.
+	 */
+	err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax));
+	WARN_ON_ONCE(err);
+
+	regs->ax = ax;
+	regs->orig_ax = ax;
+
+	/*
+	 * And pop the stack back, because we jump to original
+	 * return value.
+	 */
+	regs->sp = regs->sp + sizeof(regs->sp);
+
+	uprobe_handle_trampoline(regs);
+	return ax;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..4f7d5b41b718 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,6 +987,8 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
 asmlinkage long sys_spu_create(const char __user *name,
 		unsigned int flags, umode_t mode, int fd);

+asmlinkage long sys_uprobe(unsigned long cmd);
+
 /*
  * Deprecated system calls which are still defined in
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..2702799648e6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)

+#define __NR_uprobe 462
+__SYSCALL(__NR_uprobe, sys_uprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463

 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..fefaf4804e1f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,10 +1474,19 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }

+uprobe_opcode_t* __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
+	uprobe_opcode_t *insns;
 	struct xol_area *area;

 	area = kmalloc(sizeof(*area), GFP_KERNEL);
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);

 	if (!xol_add_vma(mm, area))
 		return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }

-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
 	bp_vaddr = uprobe_get_swbp_addr(regs);

 	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+		return uprobe_handle_trampoline(regs);

 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..ddc954f28317 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);

 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uprobe);

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 10:59 ` Jiri Olsa @ 2024-03-11 15:06 ` Oleg Nesterov 2024-03-11 16:46 ` Jiri Olsa 2024-03-11 17:32 ` Andrii Nakryiko 1 sibling, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2024-03-11 15:06 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann I forgot everything about the low-level x86_64 code, but... On 03/11, Jiri Olsa wrote: > > #ifdef CONFIG_X86_64 > + > +asm ( > + ".pushsection .rodata\n" > + ".global uretprobe_syscall_entry\n" > + "uretprobe_syscall_entry:\n" > + "push %rax\n" > + "mov $462, %rax\n" > + "syscall\n" Hmm... I think you need to save/restore more registers clobbered by syscall/entry_SYSCALL_64 ? > +SYSCALL_DEFINE1(uprobe, unsigned long, cmd) > +{ > + struct pt_regs *regs = task_pt_regs(current); > + unsigned long ax, err; > + > + /* > + * We get invoked from the trampoline that pushed rax > + * value on stack, read and restore the value. > + */ > + err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax)); > + WARN_ON_ONCE(err); > + > + regs->ax = ax; probably not strictly needed, we are going to return ax... > + regs->orig_ax = ax; This doesn't look right. I think you need regs->orig_ax = -1; Say, to avoid the "Did we come from a system call" checks in arch_do_signal_or_restart() or handle_signal(). Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 15:06 ` Oleg Nesterov @ 2024-03-11 16:46 ` Jiri Olsa 2024-03-11 17:02 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 16:46 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > I forgot everything about the low-level x86_64 code, but... > > On 03/11, Jiri Olsa wrote: > > > > #ifdef CONFIG_X86_64 > > + > > +asm ( > > + ".pushsection .rodata\n" > > + ".global uretprobe_syscall_entry\n" > > + "uretprobe_syscall_entry:\n" > > + "push %rax\n" > > + "mov $462, %rax\n" > > + "syscall\n" > > Hmm... I think you need to save/restore more registers clobbered by > syscall/entry_SYSCALL_64 ? hum, so the call happens on the function call return, so I thought we should just preserve callee saved registers which seems to be taken care of by the entry_SYSCALL_64 path.. I will double check > > > +SYSCALL_DEFINE1(uprobe, unsigned long, cmd) > > +{ > > + struct pt_regs *regs = task_pt_regs(current); > > + unsigned long ax, err; > > + > > + /* > > + * We get invoked from the trampoline that pushed rax > > + * value on stack, read and restore the value. > > + */ > > + err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax)); > > + WARN_ON_ONCE(err); > > + > > + regs->ax = ax; > > probably not strictly needed, we are going to return ax... it needs to be there for the bpf program to read proper return value from regs > > > + regs->orig_ax = ax; > > This doesn't look right. I think you need > > regs->orig_ax = -1; > > Say, to avoid the "Did we come from a system call" checks in > arch_do_signal_or_restart() or handle_signal(). ugh right that's probably wrong, I need check on that thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 16:46 ` Jiri Olsa @ 2024-03-11 17:02 ` Oleg Nesterov 2024-03-11 21:11 ` Jiri Olsa 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2024-03-11 17:02 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On 03/11, Jiri Olsa wrote: > > On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > > I forgot everything about the low-level x86_64 code, but... > > > > On 03/11, Jiri Olsa wrote: > > > > > > #ifdef CONFIG_X86_64 > > > + > > > +asm ( > > > + ".pushsection .rodata\n" > > > + ".global uretprobe_syscall_entry\n" > > > + "uretprobe_syscall_entry:\n" > > > + "push %rax\n" > > > + "mov $462, %rax\n" > > > + "syscall\n" > > > > Hmm... I think you need to save/restore more registers clobbered by > > syscall/entry_SYSCALL_64 ? > > hum, so the call happens on the function call return, so I thought > we should just preserve callee saved registers which seems to be > taken care of by the entry_SYSCALL_64 path.. Yes, but we do not know if the (ret)probed function obeys the C-calling convention, perhaps it is low level asm code or not a C function. > I will double check but I won't insist if you think we do not care. > > > + > > > + regs->ax = ax; > > > > probably not strictly needed, we are going to return ax... > > it needs to be there for the bpf program to read proper return > value from regs OK, I see, thanks. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 17:02 ` Oleg Nesterov @ 2024-03-11 21:11 ` Jiri Olsa 0 siblings, 0 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 21:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On Mon, Mar 11, 2024 at 06:02:31PM +0100, Oleg Nesterov wrote: > On 03/11, Jiri Olsa wrote: > > > > On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > > > I forgot everything about the low-level x86_64 code, but... > > > > > > On 03/11, Jiri Olsa wrote: > > > > > > > > #ifdef CONFIG_X86_64 > > > > + > > > > +asm ( > > > > + ".pushsection .rodata\n" > > > > + ".global uretprobe_syscall_entry\n" > > > > + "uretprobe_syscall_entry:\n" > > > > + "push %rax\n" > > > > + "mov $462, %rax\n" > > > > + "syscall\n" > > > > > > Hmm... I think you need to save/restore more registers clobbered by > > > syscall/entry_SYSCALL_64 ? > > > > hum, so the call happens on the function call return, so I thought > > we should just preserve callee saved registers which seems to be > > taken care of by the entry_SYSCALL_64 path.. > > Yes, but we do not know if the (ret)probed function obeys the C-calling > convention, perhaps it is low level asm code or not a C function. ah right.. I think we need to make sure all is saved/restored thanks, jirka > > > I will double check > > but I won't insist if you think we do not care. > > > > > + > > > > + regs->ax = ax; > > > > > > probably not strictly needed, we are going to return ax... > > > > it needs to be there for the bpf program to read proper return > > value from regs > > OK, I see, thanks. > > Oleg. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 10:59 ` Jiri Olsa 2024-03-11 15:06 ` Oleg Nesterov @ 2024-03-11 17:32 ` Andrii Nakryiko 2024-03-11 21:26 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-11 17:32 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > int3 interrupt handling? 
If not, do you think you'll be able to get > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > I'm curious about the performance of going > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > being equal). > > > > > > > > I have a simple test [1] comparing: > > > > - uprobe with 2 traps > > > > - uprobe with 1 trap > > > > - syscall executing uprobe > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > the time it took > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. 
> > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > test_bench_1: syscalls in 17.872s > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > #395 uprobe_syscall_bench:OK > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > noticed this: > > > > > > +/* > > > + * Assuming following prolog: > > > + * > > > + * 6984ac: 55 push %rbp > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > + */ > > > +noinline void uprobe2_bench_trigger(void) > > > +{ > > > + asm volatile (""); > > > +} > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > RELEASE=1 for selftests): > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > 5a0ce0: c3 retq > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > So be careful with that. > > > > right, I did not mean for this to be checked in, just wanted to get the > > numbers quickly > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > do you mind adding your syscall-based one as another one there and > > > running all of them and sharing the numbers with us? Very curious to > > > see both absolute and relative numbers from that benchmark. (and > > > please do build with RELEASE=1) > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > forget to add your syscall-based benchmark to the list of benchmarks > > > in that shell script). 
> > > > yes, saw it and was going to run/compare it.. it's good idea to add > > the syscall one and get all numbers together, will do that > > > > > > > > Thank you! > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > int3 can be pretty easily optimized away with syscall, given that the > > > kernel controls code generation there. If anything, it will get the > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > thought. > > > > hm, right.. the trampoline is there already, but at the moment is global > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > to the original return address.. maybe we can do that through syscall > > as well > > it seems like good idea, I tried change below (use syscall on return > trampoline) and got some speedup: > > current: > base : 15.817 ± 0.009M/s > uprobe-nop : 2.901 ± 0.000M/s > uprobe-push : 2.743 ± 0.002M/s > uprobe-ret : 1.089 ± 0.001M/s > uretprobe-nop : 1.448 ± 0.001M/s > uretprobe-push : 1.407 ± 0.001M/s > uretprobe-ret : 0.792 ± 0.001M/s > > with syscall: > base : 15.831 ± 0.026M/s > uprobe-nop : 2.904 ± 0.001M/s > uprobe-push : 2.764 ± 0.002M/s > uprobe-ret : 1.082 ± 0.001M/s > uretprobe-nop : 1.785 ± 0.000M/s > uretprobe-push : 1.733 ± 0.001M/s > uretprobe-ret : 0.885 ± 0.004M/s > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > jirka heh, I tried this as well over weekend, though I cut few more corners (see diff below, I didn't add saving/restoring rax, though that would be required, of course). 
My test machine is (way) slower, though, so I got a slightly different numbers (up to 15%): ### baseline uprobe-base : 79.462 ± 0.058M/s base : 2.920 ± 0.004M/s uprobe-nop : 1.093 ± 0.001M/s uprobe-push : 1.066 ± 0.001M/s uprobe-ret : 0.480 ± 0.001M/s uretprobe-nop : 0.555 ± 0.000M/s uretprobe-push : 0.549 ± 0.000M/s uretprobe-ret : 0.338 ± 0.000M/s ### uretprobe syscall (vs baseline) uprobe-base : 79.488 ± 0.033M/s base : 2.917 ± 0.003M/s uprobe-nop : 1.095 ± 0.001M/s uprobe-push : 1.058 ± 0.000M/s uprobe-ret : 0.483 ± 0.000M/s uretprobe-nop : 0.638 ± 0.000M/s (+15%) uretprobe-push : 0.627 ± 0.000M/s (+14.2%) uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) Either way, yes, we should implement this. Are you planning to send an official patch some time soon? I'm working on other small improvements in uprobe/uretprobe, I'll probably send the first patches today/tomorrow, but they shouldn't interfere with this uretprobe code path. See also a few comments on your code below. My hacky changes: commit dcf59baa5ad8ea4edb86ff1558eb1dcc28fcc7c0 Author: Andrii Nakryiko <andrii@kernel.org> Date: Sat Mar 9 20:16:02 2024 -0800 [WIP] uprobes: implement uretprobe through syscall Signed-off-by: Andrii Nakryiko <andrii@kernel.org> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 5f8591ce7f25..0207dfb5018d 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -466,3 +466,4 @@ 459 i386 lsm_get_self_attr sys_lsm_get_self_attr 460 i386 lsm_set_self_attr sys_lsm_set_self_attr 461 i386 lsm_list_modules sys_lsm_list_modules +462 i386 trace sys_trace diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 7e8d46f4147f..c16599e15bbb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -383,6 +383,7 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common 
lsm_list_modules sys_lsm_list_modules +462 common trace sys_trace # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 77eb9b0e7685..f2863c9fab3d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -965,6 +965,8 @@ asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx *ctx, size_t size, __u32 flags); asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags); +asmlinkage long sys_trace(int cmd); + /* * Architecture-specific system calls */ diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 75f00965ab15..2f66eb8b068e 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) #define __NR_lsm_list_modules 461 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) +#define __NR_trace 462 +__SYSCALL(__NR_TRACE, sys_trace) + #undef __NR_syscalls -#define __NR_syscalls 462 +#define __NR_syscalls 463 /* * 32 bit systems traditionally used different diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 56a460719628..c6fc15bdbffc 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -28,6 +28,7 @@ #include <linux/khugepaged.h> #include <linux/uprobes.h> +#include <linux/syscalls.h> #define UINSNS_PER_PAGE (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES) #define MAX_UPROBE_XOL_SLOTS UINSNS_PER_PAGE @@ -1488,8 +1489,15 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; - uprobe_opcode_t insn = UPROBE_SWBP_INSN; + //uprobe_opcode_t insn = UPROBE_SWBP_INSN; struct xol_area *area; + char uret_syscall_patch[] = { + /* mov rax, __NR_trace */ + 0x48, 0xC7, 0xC0, 0xCE, 0x01, 0x00, 0x00, + /* syscall */ + 0x0F, 0x05, + }; + const int URET_SYSCALL_INSNS_SIZE = 
ARRAY_SIZE(uret_syscall_patch); area = kmalloc(sizeof(*area), GFP_KERNEL); if (unlikely(!area)) @@ -1513,7 +1521,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) /* Reserve the 1st slot for get_trampoline_vaddr() */ set_bit(0, area->bitmap); atomic_set(&area->slot_count, 1); - arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE); + arch_uprobe_copy_ixol(area->pages[0], 0, &uret_syscall_patch, URET_SYSCALL_INSNS_SIZE); + //arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE); if (!xol_add_vma(mm, area)) return area; @@ -2366,3 +2375,16 @@ void __init uprobes_init(void) BUG_ON(register_die_notifier(&uprobe_exception_nb)); } + +SYSCALL_DEFINE0(trace) +{ + struct pt_regs *regs = current_pt_regs(); + + //printk("BEFORE UTRACE!!!\n"); + + handle_trampoline(regs); + + //printk("AFTER UTRACE!!!\n"); + + return 0; +} diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index faad00cce269..4a3bc957dd43 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -174,6 +174,7 @@ COND_SYSCALL_COMPAT(fadvise64_64); COND_SYSCALL(lsm_get_self_attr); COND_SYSCALL(lsm_set_self_attr); COND_SYSCALL(lsm_list_modules); +COND_SYSCALL(trace); /* CONFIG_MMU only */ COND_SYSCALL(swapon); > > > --- > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 7e8d46f4147f..fa5f8a058bc2 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -383,6 +383,7 @@ > 459 common lsm_get_self_attr sys_lsm_get_self_attr > 460 common lsm_set_self_attr sys_lsm_set_self_attr > 461 common lsm_list_modules sys_lsm_list_modules > +462 64 uprobe sys_uprobe > we should call it "uretprobe", "uprobe" will be a separate thing with different logic. 
I went with generic "trace", but realized that it would be better to have separate more targeted "special/internal" syscalls (where, if necessary, extra arguments would be passed through stack to avoid storing/restoring user-space registers). We have rt_sigreturn precedent which explicitly states that userspace shouldn't use it and shouldn't rely on any specific arguments conventions. [...] > /* > * Deprecated system calls which are still defined in > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index f46e0ca0169c..9ef244c8ff19 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > void *src, unsigned long len); > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); just `void *` here? it can be a sequence of instructions now > #else /* !CONFIG_UPROBES */ > struct uprobes_state { > }; > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index 75f00965ab15..2702799648e6 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) > #define __NR_lsm_list_modules 461 > __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) > > +#define __NR_uprobe 462 > +__SYSCALL(__NR_uprobe, sys_uprobe) [...] ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 17:32 ` Andrii Nakryiko @ 2024-03-11 21:26 ` Jiri Olsa 2024-03-11 23:05 ` Andrii Nakryiko 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 21:26 UTC (permalink / raw) To: Andrii Nakryiko Cc: Jiri Olsa, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote: > On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > > trap on top of that. 
> > > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > > I'm curious about the performance of going > > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > > being equal). > > > > > > > > > > I have a simple test [1] comparing: > > > > > - uprobe with 2 traps > > > > > - uprobe with 1 trap > > > > > - syscall executing uprobe > > > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > > the time it took > > > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > > bpf_testmod.ko is already unloaded. 
> > > > > Loading bpf_testmod.ko... > > > > > Successfully loaded bpf_testmod.ko. > > > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > > test_bench_1: syscalls in 17.872s > > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > > #395 uprobe_syscall_bench:OK > > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > > noticed this: > > > > > > > > +/* > > > > + * Assuming following prolog: > > > > + * > > > > + * 6984ac: 55 push %rbp > > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > > + */ > > > > +noinline void uprobe2_bench_trigger(void) > > > > +{ > > > > + asm volatile (""); > > > > +} > > > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > > RELEASE=1 for selftests): > > > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > > 5a0ce0: c3 retq > > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > > > So be careful with that. > > > > > > right, I did not mean for this to be checked in, just wanted to get the > > > numbers quickly > > > > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > > do you mind adding your syscall-based one as another one there and > > > > running all of them and sharing the numbers with us? Very curious to > > > > see both absolute and relative numbers from that benchmark. 
(and > > > > please do build with RELEASE=1) > > > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > > forget to add your syscall-based benchmark to the list of benchmarks > > > > in that shell script). > > > > > > yes, saw it and was going to run/compare it.. it's good idea to add > > > the syscall one and get all numbers together, will do that > > > > > > > > > > > Thank you! > > > > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > > int3 can be pretty easily optimized away with syscall, given that the > > > > kernel controls code generation there. If anything, it will get the > > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > > thought. > > > > > > hm, right.. the trampoline is there already, but at the moment is global > > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > > to the original return address.. maybe we can do that through syscall > > > as well > > > > it seems like good idea, I tried change below (use syscall on return > > trampoline) and got some speedup: > > > > current: > > base : 15.817 ± 0.009M/s > > uprobe-nop : 2.901 ± 0.000M/s > > uprobe-push : 2.743 ± 0.002M/s > > uprobe-ret : 1.089 ± 0.001M/s > > uretprobe-nop : 1.448 ± 0.001M/s > > uretprobe-push : 1.407 ± 0.001M/s > > uretprobe-ret : 0.792 ± 0.001M/s > > > > with syscall: > > base : 15.831 ± 0.026M/s > > uprobe-nop : 2.904 ± 0.001M/s > > uprobe-push : 2.764 ± 0.002M/s > > uprobe-ret : 1.082 ± 0.001M/s > > uretprobe-nop : 1.785 ± 0.000M/s > > uretprobe-push : 1.733 ± 0.001M/s > > uretprobe-ret : 0.885 ± 0.004M/s > > > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > > > jirka > > heh, I tried this as well over weekend, though I cut few more corners > (see diff below, I didn't add saving/restoring rax, though that would > be required, of course). 
My test machine is (way) slower, though, so I > got a slightly different numbers (up to 15%): nice :-) btw I just checked on another slower amd server and it's ~10% in all 3 cases, my previous results are from intel machine.. I guess the hw trap behaviour/speed makes this not proportional across archs > > ### baseline > uprobe-base : 79.462 ± 0.058M/s > base : 2.920 ± 0.004M/s > uprobe-nop : 1.093 ± 0.001M/s > uprobe-push : 1.066 ± 0.001M/s > uprobe-ret : 0.480 ± 0.001M/s > uretprobe-nop : 0.555 ± 0.000M/s > uretprobe-push : 0.549 ± 0.000M/s > uretprobe-ret : 0.338 ± 0.000M/s > > > ### uretprobe syscall (vs baseline) > uprobe-base : 79.488 ± 0.033M/s > base : 2.917 ± 0.003M/s > uprobe-nop : 1.095 ± 0.001M/s > uprobe-push : 1.058 ± 0.000M/s > uprobe-ret : 0.483 ± 0.000M/s > uretprobe-nop : 0.638 ± 0.000M/s (+15%) > uretprobe-push : 0.627 ± 0.000M/s (+14.2%) > uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) > > Either way, yes, we should implement this. Are you planning to send an > official patch some time soon? I'm working on other small improvements > in uprobe/uretprobe, I'll probably send the first patches > today/tomorrow, but they shouldn't interfere with this uretprobe code > path. yes, wanted to finish/post it this week SNIP > > --- > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > index 7e8d46f4147f..fa5f8a058bc2 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -383,6 +383,7 @@ > > 459 common lsm_get_self_attr sys_lsm_get_self_attr > > 460 common lsm_set_self_attr sys_lsm_set_self_attr > > 461 common lsm_list_modules sys_lsm_list_modules > > +462 64 uprobe sys_uprobe > > > > we should call it "uretprobe", "uprobe" will be a separate thing with > different logic. 
> > I went with generic "trace", but realized that it would be better to > have separate more targeted "special/internal" syscalls (where, if > necessary, extra arguments would be passed through stack to avoid > storing/restoring user-space registers). We have rt_sigreturn > precedent which explicitly states that userspace shouldn't use it and > shouldn't rely on any specific arguments conventions. somehow I thought of syscalls as a scarce resource and wanted to add arguments/commands to the uprobe syscalls.. but having uretprobe dedicated syscall makes things easier > > [...] > > > /* > > * Deprecated system calls which are still defined in > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > index f46e0ca0169c..9ef244c8ff19 100644 > > --- a/include/linux/uprobes.h > > +++ b/include/linux/uprobes.h > > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > > void *src, unsigned long len); > > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); > > just `void *` here? it can be a sequence of instructions now hm, it's pointer to u8, which should be fine no? is there benefit to have void* in here instead? thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 21:26 ` Jiri Olsa @ 2024-03-11 23:05 ` Andrii Nakryiko 0 siblings, 0 replies; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-11 23:05 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 2:26 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote: > > On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > > > trap on top of that. 
> > > > > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > > > I'm curious about the performance of going > > > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > > > being equal). 
> > > > > > > > > > > > I have a simple test [1] comparing: > > > > > > - uprobe with 2 traps > > > > > > - uprobe with 1 trap > > > > > > - syscall executing uprobe > > > > > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > > > the time it took > > > > > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > > > bpf_testmod.ko is already unloaded. > > > > > > Loading bpf_testmod.ko... > > > > > > Successfully loaded bpf_testmod.ko. > > > > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > > > test_bench_1: syscalls in 17.872s > > > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > > > #395 uprobe_syscall_bench:OK > > > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > > > > > > > Thanks for running benchmarks! 
I quickly looked at the selftest and > > > > > noticed this: > > > > > > > > > > +/* > > > > > + * Assuming following prolog: > > > > > + * > > > > > + * 6984ac: 55 push %rbp > > > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > > > + */ > > > > > +noinline void uprobe2_bench_trigger(void) > > > > > +{ > > > > > + asm volatile (""); > > > > > +} > > > > > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > > > RELEASE=1 for selftests): > > > > > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > > > 5a0ce0: c3 retq > > > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > > > > > So be careful with that. > > > > > > > > right, I did not mean for this to be checked in, just wanted to get the > > > > numbers quickly > > > > > > > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > > > do you mind adding your syscall-based one as another one there and > > > > > running all of them and sharing the numbers with us? Very curious to > > > > > see both absolute and relative numbers from that benchmark. (and > > > > > please do build with RELEASE=1) > > > > > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > > > forget to add your syscall-based benchmark to the list of benchmarks > > > > > in that shell script). > > > > > > > > yes, saw it and was going to run/compare it.. it's good idea to add > > > > the syscall one and get all numbers together, will do that > > > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > > > int3 can be pretty easily optimized away with syscall, given that the > > > > > kernel controls code generation there. If anything, it will get the > > > > > uretprobe case a bit closer to the performance of uprobe. 
Give it some > > > > > thought. > > > > > > > > hm, right.. the trampoline is there already, but at the moment is global > > > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > > > to the original return address.. maybe we can do that through syscall > > > > as well > > > > > > it seems like good idea, I tried change below (use syscall on return > > > trampoline) and got some speedup: > > > > > > current: > > > base : 15.817 ± 0.009M/s > > > uprobe-nop : 2.901 ± 0.000M/s > > > uprobe-push : 2.743 ± 0.002M/s > > > uprobe-ret : 1.089 ± 0.001M/s > > > uretprobe-nop : 1.448 ± 0.001M/s > > > uretprobe-push : 1.407 ± 0.001M/s > > > uretprobe-ret : 0.792 ± 0.001M/s > > > > > > with syscall: > > > base : 15.831 ± 0.026M/s > > > uprobe-nop : 2.904 ± 0.001M/s > > > uprobe-push : 2.764 ± 0.002M/s > > > uprobe-ret : 1.082 ± 0.001M/s > > > uretprobe-nop : 1.785 ± 0.000M/s > > > uretprobe-push : 1.733 ± 0.001M/s > > > uretprobe-ret : 0.885 ± 0.004M/s > > > > > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > > > > > jirka > > > > heh, I tried this as well over weekend, though I cut few more corners > > (see diff below, I didn't add saving/restoring rax, though that would > > be required, of course). My test machine is (way) slower, though, so I > > got a slightly different numbers (up to 15%): > > nice :-) btw I just checked on another slower amd server and it's ~10% in > all 3 cases, my previous results are from intel machine.. 
I guess the hw > trap behaviour/speed makes this not proportional across archs > > > > > ### baseline > > uprobe-base : 79.462 ± 0.058M/s > > base : 2.920 ± 0.004M/s > > uprobe-nop : 1.093 ± 0.001M/s > > uprobe-push : 1.066 ± 0.001M/s > > uprobe-ret : 0.480 ± 0.001M/s > > uretprobe-nop : 0.555 ± 0.000M/s > > uretprobe-push : 0.549 ± 0.000M/s > > uretprobe-ret : 0.338 ± 0.000M/s > > > > > > ### uretprobe syscall (vs baseline) > > uprobe-base : 79.488 ± 0.033M/s > > base : 2.917 ± 0.003M/s > > uprobe-nop : 1.095 ± 0.001M/s > > uprobe-push : 1.058 ± 0.000M/s > > uprobe-ret : 0.483 ± 0.000M/s > > uretprobe-nop : 0.638 ± 0.000M/s (+15%) > > uretprobe-push : 0.627 ± 0.000M/s (+14.2%) > > uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) > > > > Either way, yes, we should implement this. Are you planning to send an > > official patch some time soon? I'm working on other small improvements > > in uprobe/uretprobe, I'll probably send the first patches > > today/tomorrow, but they shouldn't interfere with this uretprobe code > > path. > > yes, wanted to finish/post it this week > great, looking forward, we can use these speeds up for uretprobe in our production > SNIP > > > > --- > > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > > index 7e8d46f4147f..fa5f8a058bc2 100644 > > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > > @@ -383,6 +383,7 @@ > > > 459 common lsm_get_self_attr sys_lsm_get_self_attr > > > 460 common lsm_set_self_attr sys_lsm_set_self_attr > > > 461 common lsm_list_modules sys_lsm_list_modules > > > +462 64 uprobe sys_uprobe > > > > > > > we should call it "uretprobe", "uprobe" will be a separate thing with > > different logic. 
> > > > I went with generic "trace", but realized that it would be better to > > have separate more targeted "special/internal" syscalls (where, if > > necessary, extra arguments would be passed through stack to avoid > > storing/restoring user-space registers). We have rt_sigreturn > > precedent which explicitly states that userspace shouldn't use it and > > shouldn't rely on any specific arguments conventions. > > somehow I thought of syscalls as a scarce resource and wanted to add > arguments/commands to the uprobe syscalls.. but having a dedicated > uretprobe syscall makes things easier given we are at 462 already, not sure I believe it's scarce :) but it's also a performance aspect, not using any arguments means we can avoid saving/restoring user-space registers, so it makes sense to have 2-3 dedicated uprobe/uretprobe syscalls vs one more generic one (IMO), at least from a performance POV. > > > > > [...] > > > > > /* > > > * Deprecated system calls which are still defined in > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > index f46e0ca0169c..9ef244c8ff19 100644 > > > --- a/include/linux/uprobes.h > > > +++ b/include/linux/uprobes.h > > > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > > > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > > > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > > > void *src, unsigned long len); > > > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > > > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); > > > > just `void *` here? it can be a sequence of instructions now > > hm, it's pointer to u8, which should be fine no? is there benefit to > have void* in here instead? > Quick grepping initially brought up `typedef u32 uprobe_opcode_t;`, but that's for non-x86 architectures. 
I don't think it matters all that much (in my mind it's just a generated code blob, so just `void *` memory, we don't have to look at its contents). > thanks, > jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-01 17:01 ` Alexei Starovoitov 2024-03-01 17:26 ` Andrii Nakryiko @ 2024-03-02 20:46 ` Jiri Olsa 2024-03-02 21:08 ` Alexei Starovoitov 1 sibling, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-02 20:46 UTC (permalink / raw) To: Alexei Starovoitov Cc: Jiri Olsa, yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Fri, Mar 01, 2024 at 09:01:07AM -0800, Alexei Starovoitov wrote: > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > two traps in worst case scenario or single trap if the original > > > > instruction can be emulated. For return uprobes there's one extra > > > > trap on top of that. > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > user space trampoline that: > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > Did you get a chance to measure relative performance of syscall vs > > > int3 interrupt handling? If not, do you think you'll be able to get > > > some numbers by the time the conference starts? This should inform the > > > decision whether it even makes sense to go through all the trouble. > > > > right, will do that > > I believe Yusheng measured syscall vs uprobe performance > difference during LPC. iirc it was something like 3x. > Certainly necessary to have a benchmark. > selftests/bpf/bench has one for uprobe. > Probably should extend with sys_bpf. 
> ok, did not know there was uprobe benchmark, will check > Regarding: > > replace the normal uprobe trap instruction with jump to > user space trampoline > > it should probably be a call to trampoline instead of a jump. > Unless you plan to generate a different trampoline for every location ? I wanted to store the ip of the uprobe as argument for the syscall, but the call instruction will push return address on stack and we can use it to get uprobe's address.. great > > Also how would you pick a space for a trampoline in the target process ? > Analyze /proc/pid/maps and look for gaps in executable sections? As Andrii mentioned in other response there's already one page mapped as '[uprobes]' mapping, it's used as trampoline for return uprobes (contains just int3 instruction) and as buffers to hold the original instruction for the single step execution I think if we endup with just single trampoline we can just use some of the space from that page, our trampoline should not be big > > We can start simple with a USDT that uses nop5 instead of nop1 > and explicit single trampoline for all USDT locations > that saves all (callee and caller saved) registers and > then does sys_bpf with a new cmd. ah, I did not realize USDTs are like that, will check, good idea > > To replace nop5 with a call to trampoline we can use text_poke_bp > approach: replace 1st byte with int3, replace 2-5 with target addr, > replace 1st byte to make an actual call insn. I'm bit in the dark in here, but uprobe_write_opcode stores the int3 byte by allocating new page, copying the contents of the old page over and updating it with int3 byte.. then calls __replace_page to put new page in place should that be enough also for 5 bytes update? the cpu executing that exact page will page fault and get the new updated page? 
I discussed with Oleg and got this understanding, I might be wrong hm what if the cpu is just executing the address in the middle of the uprobe's original instructions and the page gets updated.. I need to check more on this ;-) > > Once patched there will be no simulation of insns or kernel traps. > Just normal user code that calls into trampoline, that calls sys_bpf, > and returns back. I saw this as generic uprobe enhancement, should it be sys_bpf syscall, not a some generic one? we will call all the uprobe's handlers/consumers thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-02 20:46 ` Jiri Olsa @ 2024-03-02 21:08 ` Alexei Starovoitov 2024-03-02 21:49 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Alexei Starovoitov @ 2024-03-02 21:08 UTC (permalink / raw) To: Jiri Olsa Cc: yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Sat, Mar 2, 2024 at 12:46 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > I'm bit in the dark in here, but uprobe_write_opcode stores the int3 > byte by allocating new page, copying the contents of the old page over > and updating it with int3 byte.. then calls __replace_page to put new > page in place > > should that be enough also for 5 bytes update? the cpu executing that > exact page will page fault and get the new updated page? I discussed > with Oleg and got this understanding, I might be wrong > > hm what if the cpu is just executing the address in the middle of the > uprobe's original instructions and the page gets updated.. I need to > check more on this ;-) I suspect it's all working fine already. Only x86 is using single byte uprobe. All other archs are using 2 or 4 byte. So replacing an insn or two with a call should work. > I saw this as generic uprobe enhancement, should it be sys_bpf syscall, > not a some generic one? we will call all the uprobe's handlers/consumers yeah. If we can make all uprobes faster without relying on nop5 usdt then it's certainly better. But if "replace any insn" turns out to be too complex we can limit it to replacing nop5 or replacing simple insns in the prologue like push, mov. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-02 21:08 ` Alexei Starovoitov @ 2024-03-02 21:49 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2024-03-02 21:49 UTC (permalink / raw) To: Alexei Starovoitov Cc: Jiri Olsa, yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On 03/02, Alexei Starovoitov wrote: > > I suspect it's all working fine already. > Only x86 is using single byte uprobe. > All other archs are using 2 or 4 byte. Yes, so we have UPROBE_SWBP_INSN_SIZE > So replacing an insn or two with a call should work. Please note that __uprobe_register(offset) fails if !IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE) Not to mention that if "call" replaces 2 insns we have another problem: what if another consumer wants to probe the 2nd insn? but perhaps (quite possibly) I misunderstand you. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-02-29 14:39 [LSF/MM/BPF TOPIC] faster uprobes Jiri Olsa 2024-03-01 0:25 ` Andrii Nakryiko @ 2024-03-01 19:39 ` Kui-Feng Lee 2024-03-05 17:18 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-01 19:39 UTC (permalink / raw) To: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 2/29/24 06:39, Jiri Olsa wrote: > One of uprobe pain points is having slow execution that involves > two traps in worst case scenario or single trap if the original > instruction can be emulated. For return uprobes there's one extra > trap on top of that. > > My current idea on how to make this faster is to follow the optimized > kprobes and replace the normal uprobe trap instruction with jump to > user space trampoline that: > > - executes syscall to call uprobe consumers callbacks > - executes original instructions > - jumps back to continue with the original code > > There are of course corner cases where above will have trouble or > won't work completely, like: > > - executing original instructions in the trampoline is tricky wrt > rip relative addressing > > - some instructions we can't move to trampoline at all > > - the uprobe address is on page boundary so the jump instruction to > trampoline would span across 2 pages, hence the page replace won't > be atomic, which might cause issues > > - ... ? many others I'm sure > > Still with all the limitations I think we could be able to speed up > some amount of the uprobes, which seems worth doing. Just a random idea related to this. Could we also run jit code of bpf programs in the user space to collect information instead of going back to the kernel every time? These jit code should not be able to access helpers or kfuncs, but they still can collect and aggregate data, store data in bpf maps, and change behavior of user space programs. 
> > I'd like to have the discussion on the topic and get some agreement > or directions on how this should be done. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-01 19:39 ` Kui-Feng Lee @ 2024-03-05 17:18 ` Jiri Olsa 2024-03-05 23:53 ` Song Liu 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 17:18 UTC (permalink / raw) To: Kui-Feng Lee Cc: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > One of uprobe pain points is having slow execution that involves > > two traps in worst case scenario or single trap if the original > > instruction can be emulated. For return uprobes there's one extra > > trap on top of that. > > > > My current idea on how to make this faster is to follow the optimized > > kprobes and replace the normal uprobe trap instruction with jump to > > user space trampoline that: > > > > - executes syscall to call uprobe consumers callbacks > > - executes original instructions > > - jumps back to continue with the original code > > > > There are of course corner cases where above will have trouble or > > won't work completely, like: > > > > - executing original instructions in the trampoline is tricky wrt > > rip relative addressing > > > > - some instructions we can't move to trampoline at all > > > > - the uprobe address is on page boundary so the jump instruction to > > trampoline would span across 2 pages, hence the page replace won't > > be atomic, which might cause issues > > > > - ... ? many others I'm sure > > > > Still with all the limitations I think we could be able to speed up > > some amount of the uprobes, which seems worth doing. > > Just a random idea related to this. > Could we also run jit code of bpf programs in the user space to collect > information instead of going back to the kernel every time? sorry for late reply, do you mean like ubpf? 
the scope of this change is to speed up the generic uprobe, ebpf is just one of the consumers jirka > These jit code should not be able to access helpers or kfuncs, but they > still can collect and aggregate data, store data in bpf maps, and change > behavior of user space programs. > > > > > I'd like to have the discussion on the topic and get some agreement > > or directions on how this should be done. > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 17:18 ` Jiri Olsa @ 2024-03-05 23:53 ` Song Liu 2024-03-07 9:15 ` Jiri Olsa 2024-03-07 23:02 ` Kui-Feng Lee 0 siblings, 2 replies; 31+ messages in thread From: Song Liu @ 2024-03-05 23:53 UTC (permalink / raw) To: Jiri Olsa Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > > One of uprobe pain points is having slow execution that involves > > > two traps in worst case scenario or single trap if the original > > > instruction can be emulated. For return uprobes there's one extra > > > trap on top of that. > > > > > > My current idea on how to make this faster is to follow the optimized > > > kprobes and replace the normal uprobe trap instruction with jump to > > > user space trampoline that: > > > > > > - executes syscall to call uprobe consumers callbacks > > > - executes original instructions > > > - jumps back to continue with the original code > > > > > > There are of course corner cases where above will have trouble or > > > won't work completely, like: > > > > > > - executing original instructions in the trampoline is tricky wrt > > > rip relative addressing > > > > > > - some instructions we can't move to trampoline at all > > > > > > - the uprobe address is on page boundary so the jump instruction to > > > trampoline would span across 2 pages, hence the page replace won't > > > be atomic, which might cause issues > > > > > > - ... ? many others I'm sure > > > > > > Still with all the limitations I think we could be able to speed up > > > some amount of the uprobes, which seems worth doing. > > > > Just a random idea related to this. 
> > Could we also run jit code of bpf programs in the user space to collect > > information instead of going back to the kernel every time? I was thinking about a similar idea. I guess these user space BPF programs will have limited features; we can probably use them to update bpf maps. For this limited scope, we still need bpf_arena. Otherwise, the user space bpf program will need to update the bpf maps with sys_bpf(), which adds the same overhead as triggering the program with a syscall. > > sorry for late reply, do you mean like ubpf? the scope of this change > is to speed up the generic uprobe, ebpf is just one of the consumers I guess this means we need a new syscall? Thanks, Song ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 23:53 ` Song Liu @ 2024-03-07 9:15 ` Jiri Olsa 2024-03-07 23:02 ` Kui-Feng Lee 1 sibling, 0 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-07 9:15 UTC (permalink / raw) To: Song Liu Cc: Jiri Olsa, Kui-Feng Lee, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 03:53:35PM -0800, Song Liu wrote: > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > > > > > > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > > > One of uprobe pain points is having slow execution that involves > > > > two traps in worst case scenario or single trap if the original > > > > instruction can be emulated. For return uprobes there's one extra > > > > trap on top of that. > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > user space trampoline that: > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > - executes original instructions > > > > - jumps back to continue with the original code > > > > > > > > There are of course corner cases where above will have trouble or > > > > won't work completely, like: > > > > > > > > - executing original instructions in the trampoline is tricky wrt > > > > rip relative addressing > > > > > > > > - some instructions we can't move to trampoline at all > > > > > > > > - the uprobe address is on page boundary so the jump instruction to > > > > trampoline would span across 2 pages, hence the page replace won't > > > > be atomic, which might cause issues > > > > > > > > - ... ? many others I'm sure > > > > > > > > Still with all the limitations I think we could be able to speed up > > > > some amount of the uprobes, which seems worth doing. > > > > > > Just a random idea related to this. 
> > > Could we also run jit code of bpf programs in the user space to collect > > > information instead of going back to the kernel every time? > > I was thinking about a similar idea. I guess these user space BPF > programs will have limited features that we can probably use them > update bpf maps. For this limited scope, we still need bpf_arena. > Otherwise, the user space bpf program will need to update the bpf > maps with sys_bpf(), which adds the same overhead as triggering > the program with a syscall. > > > > > sorry for late reply, do you mean like ubpf? the scope of this change > > is to speed up the generic uprobe, ebpf is just one of the consumers > > I guess this means we need a new syscall? yes that's the idea, to replace the trap with syscall, so far I used light version of that for initial testing [1] jirka [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 23:53 ` Song Liu 2024-03-07 9:15 ` Jiri Olsa @ 2024-03-07 23:02 ` Kui-Feng Lee 2024-03-08 15:43 ` Andrei Matei 1 sibling, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-07 23:02 UTC (permalink / raw) To: Song Liu, Jiri Olsa Cc: bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/5/24 15:53, Song Liu wrote: > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >> >> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>> >>> >>> >>> On 2/29/24 06:39, Jiri Olsa wrote: >>>> One of uprobe pain points is having slow execution that involves >>>> two traps in worst case scenario or single trap if the original >>>> instruction can be emulated. For return uprobes there's one extra >>>> trap on top of that. >>>> >>>> My current idea on how to make this faster is to follow the optimized >>>> kprobes and replace the normal uprobe trap instruction with jump to >>>> user space trampoline that: >>>> >>>> - executes syscall to call uprobe consumers callbacks >>>> - executes original instructions >>>> - jumps back to continue with the original code >>>> >>>> There are of course corner cases where above will have trouble or >>>> won't work completely, like: >>>> >>>> - executing original instructions in the trampoline is tricky wrt >>>> rip relative addressing >>>> >>>> - some instructions we can't move to trampoline at all >>>> >>>> - the uprobe address is on page boundary so the jump instruction to >>>> trampoline would span across 2 pages, hence the page replace won't >>>> be atomic, which might cause issues >>>> >>>> - ... ? many others I'm sure >>>> >>>> Still with all the limitations I think we could be able to speed up >>>> some amount of the uprobes, which seems worth doing. >>> >>> Just a random idea related to this. 
>>> Could we also run jit code of bpf programs in the user space to collect >>> information instead of going back to the kernel every time? > > I was thinking about a similar idea. I guess these user space BPF > programs will have limited features that we can probably use them > update bpf maps. For this limited scope, we still need bpf_arena. > Otherwise, the user space bpf program will need to update the bpf > maps with sys_bpf(), which adds the same overhead as triggering That is true. However, even without bpf_arena, it still works with some workarounds without going through sys_bpf(). > the program with a syscall. > >> >> sorry for late reply, do you mean like ubpf? the scope of this change >> is to speed up the generic uprobe, ebpf is just one of the consumers > > I guess this means we need a new syscall? > > Thanks, > Song ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-07 23:02 ` Kui-Feng Lee @ 2024-03-08 15:43 ` Andrei Matei 2024-03-12 17:16 ` Kui-Feng Lee 0 siblings, 1 reply; 31+ messages in thread From: Andrei Matei @ 2024-03-08 15:43 UTC (permalink / raw) To: Kui-Feng Lee Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 3/5/24 15:53, Song Liu wrote: > > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > >> > >> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > >>> > >>> > >>> > >>> On 2/29/24 06:39, Jiri Olsa wrote: > >>>> One of uprobe pain points is having slow execution that involves > >>>> two traps in worst case scenario or single trap if the original > >>>> instruction can be emulated. For return uprobes there's one extra > >>>> trap on top of that. > >>>> > >>>> My current idea on how to make this faster is to follow the optimized > >>>> kprobes and replace the normal uprobe trap instruction with jump to > >>>> user space trampoline that: > >>>> > >>>> - executes syscall to call uprobe consumers callbacks > >>>> - executes original instructions > >>>> - jumps back to continue with the original code > >>>> > >>>> There are of course corner cases where above will have trouble or > >>>> won't work completely, like: > >>>> > >>>> - executing original instructions in the trampoline is tricky wrt > >>>> rip relative addressing > >>>> > >>>> - some instructions we can't move to trampoline at all > >>>> > >>>> - the uprobe address is on page boundary so the jump instruction to > >>>> trampoline would span across 2 pages, hence the page replace won't > >>>> be atomic, which might cause issues > >>>> > >>>> - ... ? many others I'm sure > >>>> > >>>> Still with all the limitations I think we could be able to speed up > >>>> some amount of the uprobes, which seems worth doing. 
> >>> > >>> Just a random idea related to this. > >>> Could we also run jit code of bpf programs in the user space to collect > >>> information instead of going back to the kernel every time? > > > > I was thinking about a similar idea. I guess these user space BPF > > programs will have limited features that we can probably use them > > update bpf maps. For this limited scope, we still need bpf_arena. > > Otherwise, the user space bpf program will need to update the bpf > > maps with sys_bpf(), which adds the same overhead as triggering > > That is true. However, even without bpf_arena, it still works with > some workarounds without going through sys_bpf(). Anything making uprobes faster would be very welcomed for my project. The biggest performance problem for us is the cost of bpf_probe_read_user() relative to raw memory access. Every call to this helper walks the process' page table to check that the access would not cause a fault (I think); this is very slow. I wonder if there's some other option that would keep the safety requirement for the memory access -- I'm imagining an optimistic mode where the raw access is performed (in the target process' memory space) and, in the rare case when a fault happens, the kernel would somehow recover from the fault and fail the bpf_probe_read_user() helper. Would something like that be technically feasible / has there been any prior interest in faster access to user memory? A more limited option that might be helpful would be a vectorized version of bpf_probe_read_user() that verifies many pointers at once. > > > the program with a syscall. > > > >> > >> sorry for late reply, do you mean like ubpf? the scope of this change > >> is to speed up the generic uprobe, ebpf is just one of the consumers > > > > I guess this means we need a new syscall? > > > > Thanks, > > Song > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-08 15:43 ` Andrei Matei @ 2024-03-12 17:16 ` Kui-Feng Lee 2024-03-13 1:32 ` Andrei Matei 0 siblings, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-12 17:16 UTC (permalink / raw) To: Andrei Matei Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/8/24 07:43, Andrei Matei wrote: > On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 3/5/24 15:53, Song Liu wrote: >>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >>>> >>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>>>> >>>>> >>>>> >>>>> On 2/29/24 06:39, Jiri Olsa wrote: >>>>>> One of uprobe pain points is having slow execution that involves >>>>>> two traps in worst case scenario or single trap if the original >>>>>> instruction can be emulated. For return uprobes there's one extra >>>>>> trap on top of that. >>>>>> >>>>>> My current idea on how to make this faster is to follow the optimized >>>>>> kprobes and replace the normal uprobe trap instruction with jump to >>>>>> user space trampoline that: >>>>>> >>>>>> - executes syscall to call uprobe consumers callbacks >>>>>> - executes original instructions >>>>>> - jumps back to continue with the original code >>>>>> >>>>>> There are of course corner cases where above will have trouble or >>>>>> won't work completely, like: >>>>>> >>>>>> - executing original instructions in the trampoline is tricky wrt >>>>>> rip relative addressing >>>>>> >>>>>> - some instructions we can't move to trampoline at all >>>>>> >>>>>> - the uprobe address is on page boundary so the jump instruction to >>>>>> trampoline would span across 2 pages, hence the page replace won't >>>>>> be atomic, which might cause issues >>>>>> >>>>>> - ... ? 
many others I'm sure >>>>>> >>>>>> Still with all the limitations I think we could be able to speed up >>>>>> some amount of the uprobes, which seems worth doing. >>>>> >>>>> Just a random idea related to this. >>>>> Could we also run jit code of bpf programs in the user space to collect >>>>> information instead of going back to the kernel every time? >>> >>> I was thinking about a similar idea. I guess these user space BPF >>> programs will have limited features that we can probably use them >>> update bpf maps. For this limited scope, we still need bpf_arena. >>> Otherwise, the user space bpf program will need to update the bpf >>> maps with sys_bpf(), which adds the same overhead as triggering >> >> That is true. However, even without bpf_arena, it still works with >> some workarounds without going through sys_bpf(). > > Anything making uprobes faster would be very welcomed for my project. The > biggest performance problem for us is the cost of bpf_probe_read_user() > relative to raw memory access. Every call to this helper walks the process' "raw memory access"? Do you mean not going through any helper function, reading from a pointer directly? > page table to check that the access would not cause a fault (I think); this is > very slow. I wonder if there's some other option that would keep the safety > requirement for the memory access -- I'm imagining an optimistic mode where the > raw access is performed (in the target process' memory space) and, in the rare > case when a fault happens, the kernel would somehow recover from the fault and I am not very familiar with this part. I read the implementation of bpf_probe_read_user() a little bit. It does what you mentioned here. It would cause page faults, however, the handler will skip the instruction leaving the counter non-zero. By checking the counter, it knows the instruction is not completed, and returns an error. I am curious about what your access pattern looks like. 
Does it access a large number of small chunks of data? Or, does it access a small number of big chunks of data? > fail the bpf_probe_read_user() helper. Would something like that be technically > feasible / has there been any prior interest in faster access to user memory > > A more limited option that might be helpful would be a vectorized version of > bpf_probe_read_user() that verifies many pointers at once. > > >> >>> the program with a syscall. >>> >>>> >>>> sorry for late reply, do you mean like ubpf? the scope of this change >>>> is to speed up the generic uprobe, ebpf is just one of the consumers >>> >>> I guess this means we need a new syscall? >>> >>> Thanks, >>> Song >> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-12 17:16 ` Kui-Feng Lee @ 2024-03-13 1:32 ` Andrei Matei 2024-03-13 5:42 ` Kui-Feng Lee 0 siblings, 1 reply; 31+ messages in thread From: Andrei Matei @ 2024-03-13 1:32 UTC (permalink / raw) To: Kui-Feng Lee Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 12, 2024 at 1:16 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 3/8/24 07:43, Andrei Matei wrote: > > On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > >> > >> > >> > >> On 3/5/24 15:53, Song Liu wrote: > >>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > >>>> > >>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 2/29/24 06:39, Jiri Olsa wrote: > >>>>>> One of uprobe pain points is having slow execution that involves > >>>>>> two traps in worst case scenario or single trap if the original > >>>>>> instruction can be emulated. For return uprobes there's one extra > >>>>>> trap on top of that. 
> >>>>>> > >>>>>> My current idea on how to make this faster is to follow the optimized > >>>>>> kprobes and replace the normal uprobe trap instruction with jump to > >>>>>> user space trampoline that: > >>>>>> > >>>>>> - executes syscall to call uprobe consumers callbacks > >>>>>> - executes original instructions > >>>>>> - jumps back to continue with the original code > >>>>>> > >>>>>> There are of course corner cases where above will have trouble or > >>>>>> won't work completely, like: > >>>>>> > >>>>>> - executing original instructions in the trampoline is tricky wrt > >>>>>> rip relative addressing > >>>>>> > >>>>>> - some instructions we can't move to trampoline at all > >>>>>> > >>>>>> - the uprobe address is on page boundary so the jump instruction to > >>>>>> trampoline would span across 2 pages, hence the page replace won't > >>>>>> be atomic, which might cause issues > >>>>>> > >>>>>> - ... ? many others I'm sure > >>>>>> > >>>>>> Still with all the limitations I think we could be able to speed up > >>>>>> some amount of the uprobes, which seems worth doing. > >>>>> > >>>>> Just a random idea related to this. > >>>>> Could we also run jit code of bpf programs in the user space to collect > >>>>> information instead of going back to the kernel every time? > >>> > >>> I was thinking about a similar idea. I guess these user space BPF > >>> programs will have limited features that we can probably use them > >>> update bpf maps. For this limited scope, we still need bpf_arena. > >>> Otherwise, the user space bpf program will need to update the bpf > >>> maps with sys_bpf(), which adds the same overhead as triggering > >> > >> That is true. However, even without bpf_arena, it still works with > >> some workarounds without going through sys_bpf(). > > > > Anything making uprobes faster would be very welcomed for my project. The > > biggest performance problem for us is the cost of bpf_probe_read_user() > > relative to raw memory access. 
Every call to this helper walks the process' > > "raw memory access"? Do you mean not going through any helper function, > reading from a pointer directly? Right. I recognize that, as long as bpf runs "in the kernel", one cannot simply dereference a user-space pointer since the kernel is a different virtual memory space (*). Still, I wish bpf_probe_read_user() were faster. (*) Or, is it indeed a different memory space, or is the kernel's virtual address space mapped into every process? Did this change through KPTI? I would be curious to read a good resource on what exactly it means to switch from user-space to the kernel and back, if such a thing exists. > > > page table to check that the access would not cause a fault (I think); this is > > very slow. I wonder if there's some other option that would keep the safety > > requirement for the memory access -- I'm imagining an optimistic mode where the > > raw access is performed (in the target process' memory space) and, in the rare > > case when a fault happens, the kernel would somehow recover from the fault and > > I am not very familiar with this part. I read the implementation of > bpf_probe_read_user() a little bit. It does what you mentioned here. It > would cause page faults, however, the handler will skip the instruction > leaving the counter non-zero. By checking the counter, it knows the > instruction is not completed, and returns an error. > > I am curious about what your access pattern looks like. Does it access a > large number of small chunks of data? Or, does it access a small number > of big chunks of data? My access pattern looks like a lot of small reads. Some of these reads could be done at the same time if we had a vectorized API (i.e. some of the pointers are known in advance); for others there are data dependencies (i.e. we need to dereference a pointer to know what we'll want to read next). 
Specifically, the use case is a debugger of sorts which uses BPF uprobes for poking around in the target process' memory, rather than the more traditional ptrace-based techniques (ptrace being very slow). This debugger needs to walk a lot of thread stacks by following stack pointers or by using DWARF unwind information, and then it further reads data structures from the target process' stacks and heaps, chasing pointers recursively. > > > fail the bpf_probe_read_user() helper. Would something like that be technically > > feasible / has there been any prior interest in faster access to user memory > > > > A more limited option that might be helpful would be a vectorized version of > > bpf_probe_read_user() that verifies many pointers at once. > > > > > >> > >>> the program with a syscall. > >>> > >>>> > >>>> sorry for late reply, do you mean like ubpf? the scope of this change > >>>> is to speed up the generic uprobe, ebpf is just one of the consumers > >>> > >>> I guess this means we need a new syscall? > >>> > >>> Thanks, > >>> Song > >> ^ permalink raw reply [flat|nested] 31+ messages in thread
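[Editor's note: the data-dependent access pattern Andrei describes -- read a node, follow its pointer, repeat -- can be sketched with a bounded walk over a safe-read primitive. Here plain memcpy stands in for bpf_probe_read_user(), and the node layout is invented for the example.]

```c
#include <stddef.h>
#include <string.h>

struct node {                 /* hypothetical target-process structure */
    long value;
    struct node *next;
};

/* Stand-in for bpf_probe_read_user(): in a real BPF program each
 * dereference would be one separately checked helper call. */
static int probe_read(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Walk at most max_depth nodes, summing their values. Each hop is a
 * data-dependent read: the next address is only known after the
 * previous read completes, so these reads cannot be batched, and the
 * per-read fixed cost is paid once per hop. */
static long chase(const struct node *head, int max_depth)
{
    long sum = 0;
    struct node cur;

    while (head && max_depth-- > 0) {
        if (probe_read(&cur, head, sizeof cur) != 0)
            break;            /* faulting read ends the walk */
        sum += cur.value;
        head = cur.next;
    }
    return sum;
}
```

This is why a vectorized helper only covers part of the use case: the pointers known in advance can be batched, but each pointer-chase hop still costs one checked read.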
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-13 1:32 ` Andrei Matei @ 2024-03-13 5:42 ` Kui-Feng Lee 0 siblings, 0 replies; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-13 5:42 UTC (permalink / raw) To: Andrei Matei Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/12/24 18:32, Andrei Matei wrote: > On Tue, Mar 12, 2024 at 1:16 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 3/8/24 07:43, Andrei Matei wrote: >>> On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >>>> >>>> >>>> >>>> On 3/5/24 15:53, Song Liu wrote: >>>>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >>>>>> >>>>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 2/29/24 06:39, Jiri Olsa wrote: >>>>>>>> One of uprobe pain points is having slow execution that involves >>>>>>>> two traps in worst case scenario or single trap if the original >>>>>>>> instruction can be emulated. For return uprobes there's one extra >>>>>>>> trap on top of that. 
>>>>>>>> >>>>>>>> My current idea on how to make this faster is to follow the optimized >>>>>>>> kprobes and replace the normal uprobe trap instruction with jump to >>>>>>>> user space trampoline that: >>>>>>>> >>>>>>>> - executes syscall to call uprobe consumers callbacks >>>>>>>> - executes original instructions >>>>>>>> - jumps back to continue with the original code >>>>>>>> >>>>>>>> There are of course corner cases where above will have trouble or >>>>>>>> won't work completely, like: >>>>>>>> >>>>>>>> - executing original instructions in the trampoline is tricky wrt >>>>>>>> rip relative addressing >>>>>>>> >>>>>>>> - some instructions we can't move to trampoline at all >>>>>>>> >>>>>>>> - the uprobe address is on page boundary so the jump instruction to >>>>>>>> trampoline would span across 2 pages, hence the page replace won't >>>>>>>> be atomic, which might cause issues >>>>>>>> >>>>>>>> - ... ? many others I'm sure >>>>>>>> >>>>>>>> Still with all the limitations I think we could be able to speed up >>>>>>>> some amount of the uprobes, which seems worth doing. >>>>>>> >>>>>>> Just a random idea related to this. >>>>>>> Could we also run jit code of bpf programs in the user space to collect >>>>>>> information instead of going back to the kernel every time? >>>>> >>>>> I was thinking about a similar idea. I guess these user space BPF >>>>> programs will have limited features that we can probably use them >>>>> update bpf maps. For this limited scope, we still need bpf_arena. >>>>> Otherwise, the user space bpf program will need to update the bpf >>>>> maps with sys_bpf(), which adds the same overhead as triggering >>>> >>>> That is true. However, even without bpf_arena, it still works with >>>> some workarounds without going through sys_bpf(). >>> >>> Anything making uprobes faster would be very welcomed for my project. The >>> biggest performance problem for us is the cost of bpf_probe_read_user() >>> relative to raw memory access. 
Every call to this helper walks the process' >> >> "raw memory access"? Do you mean not going through any helper function, >> reading from a pointer directly? > > Right. > I recognize that, as long as bpf runs "in the kernel", one cannot simply > dereference a user-space pointer since the kernel is a different virtual memory > space (*). Still, I wish there bpf_probe_read_user() were faster. > > (*) Or, is it indeed a different memory space or is the kernel's virtual > address space mapped into every process? Did this change through KPTI? I would > be curious to read a good resource on what exactly it means to switch from > user-space to the kernel and back, if such a thing exists. FYI! This is architecture dependent. AFAIK, on x86 platforms the kernel can access user space memory directly if it is in a process/task context, but you should not rely on it. If you look into bpf_probe_read_user(), it eventually does something like "rep movsb" on x86 platforms, accessing user space memory directly with some extra checks. So the bottleneck here can be the extra checks and the memory copying. If you access small chunks like what you said below, the overhead of the checks could be expensive. > > >> > >>> page table to check that the access would not cause a fault (I think); this is >>> very slow. I wonder if there's some other option that would keep the safety >>> requirement for the memory access -- I'm imagining an optimistic mode where the >>> raw access is performed (in the target process' memory space) and, in the rare >>> case when a fault happens, the kernel would somehow recover from the fault and >> >> I am not very familiar with this part. I read the implementation of >> bpf_probe_read_user() a little bit. It does what you mentioned here. It >> would cause page faults, however, the handler will skip the instruction >> leaving the counter non-zero. By checking the counter, it knows the >> instruction is not completed, and returns an error. 
>> >> I am curious about what your access pattern looks like. Does it access a >> large number of small chunks of data? Or, does it access a small number >> of big chunks of data? > > My access pattern looks like a lot of small reads. Some of these reads could be > done at the same time if we had a vectorized API (i.e. some of the pointers are > known in advance); for others there are data dependencies (i.e. we need to > dereference a pointer to know what we'll want to read next). Specifically, the > use case is a debugger of sorts which uses BPF uprobes for poking around in the > target process' memory, rather than the more traditional ptrace-based > techniques (ptrace being very slow). This debugger needs to walk a lot of > thread stacks by following stack pointers or by using DWARF unwind information, > and then it further reads data structures from the target process' stacks and > heaps, chasing pointers recursively. A related piece of information: you may already know that bpf_probe_read_user() can fail if a page fault happens. A vectorized API probably doesn't change that; it is a limitation of non-sleepable BPF programs. Sleepable BPF programs should be able to overcome it. > > > fail the bpf_probe_read_user() helper. Would something like that be technically > > feasible / has there been any prior interest in faster access to user memory > > > > A more limited option that might be helpful would be a vectorized version of > > bpf_probe_read_user() that verifies many pointers at once. > > > > > >> > >>> the program with a syscall. > >>> > >>>> > >>>> sorry for late reply, do you mean like ubpf? the scope of this change > >>>> is to speed up the generic uprobe, ebpf is just one of the consumers > >>> > >>> I guess this means we need a new syscall? > >>> > >>> Thanks, > >>> Song > >> ^ permalink raw reply [flat|nested] 31+ messages in thread
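[Editor's note: Kui-Feng's point that the per-call checks dominate for many small reads can be made concrete with a toy model. Each checked read pays a fixed validation cost (the access_ok()/fault-setup analogue) regardless of size, so n small reads pay n checks while one batched read of the same bytes pays one. All names here are invented for illustration.]

```c
#include <stddef.h>
#include <string.h>

static unsigned long checks_done; /* counts the fixed per-call cost */

/* Toy model of a checked copy: one range check per call, then the copy.
 * (In the real helper the fixed cost is the validation and fault setup;
 * on x86 something like "rep movsb" then does the actual copying.) */
static int checked_read(void *dst, const void *src, size_t len)
{
    checks_done++;                /* fixed overhead, independent of len */
    memcpy(dst, src, len);
    return 0;
}
```

With 16 four-byte reads the fixed cost is paid 16 times; a single 64-byte read pays it once, which is the motivation for the vectorized API discussed above.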
end of thread, other threads:[~2024-03-13 5:42 UTC | newest]

Thread overview: 31+ messages
2024-02-29 14:39 [LSF/MM/BPF TOPIC] faster uprobes Jiri Olsa
2024-03-01  0:25 ` Andrii Nakryiko
2024-03-01  8:18   ` Jiri Olsa
2024-03-01 17:01     ` Alexei Starovoitov
2024-03-01 17:26       ` Andrii Nakryiko
2024-03-01 18:08         ` Yunwei 123
2024-03-03 10:20         ` Jiri Olsa
2024-03-05  0:55           ` Andrii Nakryiko
2024-03-05  8:24             ` Jiri Olsa
2024-03-05 15:30               ` Jiri Olsa
2024-03-05 17:30                 ` Andrii Nakryiko
2024-03-11 10:59                   ` Jiri Olsa
2024-03-11 15:06                     ` Oleg Nesterov
2024-03-11 16:46                       ` Jiri Olsa
2024-03-11 17:02                         ` Oleg Nesterov
2024-03-11 21:11                           ` Jiri Olsa
2024-03-11 17:32                       ` Andrii Nakryiko
2024-03-11 21:26                         ` Jiri Olsa
2024-03-11 23:05                           ` Andrii Nakryiko
2024-03-02 20:46     ` Jiri Olsa
2024-03-02 21:08       ` Alexei Starovoitov
2024-03-02 21:49         ` Oleg Nesterov
2024-03-01 19:39 ` Kui-Feng Lee
2024-03-05 17:18   ` Jiri Olsa
2024-03-05 23:53     ` Song Liu
2024-03-07  9:15       ` Jiri Olsa
2024-03-07 23:02       ` Kui-Feng Lee
2024-03-08 15:43         ` Andrei Matei
2024-03-12 17:16           ` Kui-Feng Lee
2024-03-13  1:32             ` Andrei Matei
2024-03-13  5:42               ` Kui-Feng Lee