* [LSF/MM/BPF TOPIC] faster uprobes
@ 2024-02-29 14:39 Jiri Olsa
2024-03-01 0:25 ` Andrii Nakryiko
2024-03-01 19:39 ` Kui-Feng Lee
0 siblings, 2 replies; 31+ messages in thread
From: Jiri Olsa @ 2024-02-29 14:39 UTC (permalink / raw)
To: bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song,
Oleg Nesterov, Daniel Borkmann
One of the uprobe pain points is slow execution: the worst case involves
two traps, or a single trap if the original instruction can be emulated.
Return uprobes add one extra trap on top of that.

My current idea on how to make this faster is to follow the optimized
kprobes approach and replace the normal uprobe trap instruction with a
jump to a user space trampoline that:

- executes a syscall to call the uprobe consumer callbacks
- executes the original instructions
- jumps back to continue with the original code

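For illustration, a rough pseudo-assembly sketch of what such a per-probe
trampoline could look like on x86-64 — the syscall number, its argument,
and the overall layout are hypothetical, not an existing ABI:

```
uprobe_trampoline:                 # reached via jmp patched over the probed insn
    <save registers>               # preserve application state
    mov  rdi, <uprobe cookie>      # hypothetical argument identifying the uprobe
    mov  rax, <NR_uprobe_exec>     # hypothetical syscall running consumer callbacks
    syscall
    <restore registers>
    <original instruction(s)>      # out-of-line copy, fixed up if rip-relative
    jmp  <probe address + length>  # continue with the original code
```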
There are of course corner cases where the above will have trouble or
won't work at all, like:

- executing the original instructions in the trampoline is tricky wrt
  rip-relative addressing

- some instructions we can't move to the trampoline at all

- the uprobe address is on a page boundary, so the jump instruction to
  the trampoline would span 2 pages, hence the page replacement won't
  be atomic, which might cause issues

- ... ? many others I'm sure

Still, with all the limitations, I think we should be able to speed up a
fair number of uprobes, which seems worth doing.

I'd like to have a discussion on the topic and reach some agreement or
direction on how this should be done.
^ permalink raw reply [flat|nested] 31+ messages in thread

* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-01 0:25 UTC (permalink / raw)
To: Jiri Olsa
Cc: bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov,
	Daniel Borkmann

On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> One of uprobe pain points is having slow execution that involves
> two traps in worst case scenario or single trap if the original
> instruction can be emulated. For return uprobes there's one extra
> trap on top of that.
>
> My current idea on how to make this faster is to follow the optimized
> kprobes and replace the normal uprobe trap instruction with jump to
> user space trampoline that:
>
> - executes syscall to call uprobe consumers callbacks

Did you get a chance to measure relative performance of syscall vs
int3 interrupt handling? If not, do you think you'll be able to get
some numbers by the time the conference starts? This should inform the
decision whether it even makes sense to go through all the trouble.

> - executes original instructions
> - jumps back to continue with the original code
>
> [...]
>
> Still with all the limitations I think we could be able to speed up
> some amount of the uprobes, which seems worth doing.
>
> I'd like to have the discussion on the topic and get some agreement
> or directions on how this should be done.
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Jiri Olsa @ 2024-03-01 8:18 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song,
	Oleg Nesterov, Daniel Borkmann

On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > [...]
> >
> > - executes syscall to call uprobe consumers callbacks
>
> Did you get a chance to measure relative performance of syscall vs
> int3 interrupt handling? If not, do you think you'll be able to get
> some numbers by the time the conference starts? This should inform the
> decision whether it even makes sense to go through all the trouble.

right, will do that

jirka

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Alexei Starovoitov @ 2024-03-01 17:01 UTC (permalink / raw)
To: Jiri Olsa, yunwei356
Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song,
	Oleg Nesterov, Daniel Borkmann

On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > [...]
> > Did you get a chance to measure relative performance of syscall vs
> > int3 interrupt handling? [...]
>
> right, will do that

I believe Yusheng measured syscall vs uprobe performance
difference during LPC. iirc it was something like 3x.

Certainly necessary to have a benchmark.
selftests/bpf/bench has one for uprobe.
Probably should extend it with sys_bpf.

Regarding:
> replace the normal uprobe trap instruction with jump to
> user space trampoline

it should probably be a call to the trampoline instead of a jump,
unless you plan to generate a different trampoline for every location?

Also, how would you pick space for a trampoline in the target process?
Analyze /proc/pid/maps and look for gaps in executable sections?

We can start simple with a USDT that uses nop5 instead of nop1
and an explicit single trampoline for all USDT locations
that saves all (callee and caller saved) registers and
then does sys_bpf with a new cmd.

To replace nop5 with a call to the trampoline we can use the
text_poke_bp approach: replace the 1st byte with int3, replace bytes
2-5 with the target addr, then replace the 1st byte to make an actual
call insn.

Once patched there will be no simulation of insns or kernel traps.
Just normal user code that calls into the trampoline, which calls
sys_bpf, and returns back.
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-01 17:26 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Jiri Olsa, yunwei356, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
> I believe Yusheng measured syscall vs uprobe performance
> difference during LPC. iirc it was something like 3x.

Do you have a link to slides? Was it actual uprobe vs just some fast
syscall (not doing BPF program execution) comparison? Or comparing the
performance of int3 handling vs equivalent syscall handling?

I suspect it's the former, and so probably not that representative.
I'm curious about the performance of going
userspace->kernel->userspace through int3 vs syscall (all other things
being equal).

> [...]
> Also how would you pick a space for a trampoline in the target process ?
> Analyze /proc/pid/maps and look for gaps in executable sections?

The kernel already does that for uretprobes: it adds a new "[uprobes]"
memory mapping, so this part is already implemented.

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Yunwei 123 @ 2024-03-01 18:08 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

Hi!

I did some basic experiments on bpftime, combining the user space
trampoline in bpftime with a bpf_prog_test_run syscall to run eBPF
code in the kernel. On my laptop, it was about 2-3x faster than the
original trap-based uprobe.

The experiment code is in
https://github.com/eunomia-bpf/bpftime/blob/71f13ae80e93e8ff45e1b0320c25ff14cb25b4ba/runtime/src/bpftime_prog.cpp#L113
(that's just a PoC, not kernel patches).

On Fri, Mar 1, 2024 at 5:27 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Jiri Olsa @ 2024-03-03 10:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, Jiri Olsa, yunwei356, bpf, Alexei Starovoitov,
	lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> [...]
> I suspect it's the former, and so probably not that representative.
> I'm curious about the performance of going
> userspace->kernel->userspace through int3 vs syscall (all other things
> being equal).

I have a simple test [1] comparing:
- uprobe with 2 traps
- uprobe with 1 trap
- syscall executing uprobe

the syscall takes the uprobe address as argument, finds the uprobe and
executes its consumers, which should be comparable to what the
trampoline will do

the test does the same amount of loops triggering each uprobe type and
measures the time it took

  # ./test_progs -t uprobe_syscall_bench -v
  bpf_testmod.ko is already unloaded.
  Loading bpf_testmod.ko...
  Successfully loaded bpf_testmod.ko.
  test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
  test_bench_1:PASS:uprobe_bench__attach 0 nsec
  test_bench_1:PASS:uprobe1_cnt 0 nsec
  test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
  test_bench_1:PASS:uprobe2_cnt 0 nsec
  test_bench_1: uprobes (1 trap) in 36.439s
  test_bench_1: uprobes (2 trap) in 91.960s
  test_bench_1: syscalls in 17.872s
  #395/1 uprobe_syscall_bench/bench_1:OK
  #395 uprobe_syscall_bench:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

syscall uprobe execution seems to be ~2x faster than the 1 trap uprobe
and ~5x faster than the 2 traps uprobe

jirka

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko @ 2024-03-05 0:55 UTC (permalink / raw)
To: Jiri Olsa
Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc,
	Yonghong Song, Oleg Nesterov, Daniel Borkmann

On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> [...]
> test_bench_1: uprobes (1 trap) in 36.439s
> test_bench_1: uprobes (2 trap) in 91.960s
> test_bench_1: syscalls in 17.872s
> [...]
>
> syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> and ~5x faster than 2 traps uprobe

Thanks for running benchmarks! I quickly looked at the selftest and
noticed this:

+/*
+ * Assuming following prolog:
+ *
+ *   6984ac: 55             push %rbp
+ *   6984ad: 48 89 e5       mov  %rsp,%rbp
+ */
+noinline void uprobe2_bench_trigger(void)
+{
+       asm volatile ("");
+}

This actually will be optimized out to just ret in -O2 mode (make
RELEASE=1 for selftests):

00000000005a0ce0 <uprobe2_bench_trigger>:
  5a0ce0: c3                    retq
  5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00      nopw %cs:(%rax,%rax)
  5a0cec: 0f 1f 40 00           nopl (%rax)

So be careful with that.

Also, I just updated our existing set of uprobe benchmarks (see [0]),
do you mind adding your syscall-based one as another one there and
running all of them and sharing the numbers with us? Very curious to
see both absolute and relative numbers from that benchmark. (And
please do build with RELEASE=1.)

You should be able to just run benchs/run_bench_uprobes.sh (also don't
forget to add your syscall-based benchmark to the list of benchmarks
in that shell script).

Thank you!

BTW, while I think patching multiple instructions for syscall-based
uprobe is going to be extremely tricky, I think at least u*ret*probe's
int3 can be pretty easily optimized away with syscall, given that the
kernel controls code generation there. If anything, it will get the
uretprobe case a bit closer to the performance of uprobe. Give it some
thought.

[0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/

> [...]
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 0:55 ` Andrii Nakryiko @ 2024-03-05 8:24 ` Jiri Olsa 2024-03-05 15:30 ` Jiri Olsa 2024-03-11 10:59 ` Jiri Olsa 0 siblings, 2 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 8:24 UTC (permalink / raw) To: Andrii Nakryiko Cc: Jiri Olsa, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > trap on top of that. > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > user space trampoline that: > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > decision whether it even makes sense to go through all the trouble. 
> > > > > > > > > > right, will do that > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > difference during LPC. iirc it was something like 3x. > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > performance of int3 handling vs equivalent syscall handling. > > > > > > I suspect it's the former, and so probably not that representative. > > > I'm curious about the performance of going > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > being equal). > > > > I have a simple test [1] comparing: > > - uprobe with 2 traps > > - uprobe with 1 trap > > - syscall executing uprobe > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > its consumers, which should be comparable to what the trampoline will do > > > > test does same amount of loops triggering each uprobe type and measures > > the time it took > > > > # ./test_progs -t uprobe_syscall_bench -v > > bpf_testmod.ko is already unloaded. > > Loading bpf_testmod.ko... > > Successfully loaded bpf_testmod.ko. > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > test_bench_1: uprobes (1 trap) in 36.439s > > test_bench_1: uprobes (2 trap) in 91.960s > > test_bench_1: syscalls in 17.872s > > #395/1 uprobe_syscall_bench/bench_1:OK > > #395 uprobe_syscall_bench:OK > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > and ~5x faster than 2 traps uprobe > > > > Thanks for running benchmarks! 
I quickly looked at the selftest and > noticed this: > > +/* > + * Assuming following prolog: > + * > + * 6984ac: 55 push %rbp > + * 6984ad: 48 89 e5 mov %rsp,%rbp > + */ > +noinline void uprobe2_bench_trigger(void) > +{ > + asm volatile (""); > +} > > This actually will be optimized out to just ret in -O2 mode (make > RELEASE=1 for selftests): > > 00000000005a0ce0 <uprobe2_bench_trigger>: > 5a0ce0: c3 retq > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > 5a0cec: 0f 1f 40 00 nopl (%rax) > > So be careful with that. right, I did not mean for this to be checked in, just wanted to get the numbers quickly > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > do you mind adding your syscall-based one as another one there and > running all of them and sharing the numbers with us? Very curious to > see both absolute and relative numbers from that benchmark. (and > please do build with RELEASE=1) > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > forget to add your syscall-based benchmark to the list of benchmarks > in that shell script). yes, saw it and was going to run/compare it.. it's good idea to add the syscall one and get all numbers together, will do that > > Thank you! > > > BTW, while I think patching multiple instructions for syscall-based > uprobe is going to be extremely tricky, I think at least u*ret*probe's > int3 can be pretty easily optimized away with syscall, given that the > kernel controls code generation there. If anything, it will get the > uretprobe case a bit closer to the performance of uprobe. Give it some > thought. hm, right.. the trampoline is there already, but at the moment is global and used by all uretprobes.. and int3 code moves userspace (changes rip) to the original return address.. 
maybe we can do that through the syscall as well or we could add a jump back to the uretprobe's original return address to the trampoline, but then we need a special trampoline for each uretprobe, I'll check thanks, jirka > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > jirka > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > Certainly necessary to have a benchmark. > > > > selftests/bpf/bench has one for uprobe. > > > > Probably should extend with sys_bpf. > > > > > > > > Regarding: > > > > > replace the normal uprobe trap instruction with jump to > > > > > user space trampoline > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > memory mapping, so this part is already implemented > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > and explicit single trampoline for all USDT locations > > > > that saves all (callee and caller saved) registers and > > > > then does sys_bpf with a new cmd. > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > replace 1st byte to make an actual call insn. > > > > > > > > Once patched there will be no simulation of insns or kernel traps. > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 8:24 ` Jiri Olsa @ 2024-03-05 15:30 ` Jiri Olsa 2024-03-05 17:30 ` Andrii Nakryiko 2024-03-11 10:59 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 15:30 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > right, will do that > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > I'm curious about the performance of going > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > being equal). > > > > > > I have a simple test [1] comparing: > > > - uprobe with 2 traps > > > - uprobe with 1 trap > > > - syscall executing uprobe > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > its consumers, which should be comparable to what the trampoline will do > > > > > > test does same amount of loops triggering each uprobe type and measures > > > the time it took > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. 
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > test_bench_1: uprobes (1 trap) in 36.439s > > > test_bench_1: uprobes (2 trap) in 91.960s > > > test_bench_1: syscalls in 17.872s > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > #395 uprobe_syscall_bench:OK > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > and ~5x faster than 2 traps uprobe > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > noticed this: > > > > +/* > > + * Assuming following prolog: > > + * > > + * 6984ac: 55 push %rbp > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > + */ > > +noinline void uprobe2_bench_trigger(void) > > +{ > > + asm volatile (""); > > +} > > > > This actually will be optimized out to just ret in -O2 mode (make > > RELEASE=1 for selftests): > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > 5a0ce0: c3 retq > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > So be careful with that. > > right, I did not mean for this to be checked in, just wanted to get the > numbers quickly > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > do you mind adding your syscall-based one as another one there and > > running all of them and sharing the numbers with us? Very curious to > > see both absolute and relative numbers from that benchmark. (and > > please do build with RELEASE=1) > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > forget to add your syscall-based benchmark to the list of benchmarks > > in that shell script). > > yes, saw it and was going to run/compare it.. 
it's good idea to add > the syscall one and get all numbers together, will do that seems to be consistent with my previous test: base : 15.854 ± 0.007M/s uprobe-nop : 2.859 ± 0.007M/s uprobe-push : 2.697 ± 0.002M/s uprobe-ret : 1.081 ± 0.000M/s uprobe-syscall : 5.520 ± 0.006M/s uretprobe-nop : 1.422 ± 0.002M/s uretprobe-push : 1.396 ± 0.002M/s uretprobe-ret : 0.787 ± 0.000M/s uretprobe-syscall: 1.888 ± 0.002M/s syscall uprobe is ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobe uretprobe is bit more tricky to compare, the speed up is there for the initial uprobe hit, then there's again the trap from the uretprobe trampoline I have the bench changes in here [1], I'll send it out together with rfc post jirka [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 > > > > > Thank you! > > > > > > BTW, while I think patching multiple instructions for syscall-based > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > int3 can be pretty easily optimized away with syscall, given that the > > kernel controls code generation there. If anything, it will get the > > uretprobe case a bit closer to the performance of uprobe. Give it some > > thought. > > hm, right.. the trampoline is there already, but at the moment is global > and used by all uretprobes.. and int3 code moves userspace (changes rip) > to the original return address.. maybe we can do that through syscall > as well > > or we could add jump back to uretprobe's original return addrress to the > trampoline, but then we need special trampoline for each uretprobe, > I'll check > > thanks, > jirka > > > > > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > > > jirka > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > > > Certainly necessary to have a benchmark. 
> > > > > selftests/bpf/bench has one for uprobe. > > > > > Probably should extend with sys_bpf. > > > > > > > > > > Regarding: > > > > > > replace the normal uprobe trap instruction with jump to > > > > > user space trampoline > > > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > > memory mapping, so this part is already implemented > > > > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > > and explicit single trampoline for all USDT locations > > > > > that saves all (callee and caller saved) registers and > > > > > then does sys_bpf with a new cmd. > > > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > > replace 1st byte to make an actual call insn. > > > > > > > > > > Once patched there will be no simulation of insns or kernel traps. > > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 15:30 ` Jiri Olsa @ 2024-03-05 17:30 ` Andrii Nakryiko 0 siblings, 0 replies; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-05 17:30 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 5, 2024 at 7:30 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > I'm curious about the performance of going > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > being equal). > > > > > > > > I have a simple test [1] comparing: > > > > - uprobe with 2 traps > > > > - uprobe with 1 trap > > > > - syscall executing uprobe > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > the time it took > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. 
> > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > test_bench_1: syscalls in 17.872s > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > #395 uprobe_syscall_bench:OK > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > noticed this: > > > > > > +/* > > > + * Assuming following prolog: > > > + * > > > + * 6984ac: 55 push %rbp > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > + */ > > > +noinline void uprobe2_bench_trigger(void) > > > +{ > > > + asm volatile (""); > > > +} > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > RELEASE=1 for selftests): > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > 5a0ce0: c3 retq > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > So be careful with that. > > > > right, I did not mean for this to be checked in, just wanted to get the > > numbers quickly > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > do you mind adding your syscall-based one as another one there and > > > running all of them and sharing the numbers with us? Very curious to > > > see both absolute and relative numbers from that benchmark. (and > > > please do build with RELEASE=1) > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > forget to add your syscall-based benchmark to the list of benchmarks > > > in that shell script). 
> > > > yes, saw it and was going to run/compare it.. it's good idea to add > > the syscall one and get all numbers together, will do that > > seems to be consistent with my previous test: > > base : 15.854 ± 0.007M/s > uprobe-nop : 2.859 ± 0.007M/s > uprobe-push : 2.697 ± 0.002M/s > uprobe-ret : 1.081 ± 0.000M/s > uprobe-syscall : 5.520 ± 0.006M/s > uretprobe-nop : 1.422 ± 0.002M/s > uretprobe-push : 1.396 ± 0.002M/s > uretprobe-ret : 0.787 ± 0.000M/s > uretprobe-syscall: 1.888 ± 0.002M/s > > syscall uprobe is ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobe > great, thanks a lot for the numbers! It's good that we have comparable benchmark numbers now. > uretprobe is bit more tricky to compare, the speed up is there for the initial > uprobe hit, then there's again the trap from the uretprobe trampoline yep, makes sense > > I have the bench changes in here [1], I'll send it out together with rfc post > > jirka > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 > > > > > > > > > Thank you! > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > int3 can be pretty easily optimized away with syscall, given that the > > > kernel controls code generation there. If anything, it will get the > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > thought. > > > > hm, right.. the trampoline is there already, but at the moment is global > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > to the original return address.. 
maybe we can do that through syscall > > as well > > > > or we could add jump back to uretprobe's original return addrress to the > > trampoline, but then we need special trampoline for each uretprobe, > > I'll check > > > > thanks, > > jirka > > > > > > > > > > > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/ > > > > > > > jirka > > > > > > > > > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench > > > > > > > > > > > > > > > Certainly necessary to have a benchmark. > > > > > > selftests/bpf/bench has one for uprobe. > > > > > > Probably should extend with sys_bpf. > > > > > > > > > > > > Regarding: > > > > > > > replace the normal uprobe trap instruction with jump to > > > > > > user space trampoline > > > > > > > > > > > > it should probably be a call to trampoline instead of a jump. > > > > > > Unless you plan to generate a different trampoline for every location ? > > > > > > > > > > > > Also how would you pick a space for a trampoline in the target process ? > > > > > > Analyze /proc/pid/maps and look for gaps in executable sections? > > > > > > > > > > kernel already does that for uretprobes, it adds a new "[uprobes]" > > > > > memory mapping, so this part is already implemented > > > > > > > > > > > > > > > > > We can start simple with a USDT that uses nop5 instead of nop1 > > > > > > and explicit single trampoline for all USDT locations > > > > > > that saves all (callee and caller saved) registers and > > > > > > then does sys_bpf with a new cmd. > > > > > > > > > > > > To replace nop5 with a call to trampoline we can use text_poke_bp > > > > > > approach: replace 1st byte with int3, replace 2-5 with target addr, > > > > > > replace 1st byte to make an actual call insn. > > > > > > > > > > > > Once patched there will be no simulation of insns or kernel traps. 
> > > > > > Just normal user code that calls into trampoline, that calls sys_bpf, > > > > > > and returns back. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 8:24 ` Jiri Olsa 2024-03-05 15:30 ` Jiri Olsa @ 2024-03-11 10:59 ` Jiri Olsa 2024-03-11 15:06 ` Oleg Nesterov 2024-03-11 17:32 ` Andrii Nakryiko 1 sibling, 2 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 10:59 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > some numbers by the time the conference starts? 
This should inform the > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > right, will do that > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > I'm curious about the performance of going > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > being equal). > > > > > > I have a simple test [1] comparing: > > > - uprobe with 2 traps > > > - uprobe with 1 trap > > > - syscall executing uprobe > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > its consumers, which should be comparable to what the trampoline will do > > > > > > test does same amount of loops triggering each uprobe type and measures > > > the time it took > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. 
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > test_bench_1: uprobes (1 trap) in 36.439s > > > test_bench_1: uprobes (2 trap) in 91.960s > > > test_bench_1: syscalls in 17.872s > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > #395 uprobe_syscall_bench:OK > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > and ~5x faster than 2 traps uprobe > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > noticed this: > > > > +/* > > + * Assuming following prolog: > > + * > > + * 6984ac: 55 push %rbp > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > + */ > > +noinline void uprobe2_bench_trigger(void) > > +{ > > + asm volatile (""); > > +} > > > > This actually will be optimized out to just ret in -O2 mode (make > > RELEASE=1 for selftests): > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > 5a0ce0: c3 retq > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > So be careful with that. > > right, I did not mean for this to be checked in, just wanted to get the > numbers quickly > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > do you mind adding your syscall-based one as another one there and > > running all of them and sharing the numbers with us? Very curious to > > see both absolute and relative numbers from that benchmark. (and > > please do build with RELEASE=1) > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > forget to add your syscall-based benchmark to the list of benchmarks > > in that shell script). > > yes, saw it and was going to run/compare it.. 
it's good idea to add > the syscall one and get all numbers together, will do that > > > > > Thank you! > > > > > > BTW, while I think patching multiple instructions for syscall-based > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > int3 can be pretty easily optimized away with syscall, given that the > > kernel controls code generation there. If anything, it will get the > > uretprobe case a bit closer to the performance of uprobe. Give it some > > thought. > > hm, right.. the trampoline is there already, but at the moment is global > and used by all uretprobes.. and int3 code moves userspace (changes rip) > to the original return address.. maybe we can do that through syscall > as well

it seems like a good idea, I tried the change below (use syscall on the
return trampoline) and got some speedup:

current:

  base           :   15.817 ± 0.009M/s
  uprobe-nop     :    2.901 ± 0.000M/s
  uprobe-push    :    2.743 ± 0.002M/s
  uprobe-ret     :    1.089 ± 0.001M/s
  uretprobe-nop  :    1.448 ± 0.001M/s
  uretprobe-push :    1.407 ± 0.001M/s
  uretprobe-ret  :    0.792 ± 0.001M/s

with syscall:

  base           :   15.831 ± 0.026M/s
  uprobe-nop     :    2.904 ± 0.001M/s
  uprobe-push    :    2.764 ± 0.002M/s
  uprobe-ret     :    1.082 ± 0.001M/s
  uretprobe-nop  :    1.785 ± 0.000M/s
  uretprobe-push :    1.733 ± 0.001M/s
  uretprobe-ret  :    0.885 ± 0.004M/s

~23% for nop/push (emulated) cases, ~11% for ret (sstep) case

jirka

---
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uprobe			sys_uprobe

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..fceef2b4e243 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,11 +12,13 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 #include <linux/kdebug.h>

 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/syscalls.h>

 /* Post-execution fixups. */
@@ -308,6 +310,53 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }

 #ifdef CONFIG_X86_64
+
+asm (
+	".pushsection .rodata\n"
+	".global uretprobe_syscall_entry\n"
+	"uretprobe_syscall_entry:\n"
+	"push %rax\n"
+	"mov $462, %rax\n"
+	"syscall\n"
+	".global uretprobe_syscall_end\n"
+	"uretprobe_syscall_end:\n"
+	".popsection\n"
+);
+
+extern __visible u8 uretprobe_syscall_entry[];
+extern __visible u8 uretprobe_syscall_end[];
+
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize)
+{
+	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+	return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE1(uprobe, unsigned long, cmd)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long ax, err;
+
+	/*
+	 * We get invoked from the trampoline that pushed rax
+	 * value on stack, read and restore the value.
+	 */
+	err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax));
+	WARN_ON_ONCE(err);
+
+	regs->ax = ax;
+	regs->orig_ax = ax;
+
+	/*
+	 * And pop the stack back, because we jump to original
+	 * return value.
+	 */
+	regs->sp = regs->sp + sizeof(regs->sp);
+
+	uprobe_handle_trampoline(regs);
+	return ax;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..4f7d5b41b718 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,6 +987,8 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
 asmlinkage long sys_spu_create(const char __user *name,
 		unsigned int flags, umode_t mode, int fd);

+asmlinkage long sys_uprobe(unsigned long cmd);
+
 /*
  * Deprecated system calls which are still defined in
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..2702799648e6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)

+#define __NR_uprobe 462
+__SYSCALL(__NR_uprobe, sys_uprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463

 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..fefaf4804e1f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,10 +1474,19 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }

+uprobe_opcode_t* __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
+	uprobe_opcode_t *insns;
 	struct xol_area *area;

 	area = kmalloc(sizeof(*area), GFP_KERNEL);
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);

 	if (!xol_add_vma(mm, area))
 		return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }

-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
 	bp_vaddr = uprobe_get_swbp_addr(regs);

 	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+		return uprobe_handle_trampoline(regs);

 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..ddc954f28317 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);

 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uprobe);

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 10:59 ` Jiri Olsa @ 2024-03-11 15:06 ` Oleg Nesterov 2024-03-11 16:46 ` Jiri Olsa 2024-03-11 17:32 ` Andrii Nakryiko 1 sibling, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2024-03-11 15:06 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann I forgot everything about the low-level x86_64 code, but... On 03/11, Jiri Olsa wrote: > > #ifdef CONFIG_X86_64 > + > +asm ( > + ".pushsection .rodata\n" > + ".global uretprobe_syscall_entry\n" > + "uretprobe_syscall_entry:\n" > + "push %rax\n" > + "mov $462, %rax\n" > + "syscall\n" Hmm... I think you need to save/restore more registers clobbered by syscall/entry_SYSCALL_64 ? > +SYSCALL_DEFINE1(uprobe, unsigned long, cmd) > +{ > + struct pt_regs *regs = task_pt_regs(current); > + unsigned long ax, err; > + > + /* > + * We get invoked from the trampoline that pushed rax > + * value on stack, read and restore the value. > + */ > + err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax)); > + WARN_ON_ONCE(err); > + > + regs->ax = ax; probably not strictly needed, we are going to return ax... > + regs->orig_ax = ax; This doesn't look right. I think you need regs->orig_ax = -1; Say, to avoid the "Did we come from a system call" checks in arch_do_signal_or_restart() or handle_signal(). Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 15:06 ` Oleg Nesterov @ 2024-03-11 16:46 ` Jiri Olsa 2024-03-11 17:02 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 16:46 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > I forgot everything about the low-level x86_64 code, but... > > On 03/11, Jiri Olsa wrote: > > > > #ifdef CONFIG_X86_64 > > + > > +asm ( > > + ".pushsection .rodata\n" > > + ".global uretprobe_syscall_entry\n" > > + "uretprobe_syscall_entry:\n" > > + "push %rax\n" > > + "mov $462, %rax\n" > > + "syscall\n" > > Hmm... I think you need to save/restore more registers clobbered by > syscall/entry_SYSCALL_64 ? hum, so the call happens on the function call return, so I thought we should just preserve callee saved registers which seems to be taken care of by the entry_SYSCALL_64 path.. I will double check > > > +SYSCALL_DEFINE1(uprobe, unsigned long, cmd) > > +{ > > + struct pt_regs *regs = task_pt_regs(current); > > + unsigned long ax, err; > > + > > + /* > > + * We get invoked from the trampoline that pushed rax > > + * value on stack, read and restore the value. > > + */ > > + err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax)); > > + WARN_ON_ONCE(err); > > + > > + regs->ax = ax; > > probably not strictly needed, we are going to return ax... it needs to be there for the bpf program to read proper return value from regs > > > + regs->orig_ax = ax; > > This doesn't look right. I think you need > > regs->orig_ax = -1; > > Say, to avoid the "Did we come from a system call" checks in > arch_do_signal_or_restart() or handle_signal(). ugh right that's probably wrong, I need check on that thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 16:46 ` Jiri Olsa @ 2024-03-11 17:02 ` Oleg Nesterov 2024-03-11 21:11 ` Jiri Olsa 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2024-03-11 17:02 UTC (permalink / raw) To: Jiri Olsa Cc: Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On 03/11, Jiri Olsa wrote: > > On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > > I forgot everything about the low-level x86_64 code, but... > > > > On 03/11, Jiri Olsa wrote: > > > > > > #ifdef CONFIG_X86_64 > > > + > > > +asm ( > > > + ".pushsection .rodata\n" > > > + ".global uretprobe_syscall_entry\n" > > > + "uretprobe_syscall_entry:\n" > > > + "push %rax\n" > > > + "mov $462, %rax\n" > > > + "syscall\n" > > > > Hmm... I think you need to save/restore more registers clobbered by > > syscall/entry_SYSCALL_64 ? > > hum, so the call happens on the function call return, so I thought > we should just preserve callee saved registers which seems to be > taken care of by the entry_SYSCALL_64 path.. Yes, but we do not know if the (ret)probed function obeys the C-calling convention, perhaps it is low level asm code or not a C function. > I will double check but I won't insist if you think we do not care. > > > + > > > + regs->ax = ax; > > > > probably not strictly needed, we are going to return ax... > > it needs to be there for the bpf program to read proper return > value from regs OK, I see, thanks. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 17:02 ` Oleg Nesterov @ 2024-03-11 21:11 ` Jiri Olsa 0 siblings, 0 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 21:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Andrii Nakryiko, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On Mon, Mar 11, 2024 at 06:02:31PM +0100, Oleg Nesterov wrote: > On 03/11, Jiri Olsa wrote: > > > > On Mon, Mar 11, 2024 at 04:06:59PM +0100, Oleg Nesterov wrote: > > > I forgot everything about the low-level x86_64 code, but... > > > > > > On 03/11, Jiri Olsa wrote: > > > > > > > > #ifdef CONFIG_X86_64 > > > > + > > > > +asm ( > > > > + ".pushsection .rodata\n" > > > > + ".global uretprobe_syscall_entry\n" > > > > + "uretprobe_syscall_entry:\n" > > > > + "push %rax\n" > > > > + "mov $462, %rax\n" > > > > + "syscall\n" > > > > > > Hmm... I think you need to save/restore more registers clobbered by > > > syscall/entry_SYSCALL_64 ? > > > > hum, so the call happens on the function call return, so I thought > > we should just preserve callee saved registers which seems to be > > taken care of by the entry_SYSCALL_64 path.. > > Yes, but we do not know if the (ret)probed function obeys the C-calling > convention, perhaps it is low level asm code or not a C function. ah right.. I think we need to make sure all is saved/restored thanks, jirka > > > I will double check > > but I won't insist if you think we do not care. > > > > > + > > > > + regs->ax = ax; > > > > > > probably not strictly needed, we are going to return ax... > > > > it needs to be there for the bpf program to read proper return > > value from regs > > OK, I see, thanks. > > Oleg. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 10:59 ` Jiri Olsa 2024-03-11 15:06 ` Oleg Nesterov @ 2024-03-11 17:32 ` Andrii Nakryiko 2024-03-11 21:26 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-11 17:32 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > trap on top of that. > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > int3 interrupt handling? 
If not, do you think you'll be able to get > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > I'm curious about the performance of going > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > being equal). > > > > > > > > I have a simple test [1] comparing: > > > > - uprobe with 2 traps > > > > - uprobe with 1 trap > > > > - syscall executing uprobe > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > the time it took > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. 
> > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > test_bench_1: syscalls in 17.872s > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > #395 uprobe_syscall_bench:OK > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > noticed this: > > > > > > +/* > > > + * Assuming following prolog: > > > + * > > > + * 6984ac: 55 push %rbp > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > + */ > > > +noinline void uprobe2_bench_trigger(void) > > > +{ > > > + asm volatile (""); > > > +} > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > RELEASE=1 for selftests): > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > 5a0ce0: c3 retq > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > So be careful with that. > > > > right, I did not mean for this to be checked in, just wanted to get the > > numbers quickly > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > do you mind adding your syscall-based one as another one there and > > > running all of them and sharing the numbers with us? Very curious to > > > see both absolute and relative numbers from that benchmark. (and > > > please do build with RELEASE=1) > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > forget to add your syscall-based benchmark to the list of benchmarks > > > in that shell script). 
> > > > yes, saw it and was going to run/compare it.. it's good idea to add > > the syscall one and get all numbers together, will do that > > > > > > > > Thank you! > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > int3 can be pretty easily optimized away with syscall, given that the > > > kernel controls code generation there. If anything, it will get the > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > thought. > > > > hm, right.. the trampoline is there already, but at the moment is global > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > to the original return address.. maybe we can do that through syscall > > as well > > it seems like good idea, I tried change below (use syscall on return > trampoline) and got some speedup: > > current: > base : 15.817 ± 0.009M/s > uprobe-nop : 2.901 ± 0.000M/s > uprobe-push : 2.743 ± 0.002M/s > uprobe-ret : 1.089 ± 0.001M/s > uretprobe-nop : 1.448 ± 0.001M/s > uretprobe-push : 1.407 ± 0.001M/s > uretprobe-ret : 0.792 ± 0.001M/s > > with syscall: > base : 15.831 ± 0.026M/s > uprobe-nop : 2.904 ± 0.001M/s > uprobe-push : 2.764 ± 0.002M/s > uprobe-ret : 1.082 ± 0.001M/s > uretprobe-nop : 1.785 ± 0.000M/s > uretprobe-push : 1.733 ± 0.001M/s > uretprobe-ret : 0.885 ± 0.004M/s > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > jirka heh, I tried this as well over weekend, though I cut few more corners (see diff below, I didn't add saving/restoring rax, though that would be required, of course). 
My test machine is (way) slower, though, so I got a slightly different numbers (up to 15%): ### baseline uprobe-base : 79.462 ± 0.058M/s base : 2.920 ± 0.004M/s uprobe-nop : 1.093 ± 0.001M/s uprobe-push : 1.066 ± 0.001M/s uprobe-ret : 0.480 ± 0.001M/s uretprobe-nop : 0.555 ± 0.000M/s uretprobe-push : 0.549 ± 0.000M/s uretprobe-ret : 0.338 ± 0.000M/s ### uretprobe syscall (vs baseline) uprobe-base : 79.488 ± 0.033M/s base : 2.917 ± 0.003M/s uprobe-nop : 1.095 ± 0.001M/s uprobe-push : 1.058 ± 0.000M/s uprobe-ret : 0.483 ± 0.000M/s uretprobe-nop : 0.638 ± 0.000M/s (+15%) uretprobe-push : 0.627 ± 0.000M/s (+14.2%) uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) Either way, yes, we should implement this. Are you planning to send an official patch some time soon? I'm working on other small improvements in uprobe/uretprobe, I'll probably send the first patches today/tomorrow, but they shouldn't interfere with this uretprobe code path. See also a few comments on your code below. My hacky changes: commit dcf59baa5ad8ea4edb86ff1558eb1dcc28fcc7c0 Author: Andrii Nakryiko <andrii@kernel.org> Date: Sat Mar 9 20:16:02 2024 -0800 [WIP] uprobes: implement uretprobe through syscall Signed-off-by: Andrii Nakryiko <andrii@kernel.org> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 5f8591ce7f25..0207dfb5018d 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -466,3 +466,4 @@ 459 i386 lsm_get_self_attr sys_lsm_get_self_attr 460 i386 lsm_set_self_attr sys_lsm_set_self_attr 461 i386 lsm_list_modules sys_lsm_list_modules +462 i386 trace sys_trace diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 7e8d46f4147f..c16599e15bbb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -383,6 +383,7 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common 
lsm_list_modules sys_lsm_list_modules +462 common trace sys_trace # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 77eb9b0e7685..f2863c9fab3d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -965,6 +965,8 @@ asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx *ctx, size_t size, __u32 flags); asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags); +asmlinkage long sys_trace(int cmd); + /* * Architecture-specific system calls */ diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 75f00965ab15..2f66eb8b068e 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) #define __NR_lsm_list_modules 461 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) +#define __NR_trace 462 +__SYSCALL(__NR_TRACE, sys_trace) + #undef __NR_syscalls -#define __NR_syscalls 462 +#define __NR_syscalls 463 /* * 32 bit systems traditionally used different diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 56a460719628..c6fc15bdbffc 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -28,6 +28,7 @@ #include <linux/khugepaged.h> #include <linux/uprobes.h> +#include <linux/syscalls.h> #define UINSNS_PER_PAGE (PAGE_SIZE/UPROBE_XOL_SLOT_BYTES) #define MAX_UPROBE_XOL_SLOTS UINSNS_PER_PAGE @@ -1488,8 +1489,15 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; - uprobe_opcode_t insn = UPROBE_SWBP_INSN; + //uprobe_opcode_t insn = UPROBE_SWBP_INSN; struct xol_area *area; + char uret_syscall_patch[] = { + /* mov rax, __NR_trace */ + 0x48, 0xC7, 0xC0, 0xCE, 0x01, 0x00, 0x00, + /* syscall */ + 0x0F, 0x05, + }; + const int URET_SYSCALL_INSNS_SIZE = 
ARRAY_SIZE(uret_syscall_patch); area = kmalloc(sizeof(*area), GFP_KERNEL); if (unlikely(!area)) @@ -1513,7 +1521,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) /* Reserve the 1st slot for get_trampoline_vaddr() */ set_bit(0, area->bitmap); atomic_set(&area->slot_count, 1); - arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE); + arch_uprobe_copy_ixol(area->pages[0], 0, &uret_syscall_patch, URET_SYSCALL_INSNS_SIZE); + //arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE); if (!xol_add_vma(mm, area)) return area; @@ -2366,3 +2375,16 @@ void __init uprobes_init(void) BUG_ON(register_die_notifier(&uprobe_exception_nb)); } + +SYSCALL_DEFINE0(trace) +{ + struct pt_regs *regs = current_pt_regs(); + + //printk("BEFORE UTRACE!!!\n"); + + handle_trampoline(regs); + + //printk("AFTER UTRACE!!!\n"); + + return 0; +} diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index faad00cce269..4a3bc957dd43 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -174,6 +174,7 @@ COND_SYSCALL_COMPAT(fadvise64_64); COND_SYSCALL(lsm_get_self_attr); COND_SYSCALL(lsm_set_self_attr); COND_SYSCALL(lsm_list_modules); +COND_SYSCALL(trace); /* CONFIG_MMU only */ COND_SYSCALL(swapon); > > > --- > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 7e8d46f4147f..fa5f8a058bc2 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -383,6 +383,7 @@ > 459 common lsm_get_self_attr sys_lsm_get_self_attr > 460 common lsm_set_self_attr sys_lsm_set_self_attr > 461 common lsm_list_modules sys_lsm_list_modules > +462 64 uprobe sys_uprobe > we should call it "uretprobe", "uprobe" will be a separate thing with different logic. 
I went with generic "trace", but realized that it would be better to have separate more targeted "special/internal" syscalls (where, if necessary, extra arguments would be passed through stack to avoid storing/restoring user-space registers). We have rt_sigreturn precedent which explicitly states that userspace shouldn't use it and shouldn't rely on any specific arguments conventions. [...] > /* > * Deprecated system calls which are still defined in > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index f46e0ca0169c..9ef244c8ff19 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > void *src, unsigned long len); > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); just `void *` here? it can be a sequence of instructions now > #else /* !CONFIG_UPROBES */ > struct uprobes_state { > }; > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index 75f00965ab15..2702799648e6 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) > #define __NR_lsm_list_modules 461 > __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) > > +#define __NR_uprobe 462 > +__SYSCALL(__NR_uprobe, sys_uprobe) [...] ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 17:32 ` Andrii Nakryiko @ 2024-03-11 21:26 ` Jiri Olsa 2024-03-11 23:05 ` Andrii Nakryiko 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-11 21:26 UTC (permalink / raw) To: Andrii Nakryiko Cc: Jiri Olsa, Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote: > On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > > trap on top of that. 
> > > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > > I'm curious about the performance of going > > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > > being equal). > > > > > > > > > > I have a simple test [1] comparing: > > > > > - uprobe with 2 traps > > > > > - uprobe with 1 trap > > > > > - syscall executing uprobe > > > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > > the time it took > > > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > > bpf_testmod.ko is already unloaded. 
> > > > > Loading bpf_testmod.ko... > > > > > Successfully loaded bpf_testmod.ko. > > > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > > test_bench_1: syscalls in 17.872s > > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > > #395 uprobe_syscall_bench:OK > > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > > > > Thanks for running benchmarks! I quickly looked at the selftest and > > > > noticed this: > > > > > > > > +/* > > > > + * Assuming following prolog: > > > > + * > > > > + * 6984ac: 55 push %rbp > > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > > + */ > > > > +noinline void uprobe2_bench_trigger(void) > > > > +{ > > > > + asm volatile (""); > > > > +} > > > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > > RELEASE=1 for selftests): > > > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > > 5a0ce0: c3 retq > > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > > > So be careful with that. > > > > > > right, I did not mean for this to be checked in, just wanted to get the > > > numbers quickly > > > > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > > do you mind adding your syscall-based one as another one there and > > > > running all of them and sharing the numbers with us? Very curious to > > > > see both absolute and relative numbers from that benchmark. 
(and > > > > please do build with RELEASE=1) > > > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > > forget to add your syscall-based benchmark to the list of benchmarks > > > > in that shell script). > > > > > > yes, saw it and was going to run/compare it.. it's good idea to add > > > the syscall one and get all numbers together, will do that > > > > > > > > > > > Thank you! > > > > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > > int3 can be pretty easily optimized away with syscall, given that the > > > > kernel controls code generation there. If anything, it will get the > > > > uretprobe case a bit closer to the performance of uprobe. Give it some > > > > thought. > > > > > > hm, right.. the trampoline is there already, but at the moment is global > > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > > to the original return address.. maybe we can do that through syscall > > > as well > > > > it seems like good idea, I tried change below (use syscall on return > > trampoline) and got some speedup: > > > > current: > > base : 15.817 ± 0.009M/s > > uprobe-nop : 2.901 ± 0.000M/s > > uprobe-push : 2.743 ± 0.002M/s > > uprobe-ret : 1.089 ± 0.001M/s > > uretprobe-nop : 1.448 ± 0.001M/s > > uretprobe-push : 1.407 ± 0.001M/s > > uretprobe-ret : 0.792 ± 0.001M/s > > > > with syscall: > > base : 15.831 ± 0.026M/s > > uprobe-nop : 2.904 ± 0.001M/s > > uprobe-push : 2.764 ± 0.002M/s > > uprobe-ret : 1.082 ± 0.001M/s > > uretprobe-nop : 1.785 ± 0.000M/s > > uretprobe-push : 1.733 ± 0.001M/s > > uretprobe-ret : 0.885 ± 0.004M/s > > > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > > > jirka > > heh, I tried this as well over weekend, though I cut few more corners > (see diff below, I didn't add saving/restoring rax, though that would > be required, of course). 
My test machine is (way) slower, though, so I > got a slightly different numbers (up to 15%): nice :-) btw I just checked on another slower amd server and it's ~10% in all 3 cases, my previous results are from intel machine.. I guess the hw trap behaviour/speed makes this not proportional across archs > > ### baseline > uprobe-base : 79.462 ± 0.058M/s > base : 2.920 ± 0.004M/s > uprobe-nop : 1.093 ± 0.001M/s > uprobe-push : 1.066 ± 0.001M/s > uprobe-ret : 0.480 ± 0.001M/s > uretprobe-nop : 0.555 ± 0.000M/s > uretprobe-push : 0.549 ± 0.000M/s > uretprobe-ret : 0.338 ± 0.000M/s > > > ### uretprobe syscall (vs baseline) > uprobe-base : 79.488 ± 0.033M/s > base : 2.917 ± 0.003M/s > uprobe-nop : 1.095 ± 0.001M/s > uprobe-push : 1.058 ± 0.000M/s > uprobe-ret : 0.483 ± 0.000M/s > uretprobe-nop : 0.638 ± 0.000M/s (+15%) > uretprobe-push : 0.627 ± 0.000M/s (+14.2%) > uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) > > Either way, yes, we should implement this. Are you planning to send an > official patch some time soon? I'm working on other small improvements > in uprobe/uretprobe, I'll probably send the first patches > today/tomorrow, but they shouldn't interfere with this uretprobe code > path. yes, wanted to finish/post it this week SNIP > > --- > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > index 7e8d46f4147f..fa5f8a058bc2 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -383,6 +383,7 @@ > > 459 common lsm_get_self_attr sys_lsm_get_self_attr > > 460 common lsm_set_self_attr sys_lsm_set_self_attr > > 461 common lsm_list_modules sys_lsm_list_modules > > +462 64 uprobe sys_uprobe > > > > we should call it "uretprobe", "uprobe" will be a separate thing with > different logic. 
> > I went with generic "trace", but realized that it would be better to > have separate more targeted "special/internal" syscalls (where, if > necessary, extra arguments would be passed through stack to avoid > storing/restoring user-space registers). We have rt_sigreturn > precedent which explicitly states that userspace shouldn't use it and > shouldn't rely on any specific arguments conventions. somehow I thought of syscalls as a scarce resource and wanted to add arguments/commands to the uprobe syscalls.. but having uretprobe dedicated syscall makes things easier > > [...] > > > /* > > * Deprecated system calls which are still defined in > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > index f46e0ca0169c..9ef244c8ff19 100644 > > --- a/include/linux/uprobes.h > > +++ b/include/linux/uprobes.h > > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > > void *src, unsigned long len); > > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); > > just `void *` here? it can be a sequence of instructions now hm, it's pointer to u8, which should be fine no? is there benefit to have void* in here instead? thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-11 21:26 ` Jiri Olsa @ 2024-03-11 23:05 ` Andrii Nakryiko 0 siblings, 0 replies; 31+ messages in thread From: Andrii Nakryiko @ 2024-03-11 23:05 UTC (permalink / raw) To: Jiri Olsa Cc: Alexei Starovoitov, yunwei356, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Mon, Mar 11, 2024 at 2:26 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote: > > On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote: > > > > On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote: > > > > > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote: > > > > > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov > > > > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > > > > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > > > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > > > > > > > > two traps in worst case scenario or single trap if the original > > > > > > > > > > > instruction can be emulated. For return uprobes there's one extra > > > > > > > > > > > trap on top of that. 
> > > > > > > > > > > > > > > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > > > > > > > > user space trampoline that: > > > > > > > > > > > > > > > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > > > > > > > > > > > > > > > Did you get a chance to measure relative performance of syscall vs > > > > > > > > > > int3 interrupt handling? If not, do you think you'll be able to get > > > > > > > > > > some numbers by the time the conference starts? This should inform the > > > > > > > > > > decision whether it even makes sense to go through all the trouble. > > > > > > > > > > > > > > > > > > right, will do that > > > > > > > > > > > > > > > > I believe Yusheng measured syscall vs uprobe performance > > > > > > > > difference during LPC. iirc it was something like 3x. > > > > > > > > > > > > > > Do you have a link to slides? Was it actual uprobe vs just some fast > > > > > > > syscall (not doing BPF program execution) comparison? Or comparing the > > > > > > > performance of int3 handling vs equivalent syscall handling. > > > > > > > > > > > > > > I suspect it's the former, and so probably not that representative. > > > > > > > I'm curious about the performance of going > > > > > > > userspace->kernel->userspace through int3 vs syscall (all other things > > > > > > > being equal). 
> > > > > > > > > > > > I have a simple test [1] comparing: > > > > > > - uprobe with 2 traps > > > > > > - uprobe with 1 trap > > > > > > - syscall executing uprobe > > > > > > > > > > > > the syscall takes uprobe address as argument, finds the uprobe and executes > > > > > > its consumers, which should be comparable to what the trampoline will do > > > > > > > > > > > > test does same amount of loops triggering each uprobe type and measures > > > > > > the time it took > > > > > > > > > > > > # ./test_progs -t uprobe_syscall_bench -v > > > > > > bpf_testmod.ko is already unloaded. > > > > > > Loading bpf_testmod.ko... > > > > > > Successfully loaded bpf_testmod.ko. > > > > > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec > > > > > > test_bench_1:PASS:uprobe_bench__attach 0 nsec > > > > > > test_bench_1:PASS:uprobe1_cnt 0 nsec > > > > > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec > > > > > > test_bench_1:PASS:uprobe2_cnt 0 nsec > > > > > > test_bench_1: uprobes (1 trap) in 36.439s > > > > > > test_bench_1: uprobes (2 trap) in 91.960s > > > > > > test_bench_1: syscalls in 17.872s > > > > > > #395/1 uprobe_syscall_bench/bench_1:OK > > > > > > #395 uprobe_syscall_bench:OK > > > > > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED > > > > > > > > > > > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe > > > > > > and ~5x faster than 2 traps uprobe > > > > > > > > > > > > > > > > Thanks for running benchmarks! 
I quickly looked at the selftest and > > > > > noticed this: > > > > > > > > > > +/* > > > > > + * Assuming following prolog: > > > > > + * > > > > > + * 6984ac: 55 push %rbp > > > > > + * 6984ad: 48 89 e5 mov %rsp,%rbp > > > > > + */ > > > > > +noinline void uprobe2_bench_trigger(void) > > > > > +{ > > > > > + asm volatile (""); > > > > > +} > > > > > > > > > > This actually will be optimized out to just ret in -O2 mode (make > > > > > RELEASE=1 for selftests): > > > > > > > > > > 00000000005a0ce0 <uprobe2_bench_trigger>: > > > > > 5a0ce0: c3 retq > > > > > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) > > > > > 5a0cec: 0f 1f 40 00 nopl (%rax) > > > > > > > > > > So be careful with that. > > > > > > > > right, I did not mean for this to be checked in, just wanted to get the > > > > numbers quickly > > > > > > > > > > > > > > Also, I just updated our existing set of uprobe benchmarks (see [0]), > > > > > do you mind adding your syscall-based one as another one there and > > > > > running all of them and sharing the numbers with us? Very curious to > > > > > see both absolute and relative numbers from that benchmark. (and > > > > > please do build with RELEASE=1) > > > > > > > > > > You should be able to just run benchs/run_bench_uprobes.sh (also don't > > > > > forget to add your syscall-based benchmark to the list of benchmarks > > > > > in that shell script). > > > > > > > > yes, saw it and was going to run/compare it.. it's good idea to add > > > > the syscall one and get all numbers together, will do that > > > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > BTW, while I think patching multiple instructions for syscall-based > > > > > uprobe is going to be extremely tricky, I think at least u*ret*probe's > > > > > int3 can be pretty easily optimized away with syscall, given that the > > > > > kernel controls code generation there. If anything, it will get the > > > > > uretprobe case a bit closer to the performance of uprobe. 
Give it some > > > > > thought. > > > > > > > > hm, right.. the trampoline is there already, but at the moment is global > > > > and used by all uretprobes.. and int3 code moves userspace (changes rip) > > > > to the original return address.. maybe we can do that through syscall > > > > as well > > > > > > it seems like good idea, I tried change below (use syscall on return > > > trampoline) and got some speedup: > > > > > > current: > > > base : 15.817 ± 0.009M/s > > > uprobe-nop : 2.901 ± 0.000M/s > > > uprobe-push : 2.743 ± 0.002M/s > > > uprobe-ret : 1.089 ± 0.001M/s > > > uretprobe-nop : 1.448 ± 0.001M/s > > > uretprobe-push : 1.407 ± 0.001M/s > > > uretprobe-ret : 0.792 ± 0.001M/s > > > > > > with syscall: > > > base : 15.831 ± 0.026M/s > > > uprobe-nop : 2.904 ± 0.001M/s > > > uprobe-push : 2.764 ± 0.002M/s > > > uprobe-ret : 1.082 ± 0.001M/s > > > uretprobe-nop : 1.785 ± 0.000M/s > > > uretprobe-push : 1.733 ± 0.001M/s > > > uretprobe-ret : 0.885 ± 0.004M/s > > > > > > ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case > > > > > > jirka > > > > heh, I tried this as well over weekend, though I cut few more corners > > (see diff below, I didn't add saving/restoring rax, though that would > > be required, of course). My test machine is (way) slower, though, so I > > got a slightly different numbers (up to 15%): > > nice :-) btw I just checked on another slower amd server and it's ~10% in > all 3 cases, my previous results are from intel machine.. 
I guess the hw > trap behaviour/speed makes this not proportional across archs > > > > > ### baseline > > uprobe-base : 79.462 ± 0.058M/s > > base : 2.920 ± 0.004M/s > > uprobe-nop : 1.093 ± 0.001M/s > > uprobe-push : 1.066 ± 0.001M/s > > uprobe-ret : 0.480 ± 0.001M/s > > uretprobe-nop : 0.555 ± 0.000M/s > > uretprobe-push : 0.549 ± 0.000M/s > > uretprobe-ret : 0.338 ± 0.000M/s > > > > > > ### uretprobe syscall (vs baseline) > > uprobe-base : 79.488 ± 0.033M/s > > base : 2.917 ± 0.003M/s > > uprobe-nop : 1.095 ± 0.001M/s > > uprobe-push : 1.058 ± 0.000M/s > > uprobe-ret : 0.483 ± 0.000M/s > > uretprobe-nop : 0.638 ± 0.000M/s (+15%) > > uretprobe-push : 0.627 ± 0.000M/s (+14.2%) > > uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) > > > > Either way, yes, we should implement this. Are you planning to send an > > official patch some time soon? I'm working on other small improvements > > in uprobe/uretprobe, I'll probably send the first patches > > today/tomorrow, but they shouldn't interfere with this uretprobe code > > path. > > yes, wanted to finish/post it this week > great, looking forward, we can use these speeds up for uretprobe in our production > SNIP > > > > --- > > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > > index 7e8d46f4147f..fa5f8a058bc2 100644 > > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > > @@ -383,6 +383,7 @@ > > > 459 common lsm_get_self_attr sys_lsm_get_self_attr > > > 460 common lsm_set_self_attr sys_lsm_set_self_attr > > > 461 common lsm_list_modules sys_lsm_list_modules > > > +462 64 uprobe sys_uprobe > > > > > > > we should call it "uretprobe", "uprobe" will be a separate thing with > > different logic. 
> > > > I went with generic "trace", but realized that it would be better to > > have separate more targeted "special/internal" syscalls (where, if > > necessary, extra arguments would be passed through stack to avoid > > storing/restoring user-space registers). We have rt_sigreturn > > precedent which explicitly states that userspace shouldn't use it and > > shouldn't rely on any specific arguments conventions. > > somehow I thought of syscalls as a scarce resource and wanted to add > arguments/commands to the uprobe syscalls.. but having a dedicated > uretprobe syscall makes things easier given we are at 462 already, not sure I believe it's scarce :) but it's also a performance aspect, not using any arguments means we can avoid saving/restoring user-space registers, so it makes sense to have 2-3 dedicated uprobe/uretprobe syscalls vs one more generic one (IMO), at least from a performance POV. > > > > > [...] > > > > > /* > > > * Deprecated system calls which are still defined in > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > index f46e0ca0169c..9ef244c8ff19 100644 > > > --- a/include/linux/uprobes.h > > > +++ b/include/linux/uprobes.h > > > @@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c > > > extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); > > > extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > > > void *src, unsigned long len); > > > +extern void uprobe_handle_trampoline(struct pt_regs *regs); > > > +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize); > > > > just `void *` here? it can be a sequence of instructions now > > hm, it's pointer to u8, which should be fine no? is there benefit to > have void* in here instead? > Quick grepping initially brought up `typedef u32 uprobe_opcode_t;`, but that's for non-x86 architectures. 
I don't think it matters all that much (in my mind it's just a generated code blob, so just `void *` memory, we don't have to look at its contents). > thanks, > jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-01 17:01 ` Alexei Starovoitov 2024-03-01 17:26 ` Andrii Nakryiko @ 2024-03-02 20:46 ` Jiri Olsa 2024-03-02 21:08 ` Alexei Starovoitov 1 sibling, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-02 20:46 UTC (permalink / raw) To: Alexei Starovoitov Cc: Jiri Olsa, yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Fri, Mar 01, 2024 at 09:01:07AM -0800, Alexei Starovoitov wrote: > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote: > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > One of uprobe pain points is having slow execution that involves > > > > two traps in worst case scenario or single trap if the original > > > > instruction can be emulated. For return uprobes there's one extra > > > > trap on top of that. > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > user space trampoline that: > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > > > Did you get a chance to measure relative performance of syscall vs > > > int3 interrupt handling? If not, do you think you'll be able to get > > > some numbers by the time the conference starts? This should inform the > > > decision whether it even makes sense to go through all the trouble. > > > > right, will do that > > I believe Yusheng measured syscall vs uprobe performance > difference during LPC. iirc it was something like 3x. > Certainly necessary to have a benchmark. > selftests/bpf/bench has one for uprobe. > Probably should extend with sys_bpf. 
> ok, did not know there was uprobe benchmark, will check > Regarding: > > replace the normal uprobe trap instruction with jump to > user space trampoline > > it should probably be a call to trampoline instead of a jump. > Unless you plan to generate a different trampoline for every location ? I wanted to store the ip of the uprobe as argument for the syscall, but the call instruction will push return address on stack and we can use it to get uprobe's address.. great > > Also how would you pick a space for a trampoline in the target process ? > Analyze /proc/pid/maps and look for gaps in executable sections? As Andrii mentioned in other response there's already one page mapped as '[uprobes]' mapping, it's used as trampoline for return uprobes (contains just int3 instruction) and as buffers to hold the original instruction for the single step execution I think if we endup with just single trampoline we can just use some of the space from that page, our trampoline should not be big > > We can start simple with a USDT that uses nop5 instead of nop1 > and explicit single trampoline for all USDT locations > that saves all (callee and caller saved) registers and > then does sys_bpf with a new cmd. ah, I did not realize USDTs are like that, will check, good idea > > To replace nop5 with a call to trampoline we can use text_poke_bp > approach: replace 1st byte with int3, replace 2-5 with target addr, > replace 1st byte to make an actual call insn. I'm bit in the dark in here, but uprobe_write_opcode stores the int3 byte by allocating new page, copying the contents of the old page over and updating it with int3 byte.. then calls __replace_page to put new page in place should that be enough also for 5 bytes update? the cpu executing that exact page will page fault and get the new updated page? 
I discussed with Oleg and got this understanding, I might be wrong hm what if the cpu is just executing the address in the middle of the uprobe's original instructions and the page gets updated.. I need to check more on this ;-) > > Once patched there will be no simulation of insns or kernel traps. > Just normal user code that calls into trampoline, that calls sys_bpf, > and returns back. I saw this as generic uprobe enhancement, should it be sys_bpf syscall, not a some generic one? we will call all the uprobe's handlers/consumers thanks, jirka ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-02 20:46 ` Jiri Olsa @ 2024-03-02 21:08 ` Alexei Starovoitov 2024-03-02 21:49 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Alexei Starovoitov @ 2024-03-02 21:08 UTC (permalink / raw) To: Jiri Olsa Cc: yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Sat, Mar 2, 2024 at 12:46 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > I'm bit in the dark in here, but uprobe_write_opcode stores the int3 > byte by allocating new page, copying the contents of the old page over > and updating it with int3 byte.. then calls __replace_page to put new > page in place > > should that be enough also for 5 bytes update? the cpu executing that > exact page will page fault and get the new updated page? I discussed > with Oleg and got this understanding, I might be wrong > > hm what if the cpu is just executing the address in the middle of the > uprobe's original instructions and the page gets updated.. I need to > check more on this ;-) I suspect it's all working fine already. Only x86 is using single byte uprobe. All other archs are using 2 or 4 byte. So replacing an insn or two with a call should work. > I saw this as generic uprobe enhancement, should it be sys_bpf syscall, > not a some generic one? we will call all the uprobe's handlers/consumers yeah. If we can make all uprobes faster without relying on nop5 usdt then it's certainly better. But if "replace any insn" turns out to be too complex we can limit it to replacing nop5 or replacing simple insns in the prologue like push, mov. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-02 21:08 ` Alexei Starovoitov @ 2024-03-02 21:49 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2024-03-02 21:49 UTC (permalink / raw) To: Alexei Starovoitov Cc: Jiri Olsa, yunwei356, Andrii Nakryiko, bpf, Alexei Starovoitov, lsf-pc, Yonghong Song, Daniel Borkmann On 03/02, Alexei Starovoitov wrote: > > I suspect it's all working fine already. > Only x86 is using single byte uprobe. > All other archs are using 2 or 4 byte. Yes, so we have UPROBE_SWBP_INSN_SIZE > So replacing an insn or two with a call should work. Please note that __uprobe_register(offset) fails if !IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE) Not to mention that if "call" replaces 2 insns we have another problem: what if another consumer wants to probe the 2nd insn? but perhaps (quite possibly) I misunderstand you. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-02-29 14:39 [LSF/MM/BPF TOPIC] faster uprobes Jiri Olsa 2024-03-01 0:25 ` Andrii Nakryiko @ 2024-03-01 19:39 ` Kui-Feng Lee 2024-03-05 17:18 ` Jiri Olsa 1 sibling, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-01 19:39 UTC (permalink / raw) To: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 2/29/24 06:39, Jiri Olsa wrote: > One of uprobe pain points is having slow execution that involves > two traps in worst case scenario or single trap if the original > instruction can be emulated. For return uprobes there's one extra > trap on top of that. > > My current idea on how to make this faster is to follow the optimized > kprobes and replace the normal uprobe trap instruction with jump to > user space trampoline that: > > - executes syscall to call uprobe consumers callbacks > - executes original instructions > - jumps back to continue with the original code > > There are of course corner cases where above will have trouble or > won't work completely, like: > > - executing original instructions in the trampoline is tricky wrt > rip relative addressing > > - some instructions we can't move to trampoline at all > > - the uprobe address is on page boundary so the jump instruction to > trampoline would span across 2 pages, hence the page replace won't > be atomic, which might cause issues > > - ... ? many others I'm sure > > Still with all the limitations I think we could be able to speed up > some amount of the uprobes, which seems worth doing. Just a random idea related to this. Could we also run jit code of bpf programs in the user space to collect information instead of going back to the kernel every time? These jit code should not be able to access helpers or kfuncs, but they still can collect and aggregate data, store data in bpf maps, and change behavior of user space programs. 
> > I'd like to have the discussion on the topic and get some agreement > or directions on how this should be done. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-01 19:39 ` Kui-Feng Lee @ 2024-03-05 17:18 ` Jiri Olsa 2024-03-05 23:53 ` Song Liu 0 siblings, 1 reply; 31+ messages in thread From: Jiri Olsa @ 2024-03-05 17:18 UTC (permalink / raw) To: Kui-Feng Lee Cc: Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > One of uprobe pain points is having slow execution that involves > > two traps in worst case scenario or single trap if the original > > instruction can be emulated. For return uprobes there's one extra > > trap on top of that. > > > > My current idea on how to make this faster is to follow the optimized > > kprobes and replace the normal uprobe trap instruction with jump to > > user space trampoline that: > > > > - executes syscall to call uprobe consumers callbacks > > - executes original instructions > > - jumps back to continue with the original code > > > > There are of course corner cases where above will have trouble or > > won't work completely, like: > > > > - executing original instructions in the trampoline is tricky wrt > > rip relative addressing > > > > - some instructions we can't move to trampoline at all > > > > - the uprobe address is on page boundary so the jump instruction to > > trampoline would span across 2 pages, hence the page replace won't > > be atomic, which might cause issues > > > > - ... ? many others I'm sure > > > > Still with all the limitations I think we could be able to speed up > > some amount of the uprobes, which seems worth doing. > > Just a random idea related to this. > Could we also run jit code of bpf programs in the user space to collect > information instead of going back to the kernel every time? sorry for late reply, do you mean like ubpf? 
the scope of this change is to speed up the generic uprobe, ebpf is just one of the consumers jirka > These jit code should not be able to access helpers or kfuncs, but they > still can collect and aggregate data, store data in bpf maps, and change > behavior of user space programs. > > > > > I'd like to have the discussion on the topic and get some agreement > > or directions on how this should be done. > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 17:18 ` Jiri Olsa @ 2024-03-05 23:53 ` Song Liu 2024-03-07 9:15 ` Jiri Olsa 2024-03-07 23:02 ` Kui-Feng Lee 0 siblings, 2 replies; 31+ messages in thread From: Song Liu @ 2024-03-05 23:53 UTC (permalink / raw) To: Jiri Olsa Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > > One of uprobe pain points is having slow execution that involves > > > two traps in worst case scenario or single trap if the original > > > instruction can be emulated. For return uprobes there's one extra > > > trap on top of that. > > > > > > My current idea on how to make this faster is to follow the optimized > > > kprobes and replace the normal uprobe trap instruction with jump to > > > user space trampoline that: > > > > > > - executes syscall to call uprobe consumers callbacks > > > - executes original instructions > > > - jumps back to continue with the original code > > > > > > There are of course corner cases where above will have trouble or > > > won't work completely, like: > > > > > > - executing original instructions in the trampoline is tricky wrt > > > rip relative addressing > > > > > > - some instructions we can't move to trampoline at all > > > > > > - the uprobe address is on page boundary so the jump instruction to > > > trampoline would span across 2 pages, hence the page replace won't > > > be atomic, which might cause issues > > > > > > - ... ? many others I'm sure > > > > > > Still with all the limitations I think we could be able to speed up > > > some amount of the uprobes, which seems worth doing. > > > > Just a random idea related to this. 
> > Could we also run jit code of bpf programs in the user space to collect > > information instead of going back to the kernel every time? I was thinking about a similar idea. I guess these user space BPF programs will have limited features; we can probably use them to update bpf maps. For this limited scope, we still need bpf_arena. Otherwise, the user space bpf program will need to update the bpf maps with sys_bpf(), which adds the same overhead as triggering the program with a syscall. > > sorry for late reply, do you mean like ubpf? the scope of this change > is to speed up the generic uprobe, ebpf is just one of the consumers I guess this means we need a new syscall? Thanks, Song ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 23:53 ` Song Liu @ 2024-03-07 9:15 ` Jiri Olsa 2024-03-07 23:02 ` Kui-Feng Lee 1 sibling, 0 replies; 31+ messages in thread From: Jiri Olsa @ 2024-03-07 9:15 UTC (permalink / raw) To: Song Liu Cc: Jiri Olsa, Kui-Feng Lee, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 05, 2024 at 03:53:35PM -0800, Song Liu wrote: > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > > > > > > > > > > > > On 2/29/24 06:39, Jiri Olsa wrote: > > > > One of uprobe pain points is having slow execution that involves > > > > two traps in worst case scenario or single trap if the original > > > > instruction can be emulated. For return uprobes there's one extra > > > > trap on top of that. > > > > > > > > My current idea on how to make this faster is to follow the optimized > > > > kprobes and replace the normal uprobe trap instruction with jump to > > > > user space trampoline that: > > > > > > > > - executes syscall to call uprobe consumers callbacks > > > > - executes original instructions > > > > - jumps back to continue with the original code > > > > > > > > There are of course corner cases where above will have trouble or > > > > won't work completely, like: > > > > > > > > - executing original instructions in the trampoline is tricky wrt > > > > rip relative addressing > > > > > > > > - some instructions we can't move to trampoline at all > > > > > > > > - the uprobe address is on page boundary so the jump instruction to > > > > trampoline would span across 2 pages, hence the page replace won't > > > > be atomic, which might cause issues > > > > > > > > - ... ? many others I'm sure > > > > > > > > Still with all the limitations I think we could be able to speed up > > > > some amount of the uprobes, which seems worth doing. > > > > > > Just a random idea related to this. 
> > > Could we also run jit code of bpf programs in the user space to collect > > > information instead of going back to the kernel every time? > > I was thinking about a similar idea. I guess these user space BPF > programs will have limited features that we can probably use them > update bpf maps. For this limited scope, we still need bpf_arena. > Otherwise, the user space bpf program will need to update the bpf > maps with sys_bpf(), which adds the same overhead as triggering > the program with a syscall. > > > > > sorry for late reply, do you mean like ubpf? the scope of this change > > is to speed up the generic uprobe, ebpf is just one of the consumers > > I guess this means we need a new syscall? yes that's the idea, to replace the trap with syscall, so far I used light version of that for initial testing [1] jirka [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-05 23:53 ` Song Liu 2024-03-07 9:15 ` Jiri Olsa @ 2024-03-07 23:02 ` Kui-Feng Lee 2024-03-08 15:43 ` Andrei Matei 1 sibling, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-07 23:02 UTC (permalink / raw) To: Song Liu, Jiri Olsa Cc: bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/5/24 15:53, Song Liu wrote: > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >> >> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>> >>> >>> >>> On 2/29/24 06:39, Jiri Olsa wrote: >>>> One of uprobe pain points is having slow execution that involves >>>> two traps in worst case scenario or single trap if the original >>>> instruction can be emulated. For return uprobes there's one extra >>>> trap on top of that. >>>> >>>> My current idea on how to make this faster is to follow the optimized >>>> kprobes and replace the normal uprobe trap instruction with jump to >>>> user space trampoline that: >>>> >>>> - executes syscall to call uprobe consumers callbacks >>>> - executes original instructions >>>> - jumps back to continue with the original code >>>> >>>> There are of course corner cases where above will have trouble or >>>> won't work completely, like: >>>> >>>> - executing original instructions in the trampoline is tricky wrt >>>> rip relative addressing >>>> >>>> - some instructions we can't move to trampoline at all >>>> >>>> - the uprobe address is on page boundary so the jump instruction to >>>> trampoline would span across 2 pages, hence the page replace won't >>>> be atomic, which might cause issues >>>> >>>> - ... ? many others I'm sure >>>> >>>> Still with all the limitations I think we could be able to speed up >>>> some amount of the uprobes, which seems worth doing. >>> >>> Just a random idea related to this. 
>>> Could we also run jit code of bpf programs in the user space to collect >>> information instead of going back to the kernel every time? > > I was thinking about a similar idea. I guess these user space BPF > programs will have limited features that we can probably use them > update bpf maps. For this limited scope, we still need bpf_arena. > Otherwise, the user space bpf program will need to update the bpf > maps with sys_bpf(), which adds the same overhead as triggering That is true. However, even without bpf_arena, it still works with some workarounds without going through sys_bpf(). > the program with a syscall. > >> >> sorry for late reply, do you mean like ubpf? the scope of this change >> is to speed up the generic uprobe, ebpf is just one of the consumers > > I guess this means we need a new syscall? > > Thanks, > Song ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-07 23:02 ` Kui-Feng Lee @ 2024-03-08 15:43 ` Andrei Matei 2024-03-12 17:16 ` Kui-Feng Lee 0 siblings, 1 reply; 31+ messages in thread From: Andrei Matei @ 2024-03-08 15:43 UTC (permalink / raw) To: Kui-Feng Lee Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 3/5/24 15:53, Song Liu wrote: > > On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > >> > >> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > >>> > >>> > >>> > >>> On 2/29/24 06:39, Jiri Olsa wrote: > >>>> One of uprobe pain points is having slow execution that involves > >>>> two traps in worst case scenario or single trap if the original > >>>> instruction can be emulated. For return uprobes there's one extra > >>>> trap on top of that. > >>>> > >>>> My current idea on how to make this faster is to follow the optimized > >>>> kprobes and replace the normal uprobe trap instruction with jump to > >>>> user space trampoline that: > >>>> > >>>> - executes syscall to call uprobe consumers callbacks > >>>> - executes original instructions > >>>> - jumps back to continue with the original code > >>>> > >>>> There are of course corner cases where above will have trouble or > >>>> won't work completely, like: > >>>> > >>>> - executing original instructions in the trampoline is tricky wrt > >>>> rip relative addressing > >>>> > >>>> - some instructions we can't move to trampoline at all > >>>> > >>>> - the uprobe address is on page boundary so the jump instruction to > >>>> trampoline would span across 2 pages, hence the page replace won't > >>>> be atomic, which might cause issues > >>>> > >>>> - ... ? many others I'm sure > >>>> > >>>> Still with all the limitations I think we could be able to speed up > >>>> some amount of the uprobes, which seems worth doing. 
> >>> > >>> Just a random idea related to this. > >>> Could we also run jit code of bpf programs in the user space to collect > >>> information instead of going back to the kernel every time? > > > > I was thinking about a similar idea. I guess these user space BPF > > programs will have limited features that we can probably use them > > update bpf maps. For this limited scope, we still need bpf_arena. > > Otherwise, the user space bpf program will need to update the bpf > > maps with sys_bpf(), which adds the same overhead as triggering > > That is true. However, even without bpf_arena, it still works with > some workarounds without going through sys_bpf(). Anything making uprobes faster would be very welcomed for my project. The biggest performance problem for us is the cost of bpf_probe_read_user() relative to raw memory access. Every call to this helper walks the process' page table to check that the access would not cause a fault (I think); this is very slow. I wonder if there's some other option that would keep the safety requirement for the memory access -- I'm imagining an optimistic mode where the raw access is performed (in the target process' memory space) and, in the rare case when a fault happens, the kernel would somehow recover from the fault and fail the bpf_probe_read_user() helper. Would something like that be technically feasible / has there been any prior interest in faster access to user memory? A more limited option that might be helpful would be a vectorized version of bpf_probe_read_user() that verifies many pointers at once. > > > the program with a syscall. > > > >> > >> sorry for late reply, do you mean like ubpf? the scope of this change > >> is to speed up the generic uprobe, ebpf is just one of the consumers > > > > I guess this means we need a new syscall? > > > > Thanks, > > Song > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-08 15:43 ` Andrei Matei @ 2024-03-12 17:16 ` Kui-Feng Lee 2024-03-13 1:32 ` Andrei Matei 0 siblings, 1 reply; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-12 17:16 UTC (permalink / raw) To: Andrei Matei Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/8/24 07:43, Andrei Matei wrote: > On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 3/5/24 15:53, Song Liu wrote: >>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >>>> >>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>>>> >>>>> >>>>> >>>>> On 2/29/24 06:39, Jiri Olsa wrote: >>>>>> One of uprobe pain points is having slow execution that involves >>>>>> two traps in worst case scenario or single trap if the original >>>>>> instruction can be emulated. For return uprobes there's one extra >>>>>> trap on top of that. >>>>>> >>>>>> My current idea on how to make this faster is to follow the optimized >>>>>> kprobes and replace the normal uprobe trap instruction with jump to >>>>>> user space trampoline that: >>>>>> >>>>>> - executes syscall to call uprobe consumers callbacks >>>>>> - executes original instructions >>>>>> - jumps back to continue with the original code >>>>>> >>>>>> There are of course corner cases where above will have trouble or >>>>>> won't work completely, like: >>>>>> >>>>>> - executing original instructions in the trampoline is tricky wrt >>>>>> rip relative addressing >>>>>> >>>>>> - some instructions we can't move to trampoline at all >>>>>> >>>>>> - the uprobe address is on page boundary so the jump instruction to >>>>>> trampoline would span across 2 pages, hence the page replace won't >>>>>> be atomic, which might cause issues >>>>>> >>>>>> - ... ? 
many others I'm sure >>>>>> >>>>>> Still with all the limitations I think we could be able to speed up >>>>>> some amount of the uprobes, which seems worth doing. >>>>> >>>>> Just a random idea related to this. >>>>> Could we also run jit code of bpf programs in the user space to collect >>>>> information instead of going back to the kernel every time? >>> >>> I was thinking about a similar idea. I guess these user space BPF >>> programs will have limited features that we can probably use them >>> update bpf maps. For this limited scope, we still need bpf_arena. >>> Otherwise, the user space bpf program will need to update the bpf >>> maps with sys_bpf(), which adds the same overhead as triggering >> >> That is true. However, even without bpf_arena, it still works with >> some workarounds without going through sys_bpf(). > > Anything making uprobes faster would be very welcomed for my project. The > biggest performance problem for us is the cost of bpf_probe_read_user() > relative to raw memory access. Every call to this helper walks the process' "raw memory access"? Do you mean not going through any helper function, reading from a pointer directly? > page table to check that the access would not cause a fault (I think); this is > very slow. I wonder if there's some other option that would keep the safety > requirement for the memory access -- I'm imagining an optimistic mode where the > raw access is performed (in the target process' memory space) and, in the rare > case when a fault happens, the kernel would somehow recover from the fault and I am not very familiar with this part. I read the implementation of bpf_probe_read_user() a little bit. It does what you mentioned here. It would cause page faults, however, the handler will skip the instruction leaving the counter non-zero. By checking the counter, it knows the instruction is not completed, and returns an error. I am curious about what your access pattern looks like. 
Does it access a large number of small chunks of data? Or, does it access a small number of big chunks of data? > fail the bpf_probe_read_user() helper. Would something like that be technically > feasible / has there been any prior interest in faster access to user memory > > A more limited option that might be helpful would be a vectorized version of > bpf_probe_read_user() that verifies many pointers at once. > > >> >>> the program with a syscall. >>> >>>> >>>> sorry for late reply, do you mean like ubpf? the scope of this change >>>> is to speed up the generic uprobe, ebpf is just one of the consumers >>> >>> I guess this means we need a new syscall? >>> >>> Thanks, >>> Song >> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-12 17:16 ` Kui-Feng Lee @ 2024-03-13 1:32 ` Andrei Matei 2024-03-13 5:42 ` Kui-Feng Lee 0 siblings, 1 reply; 31+ messages in thread From: Andrei Matei @ 2024-03-13 1:32 UTC (permalink / raw) To: Kui-Feng Lee Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On Tue, Mar 12, 2024 at 1:16 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > > > > On 3/8/24 07:43, Andrei Matei wrote: > > On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: > >> > >> > >> > >> On 3/5/24 15:53, Song Liu wrote: > >>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: > >>>> > >>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 2/29/24 06:39, Jiri Olsa wrote: > >>>>>> One of uprobe pain points is having slow execution that involves > >>>>>> two traps in worst case scenario or single trap if the original > >>>>>> instruction can be emulated. For return uprobes there's one extra > >>>>>> trap on top of that. 
> >>>>>> > >>>>>> My current idea on how to make this faster is to follow the optimized > >>>>>> kprobes and replace the normal uprobe trap instruction with jump to > >>>>>> user space trampoline that: > >>>>>> > >>>>>> - executes syscall to call uprobe consumers callbacks > >>>>>> - executes original instructions > >>>>>> - jumps back to continue with the original code > >>>>>> > >>>>>> There are of course corner cases where above will have trouble or > >>>>>> won't work completely, like: > >>>>>> > >>>>>> - executing original instructions in the trampoline is tricky wrt > >>>>>> rip relative addressing > >>>>>> > >>>>>> - some instructions we can't move to trampoline at all > >>>>>> > >>>>>> - the uprobe address is on page boundary so the jump instruction to > >>>>>> trampoline would span across 2 pages, hence the page replace won't > >>>>>> be atomic, which might cause issues > >>>>>> > >>>>>> - ... ? many others I'm sure > >>>>>> > >>>>>> Still with all the limitations I think we could be able to speed up > >>>>>> some amount of the uprobes, which seems worth doing. > >>>>> > >>>>> Just a random idea related to this. > >>>>> Could we also run jit code of bpf programs in the user space to collect > >>>>> information instead of going back to the kernel every time? > >>> > >>> I was thinking about a similar idea. I guess these user space BPF > >>> programs will have limited features that we can probably use them > >>> update bpf maps. For this limited scope, we still need bpf_arena. > >>> Otherwise, the user space bpf program will need to update the bpf > >>> maps with sys_bpf(), which adds the same overhead as triggering > >> > >> That is true. However, even without bpf_arena, it still works with > >> some workarounds without going through sys_bpf(). > > > > Anything making uprobes faster would be very welcomed for my project. The > > biggest performance problem for us is the cost of bpf_probe_read_user() > > relative to raw memory access. 
Every call to this helper walks the process' > > "raw memory access"? Do you mean not going through any helper function, > reading from a pointer directly? Right. I recognize that, as long as bpf runs "in the kernel", one cannot simply dereference a user-space pointer since the kernel is a different virtual memory space (*). Still, I wish bpf_probe_read_user() were faster. (*) Or, is it indeed a different memory space, or is the kernel's virtual address space mapped into every process? Did this change through KPTI? I would be curious to read a good resource on what exactly it means to switch from user-space to the kernel and back, if such a thing exists. > > > page table to check that the access would not cause a fault (I think); this is > > very slow. I wonder if there's some other option that would keep the safety > > requirement for the memory access -- I'm imagining an optimistic mode where the > > raw access is performed (in the target process' memory space) and, in the rare > > case when a fault happens, the kernel would somehow recover from the fault and > > I am not very familiar with this part. I read the implementation of > bpf_probe_read_user() a little bit. It does what you mentioned here. It > would cause page faults, however, the handler will skip the instruction > leaving the counter non-zero. By checking the counter, it knows the > instruction is not completed, and returns an error. > > I am curious about what your access pattern looks like. Does it access a > large number of small chunks of data? Or, does it access a small number > of big chunks of data? My access pattern looks like a lot of small reads. Some of these reads could be done at the same time if we had a vectorized API (i.e. some of the pointers are known in advance); for others there are data dependencies (i.e. we need to dereference a pointer to know what we'll want to read next). 
Specifically, the use case is a debugger of sorts which uses BPF uprobes for poking around in the target process' memory, rather than the more traditional ptrace-based techniques (ptrace being very slow). This debugger needs to walk a lot of thread stacks by following stack pointers or by using DWARF unwind information, and then it further reads data structures from the target process' stacks and heaps, chasing pointers recursively. > > > fail the bpf_probe_read_user() helper. Would something like that be technically > > feasible / has there been any prior interest in faster access to user memory > > > > A more limited option that might be helpful would be a vectorized version of > > bpf_probe_read_user() that verifies many pointers at once. > > > > > >> > >>> the program with a syscall. > >>> > >>>> > >>>> sorry for late reply, do you mean like ubpf? the scope of this change > >>>> is to speed up the generic uprobe, ebpf is just one of the consumers > >>> > >>> I guess this means we need a new syscall? > >>> > >>> Thanks, > >>> Song > >> ^ permalink raw reply [flat|nested] 31+ messages in thread
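[Editor's note: the data-dependent access pattern Andrei describes -- read a node, follow its pointer, repeat -- can be sketched with a bounded walk over a safe-read primitive. Here plain memcpy stands in for bpf_probe_read_user(), and the node layout is invented for the example.]

```c
#include <stddef.h>
#include <string.h>

struct node {                 /* hypothetical target-process structure */
    long value;
    struct node *next;
};

/* Stand-in for bpf_probe_read_user(): in a real BPF program each
 * dereference would be one separately checked helper call. */
static int probe_read(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

/* Walk at most max_depth nodes, summing their values. Each hop is a
 * data-dependent read: the next address is only known after the
 * previous read completes, so these reads cannot be batched, and the
 * per-read fixed cost is paid once per hop. */
static long chase(const struct node *head, int max_depth)
{
    long sum = 0;
    struct node cur;

    while (head && max_depth-- > 0) {
        if (probe_read(&cur, head, sizeof cur) != 0)
            break;            /* faulting read ends the walk */
        sum += cur.value;
        head = cur.next;
    }
    return sum;
}
```

This is why a vectorized helper only covers part of the use case: the pointers known in advance can be batched, but each pointer-chase hop still costs one checked read.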
* Re: [LSF/MM/BPF TOPIC] faster uprobes 2024-03-13 1:32 ` Andrei Matei @ 2024-03-13 5:42 ` Kui-Feng Lee 0 siblings, 0 replies; 31+ messages in thread From: Kui-Feng Lee @ 2024-03-13 5:42 UTC (permalink / raw) To: Andrei Matei Cc: Song Liu, Jiri Olsa, bpf, Alexei Starovoitov, lsf-pc, Andrii Nakryiko, Yonghong Song, Oleg Nesterov, Daniel Borkmann On 3/12/24 18:32, Andrei Matei wrote: > On Tue, Mar 12, 2024 at 1:16 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >> >> >> >> On 3/8/24 07:43, Andrei Matei wrote: >>> On Thu, Mar 7, 2024 at 6:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote: >>>> >>>> >>>> >>>> On 3/5/24 15:53, Song Liu wrote: >>>>> On Tue, Mar 5, 2024 at 9:18 AM Jiri Olsa <olsajiri@gmail.com> wrote: >>>>>> >>>>>> On Fri, Mar 01, 2024 at 11:39:03AM -0800, Kui-Feng Lee wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 2/29/24 06:39, Jiri Olsa wrote: >>>>>>>> One of uprobe pain points is having slow execution that involves >>>>>>>> two traps in worst case scenario or single trap if the original >>>>>>>> instruction can be emulated. For return uprobes there's one extra >>>>>>>> trap on top of that. 
>>>>>>>> >>>>>>>> My current idea on how to make this faster is to follow the optimized >>>>>>>> kprobes and replace the normal uprobe trap instruction with jump to >>>>>>>> user space trampoline that: >>>>>>>> >>>>>>>> - executes syscall to call uprobe consumers callbacks >>>>>>>> - executes original instructions >>>>>>>> - jumps back to continue with the original code >>>>>>>> >>>>>>>> There are of course corner cases where above will have trouble or >>>>>>>> won't work completely, like: >>>>>>>> >>>>>>>> - executing original instructions in the trampoline is tricky wrt >>>>>>>> rip relative addressing >>>>>>>> >>>>>>>> - some instructions we can't move to trampoline at all >>>>>>>> >>>>>>>> - the uprobe address is on page boundary so the jump instruction to >>>>>>>> trampoline would span across 2 pages, hence the page replace won't >>>>>>>> be atomic, which might cause issues >>>>>>>> >>>>>>>> - ... ? many others I'm sure >>>>>>>> >>>>>>>> Still with all the limitations I think we could be able to speed up >>>>>>>> some amount of the uprobes, which seems worth doing. >>>>>>> >>>>>>> Just a random idea related to this. >>>>>>> Could we also run jit code of bpf programs in the user space to collect >>>>>>> information instead of going back to the kernel every time? >>>>> >>>>> I was thinking about a similar idea. I guess these user space BPF >>>>> programs will have limited features that we can probably use them >>>>> update bpf maps. For this limited scope, we still need bpf_arena. >>>>> Otherwise, the user space bpf program will need to update the bpf >>>>> maps with sys_bpf(), which adds the same overhead as triggering >>>> >>>> That is true. However, even without bpf_arena, it still works with >>>> some workarounds without going through sys_bpf(). >>> >>> Anything making uprobes faster would be very welcomed for my project. The >>> biggest performance problem for us is the cost of bpf_probe_read_user() >>> relative to raw memory access. 
Every call to this helper walks the process' >> >> "raw memory access"? Do you mean not going through any helper function, >> reading from a pointer directly? > > Right. > I recognize that, as long as bpf runs "in the kernel", one cannot simply > dereference a user-space pointer since the kernel is a different virtual memory > space (*). Still, I wish there bpf_probe_read_user() were faster. > > (*) Or, is it indeed a different memory space or is the kernel's virtual > address space mapped into every process? Did this change through KPTI? I would > be curious to read a good resource on what exactly it means to switch from > user-space to the kernel and back, if such a thing exists. FYI! This is architecture dependent. AFAIK, on x86 platforms the kernel can access user space memory directly if it is in a process/task context, but you should not rely on it. If you look into bpf_probe_read_user(), it eventually does something like "rep movsb" on x86 platforms, accessing user space memory directly with some extra checks. So the bottleneck here can be the extra checks and the memory copying. If you access small chunks like what you said below, the overhead of the checks could be expensive. > > >> > >>> page table to check that the access would not cause a fault (I think); this is >>> very slow. I wonder if there's some other option that would keep the safety >>> requirement for the memory access -- I'm imagining an optimistic mode where the >>> raw access is performed (in the target process' memory space) and, in the rare >>> case when a fault happens, the kernel would somehow recover from the fault and >> >> I am not very familiar with this part. I read the implementation of >> bpf_probe_read_user() a little bit. It does what you mentioned here. It >> would cause page faults, however, the handler will skip the instruction >> leaving the counter non-zero. By checking the counter, it knows the >> instruction is not completed, and returns an error. 
>> >> I am curious about what your access pattern looks like. Does it access a >> large number of small chunks of data? Or, does it access a small number >> of big chunks of data? > > My access pattern looks like a lot of small reads. Some of these reads could be > done at the same time if we had a vectorized API (i.e. some of the pointers are > known in advance); for others there are data dependencies (i.e. we need to > dereference a pointer to know what we'll want to read next). Specifically, the > use case is a debugger of sorts which uses BPF uprobes for poking around in the > target process' memory, rather than the more traditional ptrace-based > techniques (ptrace being very slow). This debugger needs to walk a lot of > thread stacks by following stack pointers or by using DWARF unwind information, > and then it further reads data structures from the target process' stacks and > heaps, chasing pointers recursively. A related piece of information: you may already know that bpf_probe_read_user() can fail if a page fault happens. A vectorized API probably doesn't change that; it is a limitation of non-sleepable BPF programs. Sleepable BPF programs should be able to overcome it. > > > fail the bpf_probe_read_user() helper. Would something like that be technically > > feasible / has there been any prior interest in faster access to user memory > > > > A more limited option that might be helpful would be a vectorized version of > > bpf_probe_read_user() that verifies many pointers at once. > > > > > >> > >>> the program with a syscall. > >>> > >>>> > >>>> sorry for late reply, do you mean like ubpf? the scope of this change > >>>> is to speed up the generic uprobe, ebpf is just one of the consumers > >>> > >>> I guess this means we need a new syscall? > >>> > >>> Thanks, > >>> Song > >> ^ permalink raw reply [flat|nested] 31+ messages in thread
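[Editor's note: Kui-Feng's point that the per-call checks dominate for many small reads can be made concrete with a toy model. Each checked read pays a fixed validation cost (the access_ok()/fault-setup analogue) regardless of size, so n small reads pay n checks while one batched read of the same bytes pays one. All names here are invented for illustration.]

```c
#include <stddef.h>
#include <string.h>

static unsigned long checks_done; /* counts the fixed per-call cost */

/* Toy model of a checked copy: one range check per call, then the copy.
 * (In the real helper the fixed cost is the validation and fault setup;
 * on x86 something like "rep movsb" then does the actual copying.) */
static int checked_read(void *dst, const void *src, size_t len)
{
    checks_done++;                /* fixed overhead, independent of len */
    memcpy(dst, src, len);
    return 0;
}
```

With 16 four-byte reads the fixed cost is paid 16 times; a single 64-byte read pays it once, which is the motivation for the vectorized API discussed above.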
end of thread, other threads:[~2024-03-13 5:42 UTC | newest]

Thread overview: 31+ messages
2024-02-29 14:39 [LSF/MM/BPF TOPIC] faster uprobes Jiri Olsa
2024-03-01  0:25 ` Andrii Nakryiko
2024-03-01  8:18   ` Jiri Olsa
2024-03-01 17:01     ` Alexei Starovoitov
2024-03-01 17:26       ` Andrii Nakryiko
2024-03-01 18:08         ` Yunwei 123
2024-03-03 10:20         ` Jiri Olsa
2024-03-05  0:55           ` Andrii Nakryiko
2024-03-05  8:24             ` Jiri Olsa
2024-03-05 15:30               ` Jiri Olsa
2024-03-05 17:30                 ` Andrii Nakryiko
2024-03-11 10:59                   ` Jiri Olsa
2024-03-11 15:06                     ` Oleg Nesterov
2024-03-11 16:46                       ` Jiri Olsa
2024-03-11 17:02                         ` Oleg Nesterov
2024-03-11 21:11                           ` Jiri Olsa
2024-03-11 17:32                       ` Andrii Nakryiko
2024-03-11 21:26                         ` Jiri Olsa
2024-03-11 23:05                           ` Andrii Nakryiko
2024-03-02 20:46     ` Jiri Olsa
2024-03-02 21:08       ` Alexei Starovoitov
2024-03-02 21:49         ` Oleg Nesterov
2024-03-01 19:39 ` Kui-Feng Lee
2024-03-05 17:18   ` Jiri Olsa
2024-03-05 23:53     ` Song Liu
2024-03-07  9:15       ` Jiri Olsa
2024-03-07 23:02       ` Kui-Feng Lee
2024-03-08 15:43         ` Andrei Matei
2024-03-12 17:16           ` Kui-Feng Lee
2024-03-13  1:32             ` Andrei Matei
2024-03-13  5:42               ` Kui-Feng Lee