linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH RFCv3 00/23] uprobes: Add support to optimize usdt probes on x86_64
@ 2025-03-20 11:41 Jiri Olsa
  2025-03-20 11:41 ` [PATCH RFCv3 01/23] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
                   ` (24 more replies)
  0 siblings, 25 replies; 37+ messages in thread
From: Jiri Olsa @ 2025-03-20 11:41 UTC
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: Eyal Birger, kees, bpf, linux-kernel, linux-trace-kernel, x86,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, David Laight,
	Thomas Weißschuh

hi,
this patchset adds support for optimizing usdt probes placed on top of
a 5-byte nop instruction.

The generic approach (optimizing all uprobes) is hard, because it would
have to emulate possibly multiple original instructions and deal with
the related issues. The usdt case, which stores a 5-byte nop, seems much
easier, so I'm starting with that.

The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.
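
As a rough illustration (not taken from the patches): a 5-byte nop has
the same length as a near "call rel32" (opcode 0xe8 plus a 32-bit
displacement), so a call can be written over it without touching the
following instructions. The sketch below only shows how such a call
would be encoded; the names encode_call, probe_addr and tramp_addr are
made up here, and the assumption that the call targets a trampoline
which then issues the syscall should be checked against patches 8-11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CALL_INSN_SIZE   5     /* same length as the 5-byte nop */
#define CALL_INSN_OPCODE 0xe8  /* x86 near call with rel32 operand */

/*
 * Hypothetical helper: encode "call rel32" located at probe_addr and
 * targeting tramp_addr into out[]. Returns 0 on success, -1 when the
 * target cannot be reached with a signed 32-bit displacement.
 */
static int encode_call(uint64_t probe_addr, uint64_t tramp_addr,
		       uint8_t out[CALL_INSN_SIZE])
{
	/* The displacement is relative to the end of the call instruction. */
	int64_t disp = (int64_t)(tramp_addr - (probe_addr + CALL_INSN_SIZE));
	uint32_t rel32;
	int i;

	assert(out != NULL);

	if (disp < INT32_MIN || disp > INT32_MAX)
		return -1;

	rel32 = (uint32_t)(int32_t)disp;
	out[0] = CALL_INSN_OPCODE;
	/* x86 stores the immediate in little-endian byte order. */
	for (i = 0; i < 4; i++)
		out[1 + i] = (uint8_t)(rel32 >> (8 * i));
	return 0;
}
```

The range check matters for this kind of scheme: whatever the call
targets has to live within +/-2GB of the probed address, otherwise the
5-byte form cannot be used.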

The run_bench_uprobes.sh benchmark triggers a uprobe (on top of different
original instructions) in a loop and counts how many hits happen per
second (the unit below is million loops per second).

There's a big speedup when you compare the current usdt implementation
(uprobe-nop) with the proposed one (uprobe-nop5):

current:
        usermode-count :  152.604 ± 0.044M/s
        syscall-count  :   13.359 ± 0.042M/s
-->     uprobe-nop     :    3.229 ± 0.002M/s
        uprobe-push    :    3.086 ± 0.004M/s
        uprobe-ret     :    1.114 ± 0.004M/s
        uprobe-nop5    :    1.121 ± 0.005M/s
        uretprobe-nop  :    2.145 ± 0.002M/s
        uretprobe-push :    2.070 ± 0.001M/s
        uretprobe-ret  :    0.931 ± 0.001M/s
        uretprobe-nop5 :    0.957 ± 0.001M/s

after the change:
        usermode-count :  152.448 ± 0.244M/s
        syscall-count  :   14.321 ± 0.059M/s
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-push    :    2.976 ± 0.004M/s
        uprobe-ret     :    1.068 ± 0.003M/s
-->     uprobe-nop5    :    7.038 ± 0.007M/s
        uretprobe-nop  :    2.109 ± 0.004M/s
        uretprobe-push :    2.035 ± 0.001M/s
        uretprobe-ret  :    0.908 ± 0.001M/s
        uretprobe-nop5 :    3.377 ± 0.009M/s

I see a bit more speedup on Intel (above) than on AMD. The big nop5
speedup is partly due to emulating nop5 and partly due to the optimization.

The key speedup we do this for is the USDT switch from nop to nop5:
        uprobe-nop     :    3.148 ± 0.007M/s
        uprobe-nop5    :    7.038 ± 0.007M/s
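
To put the numbers above into ratios, here is a trivial helper (the
name speedup is just for illustration) that divides two throughput
figures taken from the tables:

```c
#include <assert.h>

/* Ratio of two throughput figures, both in million loops per second. */
static double speedup(double after_mps, double before_mps)
{
	assert(before_mps > 0.0);
	return after_mps / before_mps;
}
```

With the "after the change" figures, speedup(7.038, 3.148) is about
2.24 for the nop to nop5 switch. Comparing nop5 with itself before and
after, speedup(7.038, 1.121) is about 6.28 for uprobe-nop5 and
speedup(3.377, 0.957) is about 3.53 for uretprobe-nop5.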


rfc v3 changes:
- I tried to have just a single syscall for both entry and return uprobes,
  but it turned out to be slower than having two separate syscalls,
  probably due to the extra save/restore processing we have to do for
  the argument register. I see differences like:

    2 syscalls:      uprobe-nop5    :    7.038 ± 0.007M/s
    1 syscall:       uprobe-nop5    :    6.943 ± 0.003M/s

- use instructions (nop5/int3/call) to determine the state of the
  uprobe update in the process
- removed endbr instruction from uprobe trampoline
- seccomp changes
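
For the nop5/int3/call item above, a minimal sketch of what telling the
three states apart could look like, assuming the bytes at the probe
address have already been read. The enum and function names are
hypothetical and the real checks in patch 11 may well differ; only the
opcode values (0xcc for int3, 0xe8 for call rel32 and 0f 1f 44 00 00
for the 5-byte nop) are fixed by the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP5_SIZE 5

/* Hypothetical states of a usdt probe site, named only for this sketch. */
enum probe_site_state {
	SITE_NOP5,	/* original 5-byte nop, nothing installed */
	SITE_INT3,	/* breakpoint written over the first byte */
	SITE_CALL,	/* call rel32 written over the whole nop5 */
	SITE_UNKNOWN,	/* anything else */
};

static enum probe_site_state classify_site(const uint8_t insn[NOP5_SIZE])
{
	/* nopl 0x0(%rax,%rax,1) */
	static const uint8_t nop5[NOP5_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

	assert(insn != NULL);

	if (insn[0] == 0xcc)	/* int3 */
		return SITE_INT3;
	if (insn[0] == 0xe8)	/* call rel32 */
		return SITE_CALL;
	if (memcmp(insn, nop5, NOP5_SIZE) == 0)
		return SITE_NOP5;
	return SITE_UNKNOWN;
}
```

Since all three encodings differ in the first byte, a single byte is
enough to distinguish int3 from call, while the nop5 check compares all
five bytes to avoid matching other 0x0f-prefixed instructions.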

pending todo (or follow ups):
- shadow stack fails for uprobe session setup, will fix it in the next version
- use PROCMAP_QUERY in tests
- alloc 'struct uprobes_state' for mm_struct only when needed [Andrii]

thanks,
jirka


Cc: Eyal Birger <eyal.birger@gmail.com>
Cc: kees@kernel.org
---
Jiri Olsa (23):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Move ref_ctr_offset update out of uprobe_write_opcode
      uprobes: Add uprobe_write function
      uprobes: Add nbytes argument to uprobe_write_opcode
      uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
      uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock
      uprobes/x86: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add mapping for optimized uprobe trampolines
      uprobes/x86: Add support to emulate nop5 instruction
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Reorg the uprobe_syscall test function
      selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi
      selftests/bpf: Add uprobe/usdt syscall tests
      selftests/bpf: Add hit/attach/detach race optimized uprobe test
      selftests/bpf: Add uprobe syscall sigill signal test
      selftests/bpf: Add optimized usdt variant for basic usdt test
      selftests/bpf: Add uprobe_regs_equal test
      selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
      selftests/bpf: Add 5-byte nop uprobe trigger bench
      seccomp: passthrough uprobe systemcall without filtering
      selftests/seccomp: validate uprobe syscall passes through seccomp

 arch/arm/probes/uprobes/core.c                              |   2 +-
 arch/x86/entry/syscalls/syscall_64.tbl                      |   1 +
 arch/x86/include/asm/uprobes.h                              |   7 ++
 arch/x86/kernel/uprobes.c                                   | 540 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                    |   2 +
 include/linux/uprobes.h                                     |  19 +++-
 kernel/events/uprobes.c                                     | 141 +++++++++++++++++-------
 kernel/fork.c                                               |   1 +
 kernel/seccomp.c                                            |  32 ++++--
 kernel/sys_ni.c                                             |   1 +
 tools/testing/selftests/bpf/bench.c                         |  12 +++
 tools/testing/selftests/bpf/benchs/bench_trigger.c          |  42 ++++++++
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh     |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c     | 453 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 tools/testing/selftests/bpf/prog_tests/usdt.c               |  38 ++++---
 tools/testing/selftests/bpf/progs/uprobe_syscall.c          |   4 +-
 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c |  41 ++++++-
 tools/testing/selftests/bpf/sdt.h                           |   9 +-
 tools/testing/selftests/bpf/test_kmods/bpf_testmod.c        |  11 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c               | 107 ++++++++++++++----
 20 files changed, 1338 insertions(+), 127 deletions(-)



Thread overview: 37+ messages
2025-03-20 11:41 [PATCH RFCv3 00/23] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 01/23] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 02/23] uprobes: Make copy_from_page global Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 03/23] uprobes: Move ref_ctr_offset update out of uprobe_write_opcode Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 04/23] uprobes: Add uprobe_write function Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 05/23] uprobes: Add nbytes argument to uprobe_write_opcode Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 06/23] uprobes: Add orig argument to uprobe_write and uprobe_write_opcode Jiri Olsa
2025-04-04 20:33   ` Andrii Nakryiko
2025-04-07 11:13     ` Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 07/23] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 08/23] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2025-04-04 20:33   ` Andrii Nakryiko
2025-04-07 10:58     ` Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 09/23] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 10/23] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
2025-04-04 20:33   ` Andrii Nakryiko
2025-04-07 11:07     ` Jiri Olsa
2025-04-08 20:21       ` Jiri Olsa
2025-04-09 18:19         ` Andrii Nakryiko
2025-04-11 12:18           ` Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 11/23] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 12/23] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 13/23] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 14/23] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 15/23] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 16/23] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 17/23] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 18/23] selftests/bpf: Add optimized usdt variant for basic usdt test Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 19/23] selftests/bpf: Add uprobe_regs_equal test Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 20/23] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 21/23] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 22/23] seccomp: passthrough uprobe systemcall without filtering Jiri Olsa
2025-03-20 11:41 ` [PATCH RFCv3 23/23] selftests/seccomp: validate uprobe syscall passes through seccomp Jiri Olsa
2025-03-20 12:23 ` [PATCH RFCv3 00/23] uprobes: Add support to optimize usdt probes on x86_64 Oleg Nesterov
2025-03-20 13:51   ` Jiri Olsa
2025-04-04 20:36 ` Andrii Nakryiko
2025-04-07 11:17   ` Jiri Olsa
