Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64

public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed

From: Jiri Olsa <olsajiri@gmail.com>
To: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	bpf@vger.kernel.org, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Hao Luo <haoluo@google.com>, Steven Rostedt <rostedt@goodmis.org>,
	Alan Maguire <alan.maguire@oracle.com>,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	Beau Belgrave <beaub@linux.microsoft.com>
Subject: Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
Date: Mon, 18 Nov 2024 10:52:50 +0100	[thread overview]
Message-ID: <ZzsOcgJkdnq5ywE_@krava> (raw)
In-Reply-To: <20241118170458.c825bf255c2fb93f2e6a3519@kernel.org>

On Mon, Nov 18, 2024 at 05:04:58PM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Tue,  5 Nov 2024 14:33:54 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> This looks like a great idea!
> 
> > 
> > The first benchmark shows about 68% speed up (see below). The benchmark
> > triggers usdt probe in a loop and counts how many of those happened
> > per second.
> 
> Hmm, interesting result. I'd like to compare it with user-space event,
> which is also use "write" syscall to write the pre-defined events
> in the ftrace trace buffer.
> 
> But if uprobe trampoline can run in the comparable speed, user may
> want to use uprobes because it is based on widely used usdt and
> avoid accessing tracefs from application. (The user event user
> application has to setup their events via tracefs interface)

I see, will check more details on user events

> 
> > 
> > It's still rfc state with some loose ends, but I'd be interested in
> > any feedback about the direction of this.
> 
> So does this change the usdt macro? Or it just reuse the usdt so that
> user applications does not need to be recompiled?

yes, ideally apps keeps using the current usdt macro (or switch to [1])
which would unwind to nop5 (or whatever we will end up using)

there's an issue wrt to old kernels running apps compiled with the new usdt macro,
but it's solvable as Andrii pointed out in the other reply [2]

thanks,
jirka

[1] https://github.com/libbpf/usdt
[2] https://lore.kernel.org/bpf/20241118170458.c825bf255c2fb93f2e6a3519@kernel.org/T/#m20da8469e210880cae07f01783f4a51817ffbe4d

> 
> Thank you,
> 
> > 
> > It's based on tip/perf/core with bpf-next/master merged on top of
> > that together with uprobe session patchset.
> > 
> > thanks,
> > jirka
> > 
> > 
> > current:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
> >         Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
> >         Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
> >         Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
> >         Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
> >         Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
> >         Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
> >         Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
> >         Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s
> > 
> > optimized:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
> >         Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
> >         Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
> >         Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
> >         Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
> >         Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
> >         Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s
> > 
> > ---
> > Jiri Olsa (11):
> >       uprobes: Rename arch_uretprobe_trampoline function
> >       uprobes: Make copy_from_page global
> >       uprobes: Add len argument to uprobe_write_opcode
> >       uprobes: Add data argument to uprobe_write_opcode function
> >       uprobes: Add mapping for optimized uprobe trampolines
> >       uprobes: Add uprobe syscall to speed up uprobe
> >       uprobes/x86: Add support to optimize uprobes
> >       selftests/bpf: Use 5-byte nop for x86 usdt probes
> >       selftests/bpf: Add usdt trigger bench
> >       selftests/bpf: Add uprobe/usdt optimized test
> >       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> > 
> >  arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
> >  arch/x86/include/asm/uprobes.h                            |   7 +++
> >  arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/syscalls.h                                  |   2 +
> >  include/linux/uprobes.h                                   |  25 +++++++++-
> >  kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >  kernel/fork.c                                             |   2 +
> >  kernel/sys_ni.c                                           |   1 +
> >  tools/testing/selftests/bpf/bench.c                       |   2 +
> >  tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
> >  tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
> >  tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
> >  tools/testing/selftests/bpf/sdt.h                         |   9 +++-
> >  14 files changed, 768 insertions(+), 19 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c
> 
> 
> -- 
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

     prev parent reply	other threads:[~2024-11-18  9:52 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 02/11] uprobes: Make copy_from_page global Jiri Olsa
2024-11-14 23:40   ` Andrii Nakryiko
2024-11-16 21:41     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode Jiri Olsa
2024-11-14 23:41   ` Andrii Nakryiko
2024-11-16 21:41     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function Jiri Olsa
2024-11-14 23:41   ` Andrii Nakryiko
2024-11-16 21:43     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
2024-11-05 14:23   ` Peter Zijlstra
2024-11-05 16:33     ` Jiri Olsa
2024-11-14 23:44       ` Andrii Nakryiko
2024-11-16 21:44         ` Jiri Olsa
2024-11-19  6:06           ` Andrii Nakryiko
2024-11-19  9:13             ` Peter Zijlstra
2024-11-19 15:15               ` Jiri Olsa
2024-11-21  0:07               ` Andrii Nakryiko
2024-11-21 11:53                 ` Peter Zijlstra
2024-11-21 16:02                   ` Alexei Starovoitov
2024-11-21 16:34                     ` Peter Zijlstra
2024-11-21 16:47                       ` Alexei Starovoitov
2024-11-21 19:38                         ` Mark Rutland
2024-11-14 23:44   ` Andrii Nakryiko
2024-11-16 21:44     ` Jiri Olsa
2024-11-19  6:05       ` Andrii Nakryiko
2024-11-19 15:14         ` Jiri Olsa
2024-11-21  0:10           ` Andrii Nakryiko
2024-11-05 13:34 ` [RFC perf/core 06/11] uprobes: Add uprobe syscall to speed up uprobe Jiri Olsa
2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2024-11-14 23:44   ` Andrii Nakryiko
2024-11-16 21:44     ` Jiri Olsa
2024-11-18  8:18   ` Masami Hiramatsu
2024-11-18  9:39     ` Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench Jiri Olsa
2024-11-14 23:40   ` Andrii Nakryiko
2024-11-16 21:45     ` Jiri Olsa
2024-11-19  6:08       ` Andrii Nakryiko
2024-11-05 13:34 ` [RFC bpf-next 10/11] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
2024-11-18  9:29   ` Jiri Olsa
2024-11-18 10:06   ` Mark Rutland
2024-11-19  6:13     ` Andrii Nakryiko
2024-11-21 18:18       ` Mark Rutland
2024-11-26 19:13         ` Andrii Nakryiko
2024-11-18  8:04 ` Masami Hiramatsu
2024-11-18  9:52   ` Jiri Olsa [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZzsOcgJkdnq5ywE_@krava \
    --to=olsajiri@gmail.com \
    --cc=alan.maguire@oracle.com \
    --cc=andrii@kernel.org \
    --cc=beaub@linux.microsoft.com \
    --cc=bpf@vger.kernel.org \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=songliubraving@fb.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox