From: Jiri Olsa <olsajiri@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: "Jiri Olsa" <olsajiri@gmail.com>,
"Oleg Nesterov" <oleg@redhat.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Andrii Nakryiko" <andrii@kernel.org>,
bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, x86@kernel.org,
"Song Liu" <songliubraving@fb.com>, "Yonghong Song" <yhs@fb.com>,
"John Fastabend" <john.fastabend@gmail.com>,
"Hao Luo" <haoluo@google.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Masami Hiramatsu" <mhiramat@kernel.org>,
"Alan Maguire" <alan.maguire@oracle.com>,
"David Laight" <David.Laight@aculab.com>,
"Thomas Weißschuh" <thomas@t-8ch.de>
Subject: Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
Date: Sat, 1 Mar 2025 00:18:21 +0100
Message-ID: <Z8JEPdAHkkEL4x7k@krava>
In-Reply-To: <CAEf4BzbxLMB8RJWWZjtg6NkumHHZA=vhWZfHqZBf90O=aJVC+A@mail.gmail.com>

On Fri, Feb 28, 2025 at 03:00:22PM -0800, Andrii Nakryiko wrote:
> On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> > > On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Putting together all the previously added pieces to support optimized
> > > > uprobes on top of 5-byte nop instruction.
> > > >
> > > > The current uprobe execution goes through following:
> > > > - installs breakpoint instruction over original instruction
> > > > - exception handler hit and calls related uprobe consumers
> > > > - and either simulates original instruction or does out of line single step
> > > > execution of it
> > > > - returns to user space
> > > >
> > > > The optimized uprobe path
> > > >
> > > > - checks the original instruction is 5-byte nop (plus other checks)
> > > > - adds (or uses existing) user space trampoline and overwrites original
> > > > instruction (5-byte nop) with call to user space trampoline
> > > > - the user space trampoline executes uprobe syscall that calls related uprobe
> > > > consumers
> > > > - trampoline returns back to next instruction
> > > >
> > > > This approach won't speed up all uprobes as it's limited to using nop5 as
> > > > original instruction, but we could use nop5 as USDT probe instruction (which
> > > > uses single byte nop ATM) and speed up the USDT probes.
> > > >
> > > > This patch overloads related arch functions in uprobe_write_opcode and
> > > > set_orig_insn so they can install call instruction if needed.
> > > >
> > > > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > > > first uprobe hit. I originally had it called on uprobe installation but then
> > > > it clashed with elf loader, because the user space trampoline was added in a
> > > > place where loader might need to put elf segments, so I decided to do it after
> > > > first uprobe hit when loading is done.
> > > >
> > > > We do not unmap and release uprobe trampoline when it's no longer needed,
> > > > because there's no easy way to make sure none of the threads is still
> > > > inside the trampoline. But we do not waste memory, because there's just
> > > > single page for all the uprobe trampoline mappings.
> > > >
> > > > We do waste frame on page mapping for every 4GB by keeping the uprobe
> > > > trampoline page mapped, but that seems ok.
> > > >
> > > > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> > > >
> > > > current:
> > > > usermode-count : 818.836 ± 2.842M/s
> > > > syscall-count : 8.917 ± 0.003M/s
> > > > uprobe-nop : 3.056 ± 0.013M/s
> > > > uprobe-push : 2.903 ± 0.002M/s
> > > > uprobe-ret : 1.533 ± 0.001M/s
> > > > --> uprobe-nop5 : 1.492 ± 0.000M/s
> > > > uretprobe-nop : 1.783 ± 0.000M/s
> > > > uretprobe-push : 1.672 ± 0.001M/s
> > > > uretprobe-ret : 1.067 ± 0.002M/s
> > > > --> uretprobe-nop5 : 1.052 ± 0.000M/s
> > > >
> > > > after the change:
> > > >
> > > > usermode-count : 818.386 ± 1.886M/s
> > > > syscall-count : 8.923 ± 0.003M/s
> > > > uprobe-nop : 3.086 ± 0.005M/s
> > > > uprobe-push : 2.751 ± 0.001M/s
> > > > uprobe-ret : 1.481 ± 0.000M/s
> > > > --> uprobe-nop5 : 4.016 ± 0.002M/s
> > > > uretprobe-nop : 1.712 ± 0.008M/s
> > > > uretprobe-push : 1.616 ± 0.001M/s
> > > > uretprobe-ret : 1.052 ± 0.000M/s
> > > > --> uretprobe-nop5 : 2.015 ± 0.000M/s
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > > arch/x86/include/asm/uprobes.h | 6 ++
> > > > arch/x86/kernel/uprobes.c | 191 ++++++++++++++++++++++++++++++++-
> > > > include/linux/uprobes.h | 6 +-
> > > > kernel/events/uprobes.c | 16 ++-
> > > > 4 files changed, 209 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > > > index 678fb546f0a7..7d4df920bb59 100644
> > > > --- a/arch/x86/include/asm/uprobes.h
> > > > +++ b/arch/x86/include/asm/uprobes.h
> > > > @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> > > > #define UPROBE_SWBP_INSN 0xcc
> > > > #define UPROBE_SWBP_INSN_SIZE 1
> > > >
> > > > +enum {
> > > > + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> > > > +};
> > > > +
> > > > struct uprobe_xol_ops;
> > > >
> > > > struct arch_uprobe {
> > > > @@ -45,6 +49,8 @@ struct arch_uprobe {
> > > > u8 ilen;
> > > > } push;
> > > > };
> > > > +
> > > > + unsigned long flags;
> > > > };
> > > >
> > > > struct arch_uprobe_task {
> > > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > > index e8aebbda83bc..73ddff823904 100644
> > > > --- a/arch/x86/kernel/uprobes.c
> > > > +++ b/arch/x86/kernel/uprobes.c
> > > > @@ -18,6 +18,7 @@
> > > > #include <asm/processor.h>
> > > > #include <asm/insn.h>
> > > > #include <asm/mmu_context.h>
> > > > +#include <asm/nops.h>
> > > >
> > > > /* Post-execution fixups. */
> > > >
> > > > @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > > > return NULL;
> > > > }
> > > >
> > > > -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > > +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > > {
> > > > struct uprobes_state *state = &current->mm->uprobes_state;
> > > > struct uprobe_trampoline *tramp = NULL;
> > > > @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > > > kfree(tramp);
> > > > }
> > > >
> > > > -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > > +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > > {
> > > > if (tramp == NULL)
> > > > return;
> > > > @@ -807,6 +808,7 @@ struct mm_uprobe {
> > > > struct rb_node rb_node;
> > > > unsigned long auprobe;
> > > > unsigned long vaddr;
> > > > + bool optimized;
> > > > };
> > > >
> > >
> > > I'm trying to understand if this RB-tree based mm_uprobe is strictly
> > > necessary. Is it? Sure we keep optimized flag, but that's more for
> > > defensive checks, no? Is there any other reason we need this separate
> > > look up data structure?
> >
> > so the call instruction update is done in 2 locked steps:
> > - first we write breakpoint as part of normal uprobe registration
> > - then uprobe is hit, we overwrite breakpoint with call instruction
> >
> > in between we could race with another thread that could either unregister the
> > uprobe or try to optimize the uprobe as well
> >
> > I think we either need to keep the state of the uprobe per process (mm_struct),
> > or we would need to read the probed instruction each time when we need to make
> > decision based on what state are we at (nop5,breakpoint,call)
>
> This decision is only done in "slow path", right? Only when
> registering/unregistering. And those operations are done under lock.
> So reading those 5 bytes every time we register/unregister seems
> completely acceptable, rather than now *also* having a per-mm uprobe
> lookup tree.

true.. I was also thinking about having another flag in that tree for
when we fail to optimize the uprobe for a reason other than it being on
page alignment.. without such a flag we'd need to read the 5 bytes each
time we hit that uprobe.. but that might be a rare case

jirka
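
For illustration, the "read the 5 bytes and decide" idea discussed above
could look roughly like the sketch below. It is a standalone user-space
approximation, not code from the patch: classify_site(), crosses_page(),
enum site_state and SKETCH_PAGE_SIZE are hypothetical names, the 4 KiB page
size is an assumption, and treating "page alignment" as "the 5-byte
instruction crosses a page boundary" is only one reading of the comment
above. The opcode bytes themselves (0xcc for int3, 0xe8 for call rel32,
0f 1f 44 00 00 for the 5-byte nop) are the standard x86 encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define SKETCH_INSN_LEN  5	/* nop5 and call rel32 are both 5 bytes */

/* Hypothetical names for the three states mentioned in the thread. */
enum site_state {
	SITE_NOP5,		/* original instruction, no uprobe installed */
	SITE_BREAKPOINT,	/* int3 written over the first byte */
	SITE_CALL,		/* optimized: call to the trampoline */
	SITE_UNKNOWN,		/* anything else */
};

static const uint8_t sketch_nop5[SKETCH_INSN_LEN] = {
	0x0f, 0x1f, 0x44, 0x00, 0x00
};

/* Decide the state of a probe site from the 5 bytes read at its address. */
static enum site_state classify_site(const uint8_t *insn)
{
	assert(insn != NULL);

	if (insn[0] == 0xcc)	/* int3; the remaining 4 bytes are stale */
		return SITE_BREAKPOINT;
	if (insn[0] == 0xe8)	/* call rel32 */
		return SITE_CALL;
	if (memcmp(insn, sketch_nop5, SKETCH_INSN_LEN) == 0)
		return SITE_NOP5;
	return SITE_UNKNOWN;
}

/* True when the 5 bytes starting at vaddr spill into the next page. */
static bool crosses_page(unsigned long vaddr)
{
	return (vaddr % SKETCH_PAGE_SIZE) + SKETCH_INSN_LEN > SKETCH_PAGE_SIZE;
}
```

In this sketch a register/unregister path would call classify_site() under
the lock on the bytes it just read, and skip optimization when
crosses_page() is true. Any other failure to optimize would still have to
be remembered somewhere, which is what the extra flag would be for.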