From: Jiri Olsa <olsajiri@gmail.com>
To: Jiri Olsa <olsajiri@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
yunwei356@gmail.com, bpf <bpf@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
lsf-pc <lsf-pc@lists.linux-foundation.org>,
Yonghong Song <yonghong.song@linux.dev>,
Oleg Nesterov <oleg@redhat.com>,
Daniel Borkmann <daniel@iogearbox.net>
Subject: Re: [LSF/MM/BPF TOPIC] faster uprobes
Date: Mon, 11 Mar 2024 11:59:36 +0100 [thread overview]
Message-ID: <Ze7kGP5aAYGseRuM@krava> (raw)
In-Reply-To: <ZebWqIgABbk8dQXT@krava>
On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:
> On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:
> > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > > > > >
> > > > > > > > One of uprobe pain points is having slow execution that involves
> > > > > > > > two traps in worst case scenario or single trap if the original
> > > > > > > > instruction can be emulated. For return uprobes there's one extra
> > > > > > > > trap on top of that.
> > > > > > > >
> > > > > > > > My current idea on how to make this faster is to follow the optimized
> > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > > > > > user space trampoline that:
> > > > > > > >
> > > > > > > > - executes syscall to call uprobe consumers callbacks
> > > > > > >
> > > > > > > Did you get a chance to measure relative performance of syscall vs
> > > > > > > int3 interrupt handling? If not, do you think you'll be able to get
> > > > > > > some numbers by the time the conference starts? This should inform the
> > > > > > > decision whether it even makes sense to go through all the trouble.
> > > > > >
> > > > > > right, will do that
> > > > >
> > > > > I believe Yusheng measured syscall vs uprobe performance
> > > > > difference during LPC. iirc it was something like 3x.
> > > >
> > > > Do you have a link to slides? Was it actual uprobe vs just some fast
> > > > syscall (not doing BPF program execution) comparison? Or comparing the
> > > > performance of int3 handling vs equivalent syscall handling.
> > > >
> > > > I suspect it's the former, and so probably not that representative.
> > > > I'm curious about the performance of going
> > > > userspace->kernel->userspace through int3 vs syscall (all other things
> > > > being equal).
> > >
> > > I have a simple test [1] comparing:
> > > - uprobe with 2 traps
> > > - uprobe with 1 trap
> > > - syscall executing uprobe
> > >
> > > the syscall takes uprobe address as argument, finds the uprobe and executes
> > > its consumers, which should be comparable to what the trampoline will do
> > >
> > > test does same amount of loops triggering each uprobe type and measures
> > > the time it took
> > >
> > > # ./test_progs -t uprobe_syscall_bench -v
> > > bpf_testmod.ko is already unloaded.
> > > Loading bpf_testmod.ko...
> > > Successfully loaded bpf_testmod.ko.
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
> > > test_bench_1:PASS:uprobe_bench__attach 0 nsec
> > > test_bench_1:PASS:uprobe1_cnt 0 nsec
> > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
> > > test_bench_1:PASS:uprobe2_cnt 0 nsec
> > > test_bench_1: uprobes (1 trap) in 36.439s
> > > test_bench_1: uprobes (2 trap) in 91.960s
> > > test_bench_1: syscalls in 17.872s
> > > #395/1 uprobe_syscall_bench/bench_1:OK
> > > #395 uprobe_syscall_bench:OK
> > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> > > and ~5x faster than 2 traps uprobe
> > >
> >
> > Thanks for running benchmarks! I quickly looked at the selftest and
> > noticed this:
> >
> > +/*
> > + * Assuming following prolog:
> > + *
> > + * 6984ac: 55 push %rbp
> > + * 6984ad: 48 89 e5 mov %rsp,%rbp
> > + */
> > +noinline void uprobe2_bench_trigger(void)
> > +{
> > + asm volatile ("");
> > +}
> >
> > This actually will be optimized out to just ret in -O2 mode (make
> > RELEASE=1 for selftests):
> >
> > 00000000005a0ce0 <uprobe2_bench_trigger>:
> > 5a0ce0: c3 retq
> > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
> > 5a0cec: 0f 1f 40 00 nopl (%rax)
> >
> > So be careful with that.
>
> right, I did not mean for this to be checked in, just wanted to get the
> numbers quickly
>
> >
> > Also, I just updated our existing set of uprobe benchmarks (see [0]),
> > do you mind adding your syscall-based one as another one there and
> > running all of them and sharing the numbers with us? Very curious to
> > see both absolute and relative numbers from that benchmark. (and
> > please do build with RELEASE=1)
> >
> > You should be able to just run benchs/run_bench_uprobes.sh (also don't
> > forget to add your syscall-based benchmark to the list of benchmarks
> > in that shell script).
>
> yes, saw it and was going to run/compare it.. it's good idea to add
> the syscall one and get all numbers together, will do that
>
> >
> > Thank you!
> >
> >
> > BTW, while I think patching multiple instructions for syscall-based
> > uprobe is going to be extremely tricky, I think at least u*ret*probe's
> > int3 can be pretty easily optimized away with syscall, given that the
> > kernel controls code generation there. If anything, it will get the
> > uretprobe case a bit closer to the performance of uprobe. Give it some
> > thought.
>
> hm, right.. the trampoline is there already, but at the moment is global
> and used by all uretprobes.. and int3 code moves userspace (changes rip)
> to the original return address.. maybe we can do that through syscall
> as well
it seems like good idea, I tried change below (use syscall on return
trampoline) and got some speedup:
current:
base : 15.817 ± 0.009M/s
uprobe-nop : 2.901 ± 0.000M/s
uprobe-push : 2.743 ± 0.002M/s
uprobe-ret : 1.089 ± 0.001M/s
uretprobe-nop : 1.448 ± 0.001M/s
uretprobe-push : 1.407 ± 0.001M/s
uretprobe-ret : 0.792 ± 0.001M/s
with syscall:
base : 15.831 ± 0.026M/s
uprobe-nop : 2.904 ± 0.001M/s
uprobe-push : 2.764 ± 0.002M/s
uprobe-ret : 1.082 ± 0.001M/s
uretprobe-nop : 1.785 ± 0.000M/s
uretprobe-push : 1.733 ± 0.001M/s
uretprobe-ret : 0.885 ± 0.004M/s
~23% for nop/push (emulated) cases, ~11% for ret (sstep) case
jirka
---
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
459 common lsm_get_self_attr sys_lsm_get_self_attr
460 common lsm_set_self_attr sys_lsm_set_self_attr
461 common lsm_list_modules sys_lsm_list_modules
+462 64 uprobe sys_uprobe
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..fceef2b4e243 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,11 +12,13 @@
#include <linux/ptrace.h>
#include <linux/uprobes.h>
#include <linux/uaccess.h>
+#include <linux/syscalls.h>
#include <linux/kdebug.h>
#include <asm/processor.h>
#include <asm/insn.h>
#include <asm/mmu_context.h>
+#include <asm/syscalls.h>
/* Post-execution fixups. */
@@ -308,6 +310,53 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
}
#ifdef CONFIG_X86_64
+
+asm (
+ ".pushsection .rodata\n"
+ ".global uretprobe_syscall_entry\n"
+ "uretprobe_syscall_entry:\n"
+ "push %rax\n"
+ "mov $462, %rax\n"
+ "syscall\n"
+ ".global uretprobe_syscall_end\n"
+ "uretprobe_syscall_end:\n"
+ ".popsection\n"
+);
+
+extern __visible u8 uretprobe_syscall_entry[];
+extern __visible u8 uretprobe_syscall_end[];
+
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize)
+{
+ *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+ return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE1(uprobe, unsigned long, cmd)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long ax, err;
+
+ /*
+ * We get invoked from the trampoline that pushed rax
+ * value on stack, read and restore the value.
+ */
+ err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax));
+ WARN_ON_ONCE(err);
+
+ regs->ax = ax;
+ regs->orig_ax = ax;
+
+ /*
+ * And pop the stack back, because we jump to original
+ * return value.
+ */
+ regs->sp = regs->sp + sizeof(regs->sp);
+
+ uprobe_handle_trampoline(regs);
+ return ax;
+}
+
/*
* If arch_uprobe->insn doesn't use rip-relative addressing, return
* immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..4f7d5b41b718 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,6 +987,8 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
asmlinkage long sys_spu_create(const char __user *name,
unsigned int flags, umode_t mode, int fd);
+asmlinkage long sys_uprobe(unsigned long cmd);
+
/*
* Deprecated system calls which are still defined in
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..2702799648e6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
#define __NR_lsm_list_modules 461
__SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
+#define __NR_uprobe 462
+__SYSCALL(__NR_uprobe, sys_uprobe)
+
#undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
/*
* 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..fefaf4804e1f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,10 +1474,19 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
return ret;
}
+uprobe_opcode_t* __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+ static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+ *psize = UPROBE_SWBP_INSN_SIZE;
+ return &insn;
+}
+
static struct xol_area *__create_xol_area(unsigned long vaddr)
{
struct mm_struct *mm = current->mm;
- uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+ unsigned long insns_size;
+ uprobe_opcode_t *insns;
struct xol_area *area;
area = kmalloc(sizeof(*area), GFP_KERNEL);
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
/* Reserve the 1st slot for get_trampoline_vaddr() */
set_bit(0, area->bitmap);
atomic_set(&area->slot_count, 1);
- arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+ insns = arch_uprobe_trampoline(&insns_size);
+ arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
if (!xol_add_vma(mm, area))
return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
return ri;
}
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
{
struct uprobe_task *utask;
struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
bp_vaddr = uprobe_get_swbp_addr(regs);
if (bp_vaddr == get_trampoline_vaddr())
- return handle_trampoline(regs);
+ return uprobe_handle_trampoline(regs);
uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..ddc954f28317 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
/* restartable sequence */
COND_SYSCALL(rseq);
+
+COND_SYSCALL(uprobe);
next prev parent reply other threads:[~2024-03-11 10:59 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-29 14:39 [LSF/MM/BPF TOPIC] faster uprobes Jiri Olsa
2024-03-01 0:25 ` Andrii Nakryiko
2024-03-01 8:18 ` Jiri Olsa
2024-03-01 17:01 ` Alexei Starovoitov
2024-03-01 17:26 ` Andrii Nakryiko
2024-03-01 18:08 ` Yunwei 123
2024-03-03 10:20 ` Jiri Olsa
2024-03-05 0:55 ` Andrii Nakryiko
2024-03-05 8:24 ` Jiri Olsa
2024-03-05 15:30 ` Jiri Olsa
2024-03-05 17:30 ` Andrii Nakryiko
2024-03-11 10:59 ` Jiri Olsa [this message]
2024-03-11 15:06 ` Oleg Nesterov
2024-03-11 16:46 ` Jiri Olsa
2024-03-11 17:02 ` Oleg Nesterov
2024-03-11 21:11 ` Jiri Olsa
2024-03-11 17:32 ` Andrii Nakryiko
2024-03-11 21:26 ` Jiri Olsa
2024-03-11 23:05 ` Andrii Nakryiko
2024-03-02 20:46 ` Jiri Olsa
2024-03-02 21:08 ` Alexei Starovoitov
2024-03-02 21:49 ` Oleg Nesterov
2024-03-01 19:39 ` Kui-Feng Lee
2024-03-05 17:18 ` Jiri Olsa
2024-03-05 23:53 ` Song Liu
2024-03-07 9:15 ` Jiri Olsa
2024-03-07 23:02 ` Kui-Feng Lee
2024-03-08 15:43 ` Andrei Matei
2024-03-12 17:16 ` Kui-Feng Lee
2024-03-13 1:32 ` Andrei Matei
2024-03-13 5:42 ` Kui-Feng Lee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Ze7kGP5aAYGseRuM@krava \
--to=olsajiri@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii.nakryiko@gmail.com \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=oleg@redhat.com \
--cc=yonghong.song@linux.dev \
--cc=yunwei356@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.