linux-trace-kernel.vger.kernel.org archive mirror
* [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
@ 2024-11-05 13:33 Jiri Olsa
  2024-11-05 13:33 ` [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
                   ` (12 more replies)
  0 siblings, 13 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

hi,
this patchset adds support to optimize usdt probes that sit on top of
a 5-byte nop instruction.

The generic approach (optimizing all uprobes) is hard, because it means
emulating any of the possible original instructions, with all the
related issues. The usdt case, which stores a 5-byte nop, seems much
easier, so we are starting with that.

The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 7.
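
For illustration, a simplified sketch of the resulting control flow
(the trampoline code is taken from patch 6): the probed 5-byte nop is
rewritten into a call to a per-process trampoline, which enters the
kernel through the new uprobe syscall and then returns to the next
instruction:

        probed function:               uprobe trampoline:
          ...
          call <trampoline>  ------>     push %rcx
       .> <next instruction>             push %r11
       |  ...                            push %rax
       |                                 movq $__NR_uprobe, %rax
       |                                 syscall  /* calls consumers */
       |                                 pop %rax
       |                                 pop %r11
       |                                 pop %rcx
       '-------------------------------  ret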

The first benchmark shows about a 68% speed up (8.249M vs 4.893M hits
per second, see below). The benchmark triggers the usdt probe in a loop
and counts how many hits happen per second.

It's still in RFC state with some loose ends, but I'd be interested in
any feedback on the direction of this.

It's based on tip/perf/core with bpf-next/master merged on top of that,
together with the uprobe session patchset.

thanks,
jirka


current:
        # ./bench -w2 -d5 -a  trig-usdt
        Setting up benchmark 'trig-usdt'...
        Benchmark 'trig-usdt' started.
        Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
        Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
        Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
        Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
        Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
        Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
        Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
        Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
        Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s

optimized:
        # ./bench -w2 -d5 -a  trig-usdt
        Setting up benchmark 'trig-usdt'...
        Benchmark 'trig-usdt' started.
        Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
        Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
        Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
        Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
        Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
        Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
        Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
        Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
        Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s

---
Jiri Olsa (11):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Add len argument to uprobe_write_opcode
      uprobes: Add data argument to uprobe_write_opcode function
      uprobes: Add mapping for optimized uprobe trampolines
      uprobes: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Add usdt trigger bench
      selftests/bpf: Add uprobe/usdt optimized test
      selftests/bpf: Add hit/attach/detach race optimized uprobe test

 arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
 arch/x86/include/asm/uprobes.h                            |   7 +++
 arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                  |   2 +
 include/linux/uprobes.h                                   |  25 +++++++++-
 kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/fork.c                                             |   2 +
 kernel/sys_ni.c                                           |   1 +
 tools/testing/selftests/bpf/bench.c                       |   2 +
 tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
 tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
 tools/testing/selftests/bpf/sdt.h                         |   9 +++-
 14 files changed, 768 insertions(+), 19 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c


* [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
@ 2024-11-05 13:33 ` Jiri Olsa
  2024-11-05 13:33 ` [RFC perf/core 02/11] uprobes: Make copy_from_page global Jiri Olsa
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

We are about to add a uprobe trampoline, so let's clean up the namespace.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 2 +-
 include/linux/uprobes.h   | 2 +-
 kernel/events/uprobes.c   | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 5a952c5ea66b..22a17c149a55 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -338,7 +338,7 @@ extern u8 uretprobe_trampoline_entry[];
 extern u8 uretprobe_trampoline_end[];
 extern u8 uretprobe_syscall_check[];
 
-void *arch_uprobe_trampoline(unsigned long *psize)
+void *arch_uretprobe_trampoline(unsigned long *psize)
 {
 	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
 	struct pt_regs *regs = task_pt_regs(current);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 7a051b5d2edd..2f500bc97263 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -211,7 +211,7 @@ extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
-extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a76ddc5fc982..0b04c051d712 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1696,7 +1696,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
-void * __weak arch_uprobe_trampoline(unsigned long *psize)
+void * __weak arch_uretprobe_trampoline(unsigned long *psize)
 {
 	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
 
@@ -1728,7 +1728,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	init_waitqueue_head(&area->wq);
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
-	insns = arch_uprobe_trampoline(&insns_size);
+	insns = arch_uretprobe_trampoline(&insns_size);
 	arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
-- 
2.47.0



* [RFC perf/core 02/11] uprobes: Make copy_from_page global
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
  2024-11-05 13:33 ` [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
@ 2024-11-05 13:33 ` Jiri Olsa
  2024-11-14 23:40   ` Andrii Nakryiko
  2024-11-05 13:33 ` [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode Jiri Olsa
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Making copy_from_page global and adding the uprobe_ prefix.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2f500bc97263..28068f9fcdc1 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
+extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 0b04c051d712..e9308649bba3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -250,7 +250,7 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
 	return is_swbp_insn(insn);
 }
 
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
 {
 	void *kaddr = kmap_atomic(page);
 	memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
@@ -278,7 +278,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	 * is a trap variant; uprobes always wins over any other (gdb)
 	 * breakpoint.
 	 */
-	copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
 	is_swbp = is_swbp_insn(&old_opcode);
 
 	if (is_swbp_insn(new_opcode)) {
@@ -1027,7 +1027,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
 	if (IS_ERR(page))
 		return PTR_ERR(page);
 
-	copy_from_page(page, offset, insn, nbytes);
+	uprobe_copy_from_page(page, offset, insn, nbytes);
 	put_page(page);
 
 	return 0;
@@ -1368,7 +1368,7 @@ struct uprobe *uprobe_register(struct inode *inode,
 		return ERR_PTR(-EINVAL);
 
 	/*
-	 * This ensures that copy_from_page(), copy_to_page() and
+	 * This ensures that uprobe_copy_from_page(), copy_to_page() and
 	 * __update_ref_ctr() can't cross page boundary.
 	 */
 	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
@@ -2288,7 +2288,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	if (result < 0)
 		return result;
 
-	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 	put_page(page);
  out:
 	/* This needs to return true for any variant of the trap insn */
-- 
2.47.0



* [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
  2024-11-05 13:33 ` [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
  2024-11-05 13:33 ` [RFC perf/core 02/11] uprobes: Make copy_from_page global Jiri Olsa
@ 2024-11-05 13:33 ` Jiri Olsa
  2024-11-14 23:41   ` Andrii Nakryiko
  2024-11-05 13:33 ` [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function Jiri Olsa
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a len argument to uprobe_write_opcode in preparation
for writing longer instructions in the following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  3 ++-
 kernel/events/uprobes.c | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 28068f9fcdc1..7d23a4fee6f4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -181,7 +181,8 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
 extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
+			       unsigned long vaddr, uprobe_opcode_t *insn, int len);
 extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e9308649bba3..3e275717789b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -471,7 +471,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-			unsigned long vaddr, uprobe_opcode_t opcode)
+			unsigned long vaddr, uprobe_opcode_t *insn, int len)
 {
 	struct uprobe *uprobe;
 	struct page *old_page, *new_page;
@@ -480,7 +480,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
 
-	is_register = is_swbp_insn(&opcode);
+	is_register = is_swbp_insn(insn);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
@@ -491,7 +491,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (IS_ERR(old_page))
 		return PTR_ERR(old_page);
 
-	ret = verify_opcode(old_page, vaddr, &opcode);
+	ret = verify_opcode(old_page, vaddr, insn);
 	if (ret <= 0)
 		goto put_old;
 
@@ -525,7 +525,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	__SetPageUptodate(new_page);
 	copy_highpage(new_page, old_page);
-	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+	copy_to_page(new_page, vaddr, insn, len);
 
 	if (!is_register) {
 		struct page *orig_page;
@@ -582,7 +582,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
  */
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
+	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
 }
 
 /**
@@ -598,7 +600,7 @@ int __weak
 set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
 	return uprobe_write_opcode(auprobe, mm, vaddr,
-			*(uprobe_opcode_t *)&auprobe->insn);
+			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
 }
 
 /* uprobe should have guaranteed positive refcount */
-- 
2.47.0



* [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (2 preceding siblings ...)
  2024-11-05 13:33 ` [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode Jiri Olsa
@ 2024-11-05 13:33 ` Jiri Olsa
  2024-11-14 23:41   ` Andrii Nakryiko
  2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a data argument to the uprobe_write_opcode function and passing
it to the newly added arch overloaded functions:

  arch_uprobe_verify_opcode
  arch_uprobe_is_register

This way each architecture can provide customized verification.
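
For example, x86 (in patch 7) passes a non-NULL data cookie when it
writes the optimized 5-byte call, so its arch_uprobe_is_register and
arch_uprobe_verify_opcode overloads can tell that write apart from the
regular breakpoint write; roughly:

    /* optimized path (patch 7): data != NULL, insn is the 5-byte call */
    uprobe_write_opcode(auprobe, mm, vaddr, call, 5, (void *) 1);

    /* regular path: data == NULL, insn is the breakpoint */
    uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE, NULL);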

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  6 +++++-
 kernel/events/uprobes.c | 25 +++++++++++++++++++------
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 7d23a4fee6f4..be306028ed59 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -182,7 +182,7 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-			       unsigned long vaddr, uprobe_opcode_t *insn, int len);
+			       unsigned long vaddr, uprobe_opcode_t *insn, int len, void *data);
 extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
@@ -215,6 +215,10 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
 extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
+extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
+extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+				     uprobe_opcode_t *new_opcode, void *data);
+extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 3e275717789b..944d9df1f081 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -264,7 +264,13 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
 	kunmap_atomic(kaddr);
 }
 
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data)
+{
+	return is_swbp_insn(insn);
+}
+
+int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+			 uprobe_opcode_t *new_opcode)
 {
 	uprobe_opcode_t old_opcode;
 	bool is_swbp;
@@ -292,6 +298,12 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	return 1;
 }
 
+__weak int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+				     uprobe_opcode_t *new_opcode, void *data)
+{
+	return uprobe_verify_opcode(page, vaddr, new_opcode);
+}
+
 static struct delayed_uprobe *
 delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
 {
@@ -471,7 +483,8 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-			unsigned long vaddr, uprobe_opcode_t *insn, int len)
+			unsigned long vaddr, uprobe_opcode_t *insn, int len,
+			void *data)
 {
 	struct uprobe *uprobe;
 	struct page *old_page, *new_page;
@@ -480,7 +493,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
 
-	is_register = is_swbp_insn(insn);
+	is_register = arch_uprobe_is_register(insn, len, data);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
@@ -491,7 +504,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (IS_ERR(old_page))
 		return PTR_ERR(old_page);
 
-	ret = verify_opcode(old_page, vaddr, insn);
+	ret = arch_uprobe_verify_opcode(old_page, vaddr, insn, data);
 	if (ret <= 0)
 		goto put_old;
 
@@ -584,7 +597,7 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
 {
 	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
 
-	return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
+	return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE, NULL);
 }
 
 /**
@@ -600,7 +613,7 @@ int __weak
 set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
 	return uprobe_write_opcode(auprobe, mm, vaddr,
-			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
+			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
 }
 
 /* uprobe should have guaranteed positive refcount */
-- 
2.47.0



* [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (3 preceding siblings ...)
  2024-11-05 13:33 ` [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function Jiri Olsa
@ 2024-11-05 13:33 ` Jiri Olsa
  2024-11-05 14:23   ` Peter Zijlstra
  2024-11-14 23:44   ` Andrii Nakryiko
  2024-11-05 13:34 ` [RFC perf/core 06/11] uprobes: Add uprobe syscall to speed up uprobe Jiri Olsa
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding an interface to add a special mapping for a user space page that
will be used as a placeholder for the uprobe trampoline in the following
changes.

The get_tramp_area(vaddr) function either finds a 'callable' page or
creates a new one. 'Callable' means the page is reachable by a call
instruction (from the vaddr argument), which is decided by each arch via
the new arch_uprobe_is_callable function.

The put_tramp_area function either drops the refcount or destroys the
special mapping; all the mappings are cleaned up when the process goes
down.
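
For illustration, the intended usage from arch code (patch 7 follows
this pattern) is roughly:

    struct tramp_area *area;

    area = get_tramp_area(vaddr);   /* takes a ref or creates the mapping */
    if (area) {
            /* area->vaddr is reachable by a call insn from vaddr */
            ...
            put_tramp_area(area);   /* drops the ref, destroys on last put */
    }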

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  12 ++++
 kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |   2 +
 3 files changed, 155 insertions(+)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index be306028ed59..222d8e82cee2 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -172,6 +172,15 @@ struct xol_area;
 
 struct uprobes_state {
 	struct xol_area		*xol_area;
+	struct hlist_head	tramp_head;
+	struct mutex		tramp_mutex;
+};
+
+struct tramp_area {
+	unsigned long		vaddr;
+	struct page		*page;
+	struct hlist_node	node;
+	refcount_t		ref;
 };
 
 extern void __init uprobes_init(void);
@@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
 extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
 				     uprobe_opcode_t *new_opcode, void *data);
 extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
+struct tramp_area *get_tramp_area(unsigned long vaddr);
+void put_tramp_area(struct tramp_area *area);
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 944d9df1f081..a44305c559a4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
 }
 
+bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	return false;
+}
+
+static unsigned long find_nearest_page(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	prev = vma_next(&vmi);
+	vma = vma_next(&vmi);
+	while (vma) {
+		if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
+		    arch_uprobe_is_callable(prev->vm_end, vaddr))
+			return prev->vm_end;
+
+		prev = vma;
+		vma = vma_next(&vmi);
+	}
+
+	return 0;
+}
+
+static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
+			      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
+	struct tramp_area *area;
+
+	hlist_for_each_entry(area, head, node) {
+		if (vma->vm_start == area->vaddr) {
+			vmf->page = area->page;
+			get_page(vmf->page);
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+	return -EPERM;
+}
+
+static const struct vm_special_mapping tramp_mapping = {
+	.name = "[uprobes-trampoline]",
+	.fault = tramp_fault,
+	.mremap = tramp_mremap,
+};
+
+static struct tramp_area *create_tramp_area(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct tramp_area *area;
+
+	vaddr = find_nearest_page(vaddr);
+	if (!vaddr)
+		return NULL;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		goto free_area;
+
+	refcount_set(&area->ref, 1);
+	area->vaddr = vaddr;
+
+	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+				VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+				&tramp_mapping);
+	if (!IS_ERR(vma))
+		return area;
+
+	__free_page(area->page);
+ free_area:
+	kfree(area);
+	return NULL;
+}
+
+struct tramp_area *get_tramp_area(unsigned long vaddr)
+{
+	struct uprobes_state *state = &current->mm->uprobes_state;
+	struct tramp_area *area = NULL;
+
+	mutex_lock(&state->tramp_mutex);
+	hlist_for_each_entry(area, &state->tramp_head, node) {
+		if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
+			refcount_inc(&area->ref);
+			goto unlock;
+		}
+	}
+
+	area = create_tramp_area(vaddr);
+	if (area)
+		hlist_add_head(&area->node, &state->tramp_head);
+
+unlock:
+	mutex_unlock(&state->tramp_mutex);
+	return area;
+}
+
+static void destroy_tramp_area(struct tramp_area *area)
+{
+	hlist_del(&area->node);
+	put_page(area->page);
+	kfree(area);
+}
+
+void put_tramp_area(struct tramp_area *area)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobes_state *state = &mm->uprobes_state;
+
+	if (area == NULL)
+		return;
+
+	mutex_lock(&state->tramp_mutex);
+	if (refcount_dec_and_test(&area->ref))
+		destroy_tramp_area(area);
+	mutex_unlock(&state->tramp_mutex);
+}
+
+static void clear_tramp_head(struct mm_struct *mm)
+{
+	struct uprobes_state *state = &mm->uprobes_state;
+	struct tramp_area *area;
+	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(area, n, &state->tramp_head, node)
+		destroy_tramp_area(area);
+}
+
 /* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
@@ -1788,6 +1927,8 @@ void uprobe_clear_state(struct mm_struct *mm)
 	delayed_uprobe_remove(NULL, mm);
 	mutex_unlock(&delayed_uprobe_lock);
 
+	clear_tramp_head(mm);
+
 	if (!area)
 		return;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..b1fe431e5cce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1248,6 +1248,8 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
 	mm->uprobes_state.xol_area = NULL;
+	mutex_init(&mm->uprobes_state.tramp_mutex);
+	INIT_HLIST_HEAD(&mm->uprobes_state.tramp_head);
 #endif
 }
 
-- 
2.47.0



* [RFC perf/core 06/11] uprobes: Add uprobe syscall to speed up uprobe
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (4 preceding siblings ...)
  2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a new uprobe syscall that calls the uprobe handlers for a given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline, which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address.

With this address we find the related uprobe object and call its
consumers.

TODO: allow the uprobe syscall to be called only from the uprobe
trampoline.
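
For reference, the trampoline below pushes three registers before the
syscall, so the user stack seen by the syscall handler looks as follows;
that's why the handler reads the return address at regs->sp + 3*8 and
subtracts the 5-byte call instruction size to get the 'breakpoint'
address:

    /*
     * regs->sp + 0x00: saved rax
     * regs->sp + 0x08: saved r11
     * regs->sp + 0x10: saved rcx
     * regs->sp + 0x18: return address (probed address + 5)
     */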

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/kernel/uprobes.c              | 48 ++++++++++++++++++++++++++
 include/linux/syscalls.h               |  2 ++
 include/linux/uprobes.h                |  2 ++
 kernel/events/uprobes.c                | 35 +++++++++++++++++++
 kernel/sys_ni.c                        |  1 +
 6 files changed, 89 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7093ee21c0d1..f6299d57afe5 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 333	common	io_pgetevents		sys_io_pgetevents
 334	common	rseq			sys_rseq
 335	common	uretprobe		sys_uretprobe
+336	common	uprobe			sys_uprobe
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
 424	common	pidfd_send_signal	sys_pidfd_send_signal
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 22a17c149a55..02aa4519b677 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -425,6 +425,54 @@ SYSCALL_DEFINE0(uretprobe)
 	return -1;
 }
 
+SYSCALL_DEFINE0(uprobe)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long bp_vaddr;
+	int err;
+
+	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
+	if (err) {
+		force_sig(SIGILL);
+		return -1;
+	}
+
+	handle_syscall_uprobe(regs, bp_vaddr - 5);
+	return 0;
+}
+
+asm (
+	".pushsection .rodata\n"
+	".global uprobe_trampoline_entry\n"
+	"uprobe_trampoline_entry:\n"
+	"push %rcx\n"
+	"push %r11\n"
+	"push %rax\n"
+	"movq $" __stringify(__NR_uprobe) ", %rax\n"
+	"syscall\n"
+	"pop %rax\n"
+	"pop %r11\n"
+	"pop %rcx\n"
+	"ret\n"
+	".global uprobe_trampoline_end\n"
+	"uprobe_trampoline_end:\n"
+	".popsection\n"
+);
+
+extern __visible u8 uprobe_trampoline_entry[];
+extern __visible u8 uprobe_trampoline_end[];
+
+void *arch_uprobe_trampoline(unsigned long *psize)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	if (user_64bit_mode(regs)) {
+		*psize = uprobe_trampoline_end - uprobe_trampoline_entry;
+		return uprobe_trampoline_entry;
+	}
+	return NULL;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 5758104921e6..a2573f9dd248 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -981,6 +981,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
 asmlinkage long sys_uretprobe(void);
 
+asmlinkage long sys_uprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
 				unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 222d8e82cee2..4024e6ea52a4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -231,6 +231,8 @@ extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
 struct tramp_area *get_tramp_area(unsigned long vaddr);
 void put_tramp_area(struct tramp_area *area);
 bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
+extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a44305c559a4..b8399684231c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -621,6 +621,11 @@ bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
 	return false;
 }
 
+void * __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	return NULL;
+}
+
 static unsigned long find_nearest_page(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
@@ -673,7 +678,13 @@ static struct tramp_area *create_tramp_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long tramp_size;
 	struct tramp_area *area;
+	void *tramp;
+
+	tramp = arch_uprobe_trampoline(&tramp_size);
+	if (!tramp)
+		return NULL;
 
 	vaddr = find_nearest_page(vaddr);
 	if (!vaddr)
@@ -690,6 +701,8 @@ static struct tramp_area *create_tramp_area(unsigned long vaddr)
 	refcount_set(&area->ref, 1);
 	area->vaddr = vaddr;
 
+	arch_uprobe_copy_ixol(area->page, 0, tramp, tramp_size);
+
 	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
 				VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
 				&tramp_mapping);
@@ -2757,6 +2770,28 @@ static void handle_swbp(struct pt_regs *regs)
 	rcu_read_unlock_trace();
 }
 
+void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
+{
+	struct uprobe *uprobe;
+	int is_swbp;
+
+	rcu_read_lock_trace();
+	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
+	if (!uprobe)
+		goto unlock;
+
+	if (!get_utask())
+		goto unlock;
+
+	if (arch_uprobe_ignore(&uprobe->arch, regs))
+		goto unlock;
+
+	handler_chain(uprobe, regs);
+
+unlock:
+	rcu_read_unlock_trace();
+}
+
 /*
  * Perform required fix-ups and disable singlestep.
  * Allow pending signals to take effect.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..bf5d05c635ff 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
 COND_SYSCALL(rseq);
 
 COND_SYSCALL(uretprobe);
+COND_SYSCALL(uprobe);
-- 
2.47.0



* [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (5 preceding siblings ...)
  2024-11-05 13:34 ` [RFC perf/core 06/11] uprobes: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-14 23:44   ` Andrii Nakryiko
  2024-11-18  8:18   ` Masami Hiramatsu
  2024-11-05 13:34 ` [RFC bpf-next 08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Putting together all the previously added pieces to support optimized
uprobes on top of a 5-byte nop instruction.

The current uprobe execution goes through the following steps:
  - install the breakpoint instruction over the original instruction
  - hit the exception handler, which calls the related uprobe consumers
  - either simulate the original instruction or do out-of-line single
    step execution of it
  - return to user space

The optimized uprobe path:

  - checks that the original instruction is a 5-byte nop (plus other checks)
  - adds (or reuses an existing) user space trampoline and overwrites the
    original instruction (5-byte nop) with a call to the trampoline (see
    the byte-level sketch below)
  - the user space trampoline executes the uprobe syscall, which calls the
    related uprobe consumers
  - the trampoline returns back to the next instruction
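
At the byte level, the rewrite done by the relative_call helper below is
simply (rel32 is computed as trampoline vaddr - (probed vaddr + 5)):

    before:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
    after:   e8 <rel32>            call   <uprobes-trampoline>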

This approach won't speed up all uprobes, as it's limited to a nop5
original instruction, but we could use nop5 as the USDT probe instruction
(which uses a single-byte nop at the moment) and speed up the USDT probes.

This patch overloads the related arch functions in uprobe_write_opcode
and set_orig_insn so they can install the call instruction if needed.

The arch_uprobe_optimize function triggers the uprobe optimization and is
called after the first uprobe hit. I originally had it called on uprobe
installation, but that clashed with the ELF loader, because the user space
trampoline was added in a place where the loader might need to put ELF
segments, so I decided to do it after the first uprobe hit, when loading
is done.

TODO: release the uprobe trampoline when it's no longer needed. We might
need to stop all cpus to make sure no user space thread is in the
trampoline, or we might just keep it, because there's just one page per
4GB memory region?

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/include/asm/uprobes.h |   7 ++
 arch/x86/kernel/uprobes.c      | 130 +++++++++++++++++++++++++++++++++
 include/linux/uprobes.h        |   1 +
 kernel/events/uprobes.c        |   3 +
 4 files changed, 141 insertions(+)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 678fb546f0a7..84a75ed748f0 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
 #define UPROBE_SWBP_INSN		0xcc
 #define UPROBE_SWBP_INSN_SIZE		   1
 
+enum {
+	ARCH_UPROBE_FLAG_CAN_OPTIMIZE	= 0,
+	ARCH_UPROBE_FLAG_OPTIMIZED	= 1,
+};
+
 struct uprobe_xol_ops;
 
 struct arch_uprobe {
@@ -45,6 +50,8 @@ struct arch_uprobe {
 			u8	ilen;
 		}			push;
 	};
+
+	unsigned long flags;
 };
 
 struct arch_uprobe_task {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 02aa4519b677..50ccf24ff42c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -18,6 +18,7 @@
 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/nops.h>
 
 /* Post-execution fixups. */
 
@@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
 	.emulate  = push_emulate_op,
 };
 
+static int is_nop5_insns(uprobe_opcode_t *insn)
+{
+	return !memcmp(insn, x86_nops[5], 5);
+}
+
+static int is_call_insns(uprobe_opcode_t *insn)
+{
+	return *insn == 0xe8;
+}
+
+static void relative_insn(void *dest, void *from, void *to, u8 op)
+{
+	struct __arch_relative_insn {
+		u8 op;
+		s32 raddr;
+	} __packed *insn;
+
+	insn = (struct __arch_relative_insn *)dest;
+	insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
+	insn->op = op;
+}
+
+static void relative_call(void *dest, void *from, void *to)
+{
+	relative_insn(dest, from, to, CALL_INSN_OPCODE);
+}
+
 /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
 static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 {
@@ -896,6 +924,10 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 		break;
 
 	case 0x0f:
+		if (is_nop5_insns((uprobe_opcode_t *) &auprobe->insn)) {
+			set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+			break;
+		}
 		if (insn->opcode.nbytes != 2)
 			return -ENOSYS;
 		/*
@@ -1267,3 +1299,101 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+			      uprobe_opcode_t *new_opcode, void *opt)
+{
+	if (opt) {
+		uprobe_opcode_t old_opcode[5];
+		bool is_call;
+
+		uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
+		is_call = is_call_insns((uprobe_opcode_t *) &old_opcode);
+
+		if (is_call_insns(new_opcode)) {
+			if (is_call)		/* register: already installed? */
+				return 0;
+		} else {
+			if (!is_call)		/* unregister: was it changed by us? */
+				return 0;
+		}
+
+		return 1;
+	}
+
+	return uprobe_verify_opcode(page, vaddr, new_opcode);
+}
+
+bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data)
+{
+	return data ? len == 5 && is_call_insns(insn) : is_swbp_insn(insn);
+}
+
+static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
+				   unsigned long vaddr)
+{
+	struct tramp_area *area = NULL;
+	char call[5];
+
+	/* We can't do cross page atomic writes yet. */
+	if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < 5)
+		goto fail;
+
+	area = get_tramp_area(vaddr);
+	if (!area)
+		goto fail;
+
+	relative_call(call, (void *) vaddr, (void *) area->vaddr);
+	if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5, (void *) 1))
+		goto fail;
+
+	set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags);
+	return;
+
+fail:
+	/* Once we fail we never try again. */
+	put_tramp_area(area);
+	clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+}
+
+static bool should_optimize(struct arch_uprobe *auprobe)
+{
+	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
+		return false;
+	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
+		return false;
+	return true;
+}
+
+void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (!should_optimize(auprobe))
+		return;
+
+	mmap_write_lock(mm);
+	if (should_optimize(auprobe))
+		__arch_uprobe_optimize(auprobe, mm, vaddr);
+	mmap_write_unlock(mm);
+}
+
+int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+	uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn;
+
+	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
+		return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5, (void *) 1);
+
+	return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE, NULL);
+}
+
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	unsigned long delta;
+
+	/* call instructions size */
+	vaddr += 5;
+	delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp;
+	return delta < 0xffffffff;
+}
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 4024e6ea52a4..42ab29f80220 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -233,6 +233,7 @@ void put_tramp_area(struct tramp_area *area);
 bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 extern void *arch_uprobe_trampoline(unsigned long *psize);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
+extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index b8399684231c..efe45fcd5d0a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2759,6 +2759,9 @@ static void handle_swbp(struct pt_regs *regs)
 
 	handler_chain(uprobe, regs);
 
+	/* Try to optimize after first hit. */
+	arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
+
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
 		goto out;
 
-- 
2.47.0



* [RFC bpf-next 08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (6 preceding siblings ...)
  2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-05 13:34 ` [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench Jiri Olsa
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Using a 5-byte nop for x86 usdt probes so we can switch
them to optimized uprobes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
index ca0162b4dc57..7ac9291f45f1 100644
--- a/tools/testing/selftests/bpf/sdt.h
+++ b/tools/testing/selftests/bpf/sdt.h
@@ -234,6 +234,13 @@ __extension__ extern unsigned long long __sdt_unsp;
 #define _SDT_NOP	nop
 #endif
 
+/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
+#if defined(__x86_64__)
+# define _SDT_DEF_NOP _SDT_ASM_5(990:	.byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
+#else
+# define _SDT_DEF_NOP _SDT_ASM_1(990:	_SDT_NOP)
+#endif
+
 #define _SDT_NOTE_NAME	"stapsdt"
 #define _SDT_NOTE_TYPE	3
 
@@ -286,7 +293,7 @@ __extension__ extern unsigned long long __sdt_unsp;
 
 #define _SDT_ASM_BODY(provider, name, pack_args, args, ...)		      \
   _SDT_DEF_MACROS							      \
-  _SDT_ASM_1(990:	_SDT_NOP)					      \
+  _SDT_DEF_NOP								      \
   _SDT_ASM_3(		.pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
   _SDT_ASM_1(		.balign 4)					      \
   _SDT_ASM_3(		.4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE)	      \
-- 
2.47.0



* [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (7 preceding siblings ...)
  2024-11-05 13:34 ` [RFC bpf-next 08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-14 23:40   ` Andrii Nakryiko
  2024-11-05 13:34 ` [RFC bpf-next 10/11] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a usdt trigger bench to measure optimized usdt probes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/bench.c           |  2 +
 .../selftests/bpf/benchs/bench_trigger.c      | 45 +++++++++++++++++++
 .../selftests/bpf/progs/trigger_bench.c       | 10 ++++-
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b..dc5121e49623 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -526,6 +526,7 @@ extern const struct bench bench_trig_uprobe_multi_push;
 extern const struct bench bench_trig_uretprobe_multi_push;
 extern const struct bench bench_trig_uprobe_multi_ret;
 extern const struct bench bench_trig_uretprobe_multi_ret;
+extern const struct bench bench_trig_usdt;
 
 extern const struct bench bench_rb_libbpf;
 extern const struct bench bench_rb_custom;
@@ -586,6 +587,7 @@ static const struct bench *benchs[] = {
 	&bench_trig_uretprobe_multi_push,
 	&bench_trig_uprobe_multi_ret,
 	&bench_trig_uretprobe_multi_ret,
+	&bench_trig_usdt,
 	/* ringbuf/perfbuf benchmarks */
 	&bench_rb_libbpf,
 	&bench_rb_custom,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 32e9f194d449..bdee8b8362d0 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -8,6 +8,7 @@
 #include "bench.h"
 #include "trigger_bench.skel.h"
 #include "trace_helpers.h"
+#include "../sdt.h"
 
 #define MAX_TRIG_BATCH_ITERS 1000
 
@@ -333,6 +334,13 @@ static void *uprobe_producer_ret(void *input)
 	return NULL;
 }
 
+static void *uprobe_producer_usdt(void *input)
+{
+	while (true)
+		STAP_PROBE(trigger, usdt);
+	return NULL;
+}
+
 static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
 {
 	size_t uprobe_offset;
@@ -383,6 +391,37 @@ static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
 	}
 }
 
+static void __usdt_setup(const char *provider, const char *name)
+{
+	struct bpf_link *link;
+	int err;
+
+	setup_libbpf();
+
+	ctx.skel = trigger_bench__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	bpf_program__set_autoload(ctx.skel->progs.bench_trigger_usdt, true);
+
+	err = trigger_bench__load(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
+	link = bpf_program__attach_usdt(ctx.skel->progs.bench_trigger_usdt,
+					-1 /* all PIDs */, "/proc/self/exe",
+					provider, name, NULL);
+	if (!link) {
+		fprintf(stderr, "failed to attach uprobe!\n");
+		exit(1);
+	}
+	ctx.skel->links.bench_trigger_usdt = link;
+}
+
 static void usermode_count_setup(void)
 {
 	ctx.usermode_counters = true;
@@ -448,6 +487,11 @@ static void uretprobe_multi_ret_setup(void)
 	usetup(true, true /* use_multi */, &uprobe_target_ret);
 }
 
+static void usdt_setup(void)
+{
+	__usdt_setup("trigger", "usdt");
+}
+
 const struct bench bench_trig_syscall_count = {
 	.name = "trig-syscall-count",
 	.validate = trigger_validate,
@@ -506,3 +550,4 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
 BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
 BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
 BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
+BENCH_TRIG_USERMODE(usdt, usdt, "usdt");
diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
index 044a6d78923e..7b7d4a71e7d4 100644
--- a/tools/testing/selftests/bpf/progs/trigger_bench.c
+++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
@@ -1,8 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2020 Facebook
-#include <linux/bpf.h>
+#include "vmlinux.h"
 #include <asm/unistd.h>
 #include <bpf/bpf_helpers.h>
+#include <bpf/usdt.bpf.h>
 #include <bpf/bpf_tracing.h>
 #include "bpf_misc.h"
 
@@ -138,3 +139,10 @@ int bench_trigger_rawtp(void *ctx)
 	inc_counter();
 	return 0;
 }
+
+SEC("?usdt")
+int bench_trigger_usdt(struct pt_regs *ctx)
+{
+	inc_counter();
+	return 0;
+}
-- 
2.47.0



* [RFC bpf-next 10/11] selftests/bpf: Add uprobe/usdt optimized test
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (8 preceding siblings ...)
  2024-11-05 13:34 ` [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-05 13:34 ` [RFC bpf-next 11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding tests for optimized uprobe/usdt probes.

Checking that we get the expected trampoline and that the attached bpf
programs get executed properly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../bpf/prog_tests/uprobe_optimized.c         | 192 ++++++++++++++++++
 .../selftests/bpf/progs/uprobe_optimized.c    |  29 +++
 2 files changed, 221 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c b/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
new file mode 100644
index 000000000000..f6eb4089b1e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+
+#ifdef __x86_64__
+
+#include "sdt.h"
+#include "uprobe_optimized.skel.h"
+
+#define TRAMP "[uprobes-trampoline]"
+
+__naked noinline void uprobe_test(void)
+{
+	asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"
+		      "ret\n\t");
+}
+
+static int find_uprobes_trampoline(void **start, void **end)
+{
+	char line[128];
+	int ret = -1;
+	FILE *maps;
+
+	maps = fopen("/proc/self/maps", "r");
+	if (!maps) {
+		fprintf(stderr, "cannot open maps\n");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), maps)) {
+		int m = -1;
+
+		/* We care only about private r-x mappings. */
+		if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", start, end, &m) != 2)
+			continue;
+		if (m < 0)
+			continue;
+		if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1)) {
+			ret = 0;
+			break;
+		}
+	}
+
+	fclose(maps);
+	return ret;
+}
+
+static void check_attach(struct uprobe_optimized *skel, void (*trigger)(void))
+{
+	void *tramp_start, *tramp_end;
+	struct __arch_relative_insn {
+		u8 op;
+		s32 raddr;
+	} __packed *call;
+
+	unsigned long delta;
+
+	/* Uprobe gets optimized after first trigger, so let's press twice. */
+	trigger();
+	trigger();
+
+	if (!ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline"))
+		return;
+
+	/* Make sure bpf program got executed.. */
+	ASSERT_EQ(skel->bss->executed, 2, "executed");
+
+	/* .. and check the trampoline is as expected. */
+	call = (struct __arch_relative_insn *) trigger;
+
+	delta = tramp_start > (void *) trigger ?
+		tramp_start - (void *) trigger :
+		(void *) trigger - tramp_start;
+
+	/* and minus call instruction size itself */
+	delta -= 5;
+
+	ASSERT_EQ(call->op, 0xe8, "call");
+	ASSERT_EQ(call->raddr, delta, "delta");
+	ASSERT_EQ(tramp_end - tramp_start, 4096, "size");
+}
+
+static void check_detach(struct uprobe_optimized *skel, void (*trigger)(void))
+{
+	unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+	void *tramp_start, *tramp_end;
+
+	/* [uprobes_trampoline] stays after detach */
+	ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline");
+	ASSERT_OK(memcmp(trigger, nop5, 5), "nop5");
+}
+
+static void check(struct uprobe_optimized *skel, struct bpf_link *link,
+		  void (*trigger)(void))
+{
+	check_attach(skel, trigger);
+	bpf_link__destroy(link);
+	check_detach(skel, uprobe_test);
+}
+
+static void test_uprobe(void)
+{
+	struct uprobe_optimized *skel;
+	unsigned long offset;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	offset = get_uprobe_offset(&uprobe_test);
+	if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+		goto cleanup;
+
+	skel->links.test_1 = bpf_program__attach_uprobe_opts(skel->progs.test_1,
+					0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(skel->links.test_1, "bpf_program__attach_uprobe_opts"))
+		goto cleanup;
+
+	check(skel, skel->links.test_1, uprobe_test);
+	skel->links.test_1 = NULL;
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
+
+static void test_uprobe_multi(void)
+{
+	struct uprobe_optimized *skel;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	skel->links.test_2 = bpf_program__attach_uprobe_multi(skel->progs.test_2,
+						0, "/proc/self/exe", "uprobe_test", NULL);
+	if (!ASSERT_OK_PTR(skel->links.test_2, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	check(skel, skel->links.test_2, uprobe_test);
+	skel->links.test_2 = NULL;
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
+
+__naked noinline void usdt_test(void)
+{
+	STAP_PROBE(optimized_uprobe, usdt);
+	asm volatile ("ret\n");
+}
+
+static void test_usdt(void)
+{
+	struct uprobe_optimized *skel;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	skel->links.test_3 = bpf_program__attach_usdt(skel->progs.test_3,
+						-1 /* all PIDs */, "/proc/self/exe",
+						"optimized_uprobe", "usdt", NULL);
+	if (!ASSERT_OK_PTR(skel->links.test_3, "bpf_program__attach_usdt"))
+		goto cleanup;
+
+	check(skel, skel->links.test_3, usdt_test);
+	skel->links.test_3 = NULL;
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
+
+static void test_optimized(void)
+{
+	if (test__start_subtest("uprobe"))
+		test_uprobe();
+	if (test__start_subtest("uprobe_multi"))
+		test_uprobe_multi();
+	if (test__start_subtest("usdt"))
+		test_usdt();
+}
+#else
+static void test_optimized(void)
+{
+	test__skip();
+}
+#endif /* __x86_64__ */
+
+void test_uprobe_optimized(void)
+{
+	test_optimized();
+}
diff --git a/tools/testing/selftests/bpf/progs/uprobe_optimized.c b/tools/testing/selftests/bpf/progs/uprobe_optimized.c
new file mode 100644
index 000000000000..7f29c968b7c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_optimized.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/usdt.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+int executed = 0;
+
+SEC("uprobe")
+int BPF_UPROBE(test_1)
+{
+	executed++;
+	return 0;
+}
+
+SEC("uprobe.multi")
+int BPF_UPROBE(test_2)
+{
+	executed++;
+	return 0;
+}
+
+SEC("usdt")
+int test_3(struct pt_regs *ctx)
+{
+	executed++;
+	return 0;
+}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [RFC bpf-next 11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (9 preceding siblings ...)
  2024-11-05 13:34 ` [RFC bpf-next 10/11] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
@ 2024-11-05 13:34 ` Jiri Olsa
  2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
  2024-11-18  8:04 ` Masami Hiramatsu
  12 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a test that makes sure parallel execution of the uprobe and
attach/detach of the optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../bpf/prog_tests/uprobe_optimized.c         | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c b/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
index f6eb4089b1e2..4b9a579c232d 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
@@ -170,6 +170,64 @@ static void test_usdt(void)
 	uprobe_optimized__destroy(skel);
 }
 
+static bool race_stop;
+
+static void *worker(void *arg)
+{
+	while (!race_stop)
+		uprobe_test();
+	return NULL;
+}
+
+static void test_race(void)
+{
+	int err, i, nr_cpus, rounds = 0;
+	struct uprobe_optimized *skel = NULL;
+	pthread_t *threads;
+	time_t start;
+
+	nr_cpus = libbpf_num_possible_cpus();
+	if (!ASSERT_GE(nr_cpus, 0, "nr_cpus"))
+		return;
+
+	threads = malloc(sizeof(*threads) * nr_cpus);
+	if (!ASSERT_OK_PTR(threads, "malloc"))
+		return;
+
+	for (i = 0; i < nr_cpus; i++) {
+		err = pthread_create(&threads[i], NULL, worker, NULL);
+		if (!ASSERT_OK(err, "pthread_create"))
+			goto cleanup;
+	}
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		goto cleanup;
+
+	start = time(NULL);
+	while (1) {
+		skel->links.test_2 = bpf_program__attach_uprobe_multi(skel->progs.test_2, -1,
+						"/proc/self/exe", "uprobe_test", NULL);
+		if (!ASSERT_OK_PTR(skel->links.test_2, "bpf_program__attach_uprobe_multi"))
+			break;
+
+		bpf_link__destroy(skel->links.test_2);
+		skel->links.test_2 = NULL;
+		rounds++;
+
+		if (start + 2 < time(NULL))
+			break;
+	}
+
+	printf("rounds: %d hits: %d\n", rounds, skel->bss->executed);
+
+cleanup:
+	race_stop = true;
+	for (i = 0; i < nr_cpus; i++)
+		pthread_join(threads[i], NULL);
+	uprobe_optimized__destroy(skel);
+	free(threads);
+}
+
 static void test_optimized(void)
 {
 	if (test__start_subtest("uprobe"))
@@ -178,6 +236,8 @@ static void test_optimized(void)
 		test_uprobe_multi();
 	if (test__start_subtest("usdt"))
 		test_usdt();
+	if (test__start_subtest("race"))
+		test_race();
 }
 #else
 static void test_optimized(void)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2024-11-05 14:23   ` Peter Zijlstra
  2024-11-05 16:33     ` Jiri Olsa
  2024-11-14 23:44   ` Andrii Nakryiko
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-05 14:23 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> Adding an interface to add a special mapping for a user space page that
> will be used as a placeholder for the uprobe trampoline in following
> changes.
> 
> The get_tramp_area(vaddr) function either finds a 'callable' page or
> creates a new one.  The 'callable' means it's reachable by a call
> instruction (from the vaddr argument) and is decided by each arch via the
> new arch_uprobe_is_callable function.
> 
> The put_tramp_area function either drops the refcount or destroys the
> special mapping; all the maps are cleaned up when the process goes down.

In another thread somewhere, Andrii mentioned that Meta has executables
with more than 4G of .text. This isn't going to work for them, is it?



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-05 14:23   ` Peter Zijlstra
@ 2024-11-05 16:33     ` Jiri Olsa
  2024-11-14 23:44       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-05 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > Adding an interface to add a special mapping for a user space page that
> > will be used as a placeholder for the uprobe trampoline in following
> > changes.
> > 
> > The get_tramp_area(vaddr) function either finds a 'callable' page or
> > creates a new one.  The 'callable' means it's reachable by a call
> > instruction (from the vaddr argument) and is decided by each arch via the
> > new arch_uprobe_is_callable function.
> > 
> > The put_tramp_area function either drops the refcount or destroys the
> > special mapping; all the maps are cleaned up when the process goes down.
> 
> In another thread somewhere, Andrii mentioned that Meta has executables
> with more than 4G of .text. This isn't going to work for them, is it?
> 

not if you can't reach the trampoline from the probed address

jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench
  2024-11-05 13:34 ` [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench Jiri Olsa
@ 2024-11-14 23:40   ` Andrii Nakryiko
  2024-11-16 21:45     ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding usdt trigger bench to measure optimized usdt probes.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/testing/selftests/bpf/bench.c           |  2 +
>  .../selftests/bpf/benchs/bench_trigger.c      | 45 +++++++++++++++++++
>  .../selftests/bpf/progs/trigger_bench.c       | 10 ++++-
>  3 files changed, 56 insertions(+), 1 deletion(-)
>

Why not just add a uprobe-nop5 benchmark instead of going all the way
to USDT? Seems simpler and would benchmark all the same stuff?
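
E.g. mirroring the existing uprobe_target_nop()/uprobe_producer_nop()
pair, something like this (just a sketch, untested):

__nocf_check __weak void uprobe_target_nop5(void)
{
	/* same 5-byte nop the usdt patches optimize */
	asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
}

static void *uprobe_producer_nop5(void *input)
{
	while (true)
		uprobe_target_nop5();
	return NULL;
}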

> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index 1bd403a5ef7b..dc5121e49623 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -526,6 +526,7 @@ extern const struct bench bench_trig_uprobe_multi_push;
>  extern const struct bench bench_trig_uretprobe_multi_push;
>  extern const struct bench bench_trig_uprobe_multi_ret;
>  extern const struct bench bench_trig_uretprobe_multi_ret;
> +extern const struct bench bench_trig_usdt;
>
>  extern const struct bench bench_rb_libbpf;
>  extern const struct bench bench_rb_custom;
> @@ -586,6 +587,7 @@ static const struct bench *benchs[] = {
>         &bench_trig_uretprobe_multi_push,
>         &bench_trig_uprobe_multi_ret,
>         &bench_trig_uretprobe_multi_ret,
> +       &bench_trig_usdt,
>         /* ringbuf/perfbuf benchmarks */
>         &bench_rb_libbpf,
>         &bench_rb_custom,
> diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> index 32e9f194d449..bdee8b8362d0 100644
> --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> @@ -8,6 +8,7 @@
>  #include "bench.h"
>  #include "trigger_bench.skel.h"
>  #include "trace_helpers.h"
> +#include "../sdt.h"
>
>  #define MAX_TRIG_BATCH_ITERS 1000
>
> @@ -333,6 +334,13 @@ static void *uprobe_producer_ret(void *input)
>         return NULL;
>  }
>
> +static void *uprobe_producer_usdt(void *input)
> +{
> +       while (true)
> +               STAP_PROBE(trigger, usdt);
> +       return NULL;
> +}
> +
>  static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
>  {
>         size_t uprobe_offset;
> @@ -383,6 +391,37 @@ static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
>         }
>  }
>
> +static void __usdt_setup(const char *provider, const char *name)
> +{
> +       struct bpf_link *link;
> +       int err;
> +
> +       setup_libbpf();
> +
> +       ctx.skel = trigger_bench__open();
> +       if (!ctx.skel) {
> +               fprintf(stderr, "failed to open skeleton\n");
> +               exit(1);
> +       }
> +
> +       bpf_program__set_autoload(ctx.skel->progs.bench_trigger_usdt, true);
> +
> +       err = trigger_bench__load(ctx.skel);
> +       if (err) {
> +               fprintf(stderr, "failed to load skeleton\n");
> +               exit(1);
> +       }
> +
> +       link = bpf_program__attach_usdt(ctx.skel->progs.bench_trigger_usdt,
> +                                       -1 /* all PIDs */, "/proc/self/exe",
> +                                       provider, name, NULL);
> +       if (!link) {
> +               fprintf(stderr, "failed to attach uprobe!\n");
> +               exit(1);
> +       }
> +       ctx.skel->links.bench_trigger_usdt = link;
> +}
> +
>  static void usermode_count_setup(void)
>  {
>         ctx.usermode_counters = true;
> @@ -448,6 +487,11 @@ static void uretprobe_multi_ret_setup(void)
>         usetup(true, true /* use_multi */, &uprobe_target_ret);
>  }
>
> +static void usdt_setup(void)
> +{
> +       __usdt_setup("trigger", "usdt");
> +}
> +
>  const struct bench bench_trig_syscall_count = {
>         .name = "trig-syscall-count",
>         .validate = trigger_validate,
> @@ -506,3 +550,4 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
>  BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
>  BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
>  BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
> +BENCH_TRIG_USERMODE(usdt, usdt, "usdt");
> diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> index 044a6d78923e..7b7d4a71e7d4 100644
> --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> @@ -1,8 +1,9 @@
>  // SPDX-License-Identifier: GPL-2.0
>  // Copyright (c) 2020 Facebook
> -#include <linux/bpf.h>
> +#include "vmlinux.h"
>  #include <asm/unistd.h>
>  #include <bpf/bpf_helpers.h>
> +#include <bpf/usdt.bpf.h>
>  #include <bpf/bpf_tracing.h>
>  #include "bpf_misc.h"
>
> @@ -138,3 +139,10 @@ int bench_trigger_rawtp(void *ctx)
>         inc_counter();
>         return 0;
>  }
> +
> +SEC("?usdt")
> +int bench_trigger_usdt(struct pt_regs *ctx)
> +{
> +       inc_counter();
> +       return 0;
> +}
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 02/11] uprobes: Make copy_from_page global
  2024-11-05 13:33 ` [RFC perf/core 02/11] uprobes: Make copy_from_page global Jiri Olsa
@ 2024-11-14 23:40   ` Andrii Nakryiko
  2024-11-16 21:41     ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Making copy_from_page global and adding uprobe prefix.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  1 +
>  kernel/events/uprobes.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 2f500bc97263..28068f9fcdc1 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uretprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
> +extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 0b04c051d712..e9308649bba3 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -250,7 +250,7 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
>         return is_swbp_insn(insn);
>  }
>
> -static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
> +void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
>  {
>         void *kaddr = kmap_atomic(page);
>         memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
> @@ -278,7 +278,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>          * is a trap variant; uprobes always wins over any other (gdb)
>          * breakpoint.
>          */
> -       copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
> +       uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
>         is_swbp = is_swbp_insn(&old_opcode);
>
>         if (is_swbp_insn(new_opcode)) {
> @@ -1027,7 +1027,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
>         if (IS_ERR(page))
>                 return PTR_ERR(page);
>
> -       copy_from_page(page, offset, insn, nbytes);
> +       uprobe_copy_from_page(page, offset, insn, nbytes);
>         put_page(page);
>
>         return 0;
> @@ -1368,7 +1368,7 @@ struct uprobe *uprobe_register(struct inode *inode,
>                 return ERR_PTR(-EINVAL);
>
>         /*
> -        * This ensures that copy_from_page(), copy_to_page() and
> +        * This ensures that uprobe_copy_from_page(), copy_to_page() and

rename copy_to_page() for symmetry?


>          * __update_ref_ctr() can't cross page boundary.
>          */
>         if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
> @@ -2288,7 +2288,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>         if (result < 0)
>                 return result;
>
> -       copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> +       uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>         put_page(page);
>   out:
>         /* This needs to return true for any variant of the trap insn */
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode
  2024-11-05 13:33 ` [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode Jiri Olsa
@ 2024-11-14 23:41   ` Andrii Nakryiko
  2024-11-16 21:41     ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding len argument to uprobe_write_opcode as preparation
> fo writing longer instructions in following changes.

typo: for

>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  3 ++-
>  kernel/events/uprobes.c | 14 ++++++++------
>  2 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 28068f9fcdc1..7d23a4fee6f4 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -181,7 +181,8 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
>  extern bool is_trap_insn(uprobe_opcode_t *insn);
>  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> -extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> +                              unsigned long vaddr, uprobe_opcode_t *insn, int len);

is len in sizeof(uprobe_opcode_t) units or in bytes? it would be good
to make this clearer

it feels like passing `void *` for insns would be better, tbh...
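
e.g. something like this (untested), with len then unambiguously in bytes:

extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
			       unsigned long vaddr, const void *insn, int len);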



>  extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
>  extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index e9308649bba3..3e275717789b 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -471,7 +471,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>   * Return 0 (success) or a negative errno.
>   */
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> -                       unsigned long vaddr, uprobe_opcode_t opcode)
> +                       unsigned long vaddr, uprobe_opcode_t *insn, int len)
>  {
>         struct uprobe *uprobe;
>         struct page *old_page, *new_page;
> @@ -480,7 +480,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         bool orig_page_huge = false;
>         unsigned int gup_flags = FOLL_FORCE;
>
> -       is_register = is_swbp_insn(&opcode);
> +       is_register = is_swbp_insn(insn);
>         uprobe = container_of(auprobe, struct uprobe, arch);
>
>  retry:
> @@ -491,7 +491,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         if (IS_ERR(old_page))
>                 return PTR_ERR(old_page);
>
> -       ret = verify_opcode(old_page, vaddr, &opcode);
> +       ret = verify_opcode(old_page, vaddr, insn);
>         if (ret <= 0)
>                 goto put_old;
>
> @@ -525,7 +525,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>
>         __SetPageUptodate(new_page);
>         copy_highpage(new_page, old_page);
> -       copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> +       copy_to_page(new_page, vaddr, insn, len);
>
>         if (!is_register) {
>                 struct page *orig_page;
> @@ -582,7 +582,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>   */
>  int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
>  {
> -       return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
> +       uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> +       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
>  }
>
>  /**
> @@ -598,7 +600,7 @@ int __weak
>  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
>  {
>         return uprobe_write_opcode(auprobe, mm, vaddr,
> -                       *(uprobe_opcode_t *)&auprobe->insn);
> +                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
>  }
>
>  /* uprobe should have guaranteed positive refcount */
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function
  2024-11-05 13:33 ` [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function Jiri Olsa
@ 2024-11-14 23:41   ` Andrii Nakryiko
  2024-11-16 21:43     ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding data argument to uprobe_write_opcode function and passing
> it to newly added arch overloaded functions:
>
>   arch_uprobe_verify_opcode
>   arch_uprobe_is_register
>
> This way each architecture can provide customized verification.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  6 +++++-
>  kernel/events/uprobes.c | 25 +++++++++++++++++++------
>  2 files changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 7d23a4fee6f4..be306028ed59 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -182,7 +182,7 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
>  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> -                              unsigned long vaddr, uprobe_opcode_t *insn, int len);
> +                              unsigned long vaddr, uprobe_opcode_t *insn, int len, void *data);
>  extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
>  extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> @@ -215,6 +215,10 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uretprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
>  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> +extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +                                    uprobe_opcode_t *new_opcode, void *data);
> +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 3e275717789b..944d9df1f081 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -264,7 +264,13 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
>         kunmap_atomic(kaddr);
>  }
>
> -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data)
> +{
> +       return is_swbp_insn(insn);
> +}
> +
> +int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +                        uprobe_opcode_t *new_opcode)
>  {
>         uprobe_opcode_t old_opcode;
>         bool is_swbp;
> @@ -292,6 +298,12 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>         return 1;
>  }
>
> +__weak int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +                                    uprobe_opcode_t *new_opcode, void *data)

why wrap the lines? even the original longer code was a single line


> +{
> +       return uprobe_verify_opcode(page, vaddr, new_opcode);
> +}
> +
>  static struct delayed_uprobe *
>  delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
>  {
> @@ -471,7 +483,8 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>   * Return 0 (success) or a negative errno.
>   */
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> -                       unsigned long vaddr, uprobe_opcode_t *insn, int len)
> +                       unsigned long vaddr, uprobe_opcode_t *insn, int len,
> +                       void *data)
>  {
>         struct uprobe *uprobe;
>         struct page *old_page, *new_page;
> @@ -480,7 +493,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         bool orig_page_huge = false;
>         unsigned int gup_flags = FOLL_FORCE;
>
> -       is_register = is_swbp_insn(insn);
> +       is_register = arch_uprobe_is_register(insn, len, data);
>         uprobe = container_of(auprobe, struct uprobe, arch);
>
>  retry:
> @@ -491,7 +504,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         if (IS_ERR(old_page))
>                 return PTR_ERR(old_page);
>
> -       ret = verify_opcode(old_page, vaddr, insn);
> +       ret = arch_uprobe_verify_opcode(old_page, vaddr, insn, data);
>         if (ret <= 0)
>                 goto put_old;
>
> @@ -584,7 +597,7 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
>  {
>         uprobe_opcode_t insn = UPROBE_SWBP_INSN;
>
> -       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
> +       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE, NULL);
>  }
>
>  /**
> @@ -600,7 +613,7 @@ int __weak
>  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
>  {
>         return uprobe_write_opcode(auprobe, mm, vaddr,
> -                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
> +                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
>  }
>
>  /* uprobe should have guaranteed positive refcount */
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-05 16:33     ` Jiri Olsa
@ 2024-11-14 23:44       ` Andrii Nakryiko
  2024-11-16 21:44         ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > Adding an interface to add a special mapping for a user space page that
> > > will be used as a placeholder for the uprobe trampoline in following
> > > changes.
> > >
> > > The get_tramp_area(vaddr) function either finds a 'callable' page or
> > > creates a new one.  The 'callable' means it's reachable by a call
> > > instruction (from the vaddr argument) and is decided by each arch via the
> > > new arch_uprobe_is_callable function.
> > >
> > > The put_tramp_area function either drops the refcount or destroys the
> > > special mapping; all the maps are cleaned up when the process goes down.
> >
> > In another thread somewhere, Andrii mentioned that Meta has executables
> > with more than 4G of .text. This isn't going to work for them, is it?
> >
>
> not if you can't reach the trampoline from the probed address

That specific example was about 1.5GB (though we might have bigger
.text, I didn't do exhaustive research). As Jiri said, this would be
a best-effort attempt to find the closest free mapping and stay within a
+/-2GB offset. If that fails, we would always fall back to the slower
int3-based uprobing, yep.

Jiri, we could also have an option to support 64-bit call, right? We'd
need nop9 for that, but it's an option as well for future-proofing this
approach, no?

Also, can we somehow use fs/gs-based indirect calls/jumps to
have a guarantee that offset is always small (<2GB away relative to
the base stored in fs/gs). Not sure if this is feasible, but I thought
it would be good to bring this up, even if just to rule it out.

If segment based absolute call is somehow feasible, we can probably
simplify a bunch of stuff by allocating it eagerly, once, and
somewhere high up next to VDSO (or maybe even put it into VDSO, don't
know).

Anyways, let's brainstorm if there are any clever alternatives here.


>
> jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
  2024-11-05 14:23   ` Peter Zijlstra
@ 2024-11-14 23:44   ` Andrii Nakryiko
  2024-11-16 21:44     ` Jiri Olsa
  1 sibling, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding an interface to add a special mapping for a user space page that
> will be used as a placeholder for the uprobe trampoline in following
> changes.
> 
> The get_tramp_area(vaddr) function either finds a 'callable' page or
> creates a new one.  The 'callable' means it's reachable by a call
> instruction (from the vaddr argument) and is decided by each arch via the
> new arch_uprobe_is_callable function.
> 
> The put_tramp_area function either drops the refcount or destroys the
> special mapping; all the maps are cleaned up when the process goes down.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  12 ++++
>  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c           |   2 +
>  3 files changed, 155 insertions(+)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index be306028ed59..222d8e82cee2 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -172,6 +172,15 @@ struct xol_area;
>
>  struct uprobes_state {
>         struct xol_area         *xol_area;
> +       struct hlist_head       tramp_head;
> +       struct mutex            tramp_mutex;
> +};
> +
> +struct tramp_area {
> +       unsigned long           vaddr;
> +       struct page             *page;
> +       struct hlist_node       node;
> +       refcount_t              ref;

nit: any reason we are unnecessarily trying to save 4 bytes on
refcount (and we don't actually, due to padding)

>  };
>
>  extern void __init uprobes_init(void);
> @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
>  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
>                                      uprobe_opcode_t *new_opcode, void *data);
>  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> +struct tramp_area *get_tramp_area(unsigned long vaddr);

uprobe_get_tramp_area() to make it clear this is uprobe specific,
given this is an exposed function?

and add that extern like we do for other functions?

> +void put_tramp_area(struct tramp_area *area);

uprobe_put_tramp_area() ?
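
i.e. something like:

extern struct tramp_area *uprobe_get_tramp_area(unsigned long vaddr);
extern void uprobe_put_tramp_area(struct tramp_area *area);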

> +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 944d9df1f081..a44305c559a4 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
>  }
>
> +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> +{
> +       return false;
> +}
> +
> +static unsigned long find_nearest_page(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma, *prev;
> +       VMA_ITERATOR(vmi, mm, 0);
> +
> +       prev = vma_next(&vmi);
> +       vma = vma_next(&vmi);
> +       while (vma) {
> +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> +                       return prev->vm_end;

shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
as two possible places?
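e.g. (untested):

		if (vma->vm_start - prev->vm_end >= PAGE_SIZE) {
			if (arch_uprobe_is_callable(prev->vm_end, vaddr))
				return prev->vm_end;
			if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
				return vma->vm_start - PAGE_SIZE;
		}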

> +
> +               prev = vma;
> +               vma = vma_next(&vmi);
> +       }
> +
> +       return 0;
> +}
> +
> +static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
> +                             struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
> +       struct tramp_area *area;
> +
> +       hlist_for_each_entry(area, head, node) {
> +               if (vma->vm_start == area->vaddr) {
> +                       vmf->page = area->page;
> +                       get_page(vmf->page);
> +                       return 0;
> +               }
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> +{
> +       return -EPERM;
> +}
> +
> +static const struct vm_special_mapping tramp_mapping = {
> +       .name = "[uprobes-trampoline]",
> +       .fault = tramp_fault,
> +       .mremap = tramp_mremap,
> +};
> +
> +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma;
> +       struct tramp_area *area;
> +
> +       vaddr = find_nearest_page(vaddr);
> +       if (!vaddr)
> +               return NULL;
> +
> +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> +       if (unlikely(!area))
> +               return NULL;
> +
> +       area->page = alloc_page(GFP_HIGHUSER);
> +       if (!area->page)
> +               goto free_area;
> +
> +       refcount_set(&area->ref, 1);
> +       area->vaddr = vaddr;
> +
> +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> +                               &tramp_mapping);
> +       if (!IS_ERR(vma))
> +               return area;

please keep the usual pattern, it's less surprising that way:

    if (IS_ERR(vma))
        goto free_page;

    return area;

free_page:

> +
> +       __free_page(area->page);
> + free_area:
> +       kfree(area);
> +       return NULL;
> +}
> +
> +struct tramp_area *get_tramp_area(unsigned long vaddr)
> +{
> +       struct uprobes_state *state = &current->mm->uprobes_state;
> +       struct tramp_area *area = NULL;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       hlist_for_each_entry(area, &state->tramp_head, node) {
> +               if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
> +                       refcount_inc(&area->ref);
> +                       goto unlock;
> +               }
> +       }
> +
> +       area = create_tramp_area(vaddr);
> +       if (area)
> +               hlist_add_head(&area->node, &state->tramp_head);
> +
> +unlock:
> +       mutex_unlock(&state->tramp_mutex);
> +       return area;
> +}
> +
> +static void destroy_tramp_area(struct tramp_area *area)
> +{
> +       hlist_del(&area->node);
> +       put_page(area->page);
> +       kfree(area);
> +}
> +
> +void put_tramp_area(struct tramp_area *area)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct uprobes_state *state = &mm->uprobes_state;
> +
> +       if (area == NULL)

nit: !area

> +               return;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       if (refcount_dec_and_test(&area->ref))
> +               destroy_tramp_area(area);
> +       mutex_unlock(&state->tramp_mutex);
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
  2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2024-11-14 23:44   ` Andrii Nakryiko
  2024-11-16 21:44     ` Jiri Olsa
  2024-11-18  8:18   ` Masami Hiramatsu
  1 sibling, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-14 23:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
>
> The current uprobe execution goes through the following:
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
>
> The optimized uprobe path
>
>   - checks the original instruction is 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline and overwrites original
>     instruction (5-byte nop) with call to user space trampoline
>   - the user space trampoline executes uprobe syscall that calls related uprobe
>     consumers
>   - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> the original instruction, but we could use nop5 as the USDT probe instruction
> (which uses a single byte nop ATM) and speed up the USDT probes.

As discussed offline, it's not as simple as just replacing nop1 with
nop5 in USDT definition due to performance considerations on old
kernels (nop5 isn't fast as far as uprobe is concerned), but I think
we'll be able to accommodate transparent "nop1 or nop5" behavior in
user space, we'll just need a careful backwards compatible extension
to USDT definition.

BTW, do you plan to send an optimization for nop5 to avoid
single-stepping them? Ideally we'd just handle any-sized nop, so we
don't have to do this "nop1 or nop5" acrobatics in the future.
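
E.g. something like this (untested, and assuming x86_nops[]/ASM_NOP_MAX
from <asm/nops.h> are fine to use here):

static int is_nop_insn(uprobe_opcode_t *insn, int len)
{
	/* match any canonical x86 nop of the given length */
	return len >= 1 && len <= ASM_NOP_MAX &&
	       !memcmp(insn, x86_nops[len], len);
}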

>
> This patch overloads related arch functions in uprobe_write_opcode and
> set_orig_insn so they can install call instruction if needed.
>
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> the first uprobe hit. I originally had it called on uprobe installation but
> then it clashed with the elf loader, because the user space trampoline was
> added in a place where the loader might need to put elf segments, so I decided
> to do it after the first uprobe hit when loading is done.

fun... ideally we wouldn't do this lazily, I just came up with another
possible idea, but let's keep all this discussion to another thread
with Peter

>
> TODO release uprobe trampoline when it's no longer needed.. we might need to
> stop all cpus to make sure no user space thread is in the trampoline.. or we
> might just keep it, because there's just one 4GB memory region?

4KB not 4GB, right? Yeah, probably leaving it until the process exits is
totally fine.

>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/include/asm/uprobes.h |   7 ++
>  arch/x86/kernel/uprobes.c      | 130 +++++++++++++++++++++++++++++++++
>  include/linux/uprobes.h        |   1 +
>  kernel/events/uprobes.c        |   3 +
>  4 files changed, 141 insertions(+)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..84a75ed748f0 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
>  #define UPROBE_SWBP_INSN               0xcc
>  #define UPROBE_SWBP_INSN_SIZE             1
>
> +enum {
> +       ARCH_UPROBE_FLAG_CAN_OPTIMIZE   = 0,
> +       ARCH_UPROBE_FLAG_OPTIMIZED      = 1,
> +};
> +
>  struct uprobe_xol_ops;
>
>  struct arch_uprobe {
> @@ -45,6 +50,8 @@ struct arch_uprobe {
>                         u8      ilen;
>                 }                       push;
>         };
> +
> +       unsigned long flags;
>  };
>
>  struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 02aa4519b677..50ccf24ff42c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
>  #include <asm/processor.h>
>  #include <asm/insn.h>
>  #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>
>  /* Post-execution fixups. */
>
> @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
>         .emulate  = push_emulate_op,
>  };
>
> +static int is_nop5_insns(uprobe_opcode_t *insn)

insns -> insn?

> +{
> +       return !memcmp(insn, x86_nops[5], 5);
> +}
> +
> +static int is_call_insns(uprobe_opcode_t *insn)

ditto, insn, singular?

> +{
> +       return *insn == 0xe8;
> +}
> +
> +static void relative_insn(void *dest, void *from, void *to, u8 op)
> +{
> +       struct __arch_relative_insn {
> +               u8 op;
> +               s32 raddr;
> +       } __packed *insn;
> +
> +       insn = (struct __arch_relative_insn *)dest;
> +       insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
> +       insn->op = op;
> +}
> +
> +static void relative_call(void *dest, void *from, void *to)
> +{
> +       relative_insn(dest, from, to, CALL_INSN_OPCODE);
> +}
> +
>  /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
>  static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
>  {
> @@ -896,6 +924,10 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
>                 break;
>
>         case 0x0f:
> +               if (is_nop5_insns((uprobe_opcode_t *) &auprobe->insn)) {
> +                       set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +                       break;
> +               }
>                 if (insn->opcode.nbytes != 2)
>                         return -ENOSYS;
>                 /*
> @@ -1267,3 +1299,101 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
>         else
>                 return regs->sp <= ret->stack;
>  }
> +
> +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +                             uprobe_opcode_t *new_opcode, void *opt)
> +{
> +       if (opt) {
> +               uprobe_opcode_t old_opcode[5];
> +               bool is_call;
> +
> +               uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> +               is_call = is_call_insns((uprobe_opcode_t *) &old_opcode);
> +
> +               if (is_call_insns(new_opcode)) {
> +                       if (is_call)            /* register: already installed? */

probably should check that the destination of the call instruction is
what we expect?
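E.g. something like this (untested, and assuming the trampoline address
would be passed in through the data/opt cookie instead of the current
(void *) 1):

		if (is_call) {
			unsigned long tramp = (unsigned long) opt;
			s32 raddr;

			/* call displacement is relative to the next insn */
			memcpy(&raddr, &old_opcode[1], sizeof(raddr));
			if (vaddr + 5 + raddr != tramp)
				return -EINVAL; /* a call, but not to our trampoline */
		}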

> +                               return 0;
> +               } else {
> +                       if (!is_call)           /* unregister: was it changed by us? */
> +                               return 0;
> +               }
> +
> +               return 1;
> +       }
> +
> +       return uprobe_verify_opcode(page, vaddr, new_opcode);
> +}

[...]

> +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> +{
> +       uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn;
> +
> +       if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
> +               return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5, (void *) 1);
> +
> +       return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE, NULL);
> +}
> +
> +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> +{
> +       unsigned long delta;
> +
> +       /* call instructions size */
> +       vaddr += 5;
> +       delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp;
> +       return delta < 0xffffffff;

isn't the immediate a sign-extended 32-bit value (that is, int)? wouldn't
this work and be correct:

long delta = (long)(vaddr + 5 - vtramp);
return delta >= INT_MIN && delta <= INT_MAX;

?


> +}
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 4024e6ea52a4..42ab29f80220 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -233,6 +233,7 @@ void put_tramp_area(struct tramp_area *area);
>  bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
>  extern void *arch_uprobe_trampoline(unsigned long *psize);
>  extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
> +extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index b8399684231c..efe45fcd5d0a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2759,6 +2759,9 @@ static void handle_swbp(struct pt_regs *regs)
>
>         handler_chain(uprobe, regs);
>
> +       /* Try to optimize after first hit. */
> +       arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
> +
>         if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>                 goto out;
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 02/11] uprobes: Make copy_from_page global
  2024-11-14 23:40   ` Andrii Nakryiko
@ 2024-11-16 21:41     ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:40:58PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Making copy_from_page global and adding uprobe prefix.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  1 +
> >  kernel/events/uprobes.c | 10 +++++-----
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 2f500bc97263..28068f9fcdc1 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> >  extern void uprobe_handle_trampoline(struct pt_regs *regs);
> >  extern void *arch_uretprobe_trampoline(unsigned long *psize);
> >  extern unsigned long uprobe_get_trampoline_vaddr(void);
> > +extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 0b04c051d712..e9308649bba3 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -250,7 +250,7 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
> >         return is_swbp_insn(insn);
> >  }
> >
> > -static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
> > +void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
> >  {
> >         void *kaddr = kmap_atomic(page);
> >         memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
> > @@ -278,7 +278,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
> >          * is a trap variant; uprobes always wins over any other (gdb)
> >          * breakpoint.
> >          */
> > -       copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
> > +       uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
> >         is_swbp = is_swbp_insn(&old_opcode);
> >
> >         if (is_swbp_insn(new_opcode)) {
> > @@ -1027,7 +1027,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
> >         if (IS_ERR(page))
> >                 return PTR_ERR(page);
> >
> > -       copy_from_page(page, offset, insn, nbytes);
> > +       uprobe_copy_from_page(page, offset, insn, nbytes);
> >         put_page(page);
> >
> >         return 0;
> > @@ -1368,7 +1368,7 @@ struct uprobe *uprobe_register(struct inode *inode,
> >                 return ERR_PTR(-EINVAL);
> >
> >         /*
> > -        * This ensures that copy_from_page(), copy_to_page() and
> > +        * This ensures that uprobe_copy_from_page(), copy_to_page() and
> 
> rename copy_to_page() for symmetry?

ok

jirka

> 
> 
> >          * __update_ref_ctr() can't cross page boundary.
> >          */
> >         if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
> > @@ -2288,7 +2288,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
> >         if (result < 0)
> >                 return result;
> >
> > -       copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> > +       uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> >         put_page(page);
> >   out:
> >         /* This needs to return true for any variant of the trap insn */
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode
  2024-11-14 23:41   ` Andrii Nakryiko
@ 2024-11-16 21:41     ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:41:03PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding len argument to uprobe_write_opcode as preparation
> > fo writing longer instructions in following changes.
> 
> typo: for
> 
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  3 ++-
> >  kernel/events/uprobes.c | 14 ++++++++------
> >  2 files changed, 10 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 28068f9fcdc1..7d23a4fee6f4 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -181,7 +181,8 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
> >  extern bool is_trap_insn(uprobe_opcode_t *insn);
> >  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
> >  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> > -extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> > +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > +                              unsigned long vaddr, uprobe_opcode_t *insn, int len);
> 
> is len in sizeof(uprobe_opcode_t) units or in bytes? it would be good
> to make this clearer
> 
> it feels like passing `void *` for insns would be better, tbh...

good catch, it's meant as bytes, uprobe_opcode_t is u8 on x86, but other archs
treat it differently, void * might solve that in this function, will check

thanks,
jirka

> 
> 
> 
> >  extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> >  extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> >  extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index e9308649bba3..3e275717789b 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -471,7 +471,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> >   * Return 0 (success) or a negative errno.
> >   */
> >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > -                       unsigned long vaddr, uprobe_opcode_t opcode)
> > +                       unsigned long vaddr, uprobe_opcode_t *insn, int len)
> >  {
> >         struct uprobe *uprobe;
> >         struct page *old_page, *new_page;
> > @@ -480,7 +480,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         bool orig_page_huge = false;
> >         unsigned int gup_flags = FOLL_FORCE;
> >
> > -       is_register = is_swbp_insn(&opcode);
> > +       is_register = is_swbp_insn(insn);
> >         uprobe = container_of(auprobe, struct uprobe, arch);
> >
> >  retry:
> > @@ -491,7 +491,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         if (IS_ERR(old_page))
> >                 return PTR_ERR(old_page);
> >
> > -       ret = verify_opcode(old_page, vaddr, &opcode);
> > +       ret = verify_opcode(old_page, vaddr, insn);
> >         if (ret <= 0)
> >                 goto put_old;
> >
> > @@ -525,7 +525,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >
> >         __SetPageUptodate(new_page);
> >         copy_highpage(new_page, old_page);
> > -       copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> > +       copy_to_page(new_page, vaddr, insn, len);
> >
> >         if (!is_register) {
> >                 struct page *orig_page;
> > @@ -582,7 +582,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >   */
> >  int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> >  {
> > -       return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
> > +       uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> > +
> > +       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
> >  }
> >
> >  /**
> > @@ -598,7 +600,7 @@ int __weak
> >  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> >  {
> >         return uprobe_write_opcode(auprobe, mm, vaddr,
> > -                       *(uprobe_opcode_t *)&auprobe->insn);
> > +                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
> >  }
> >
> >  /* uprobe should have guaranteed positive refcount */
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function
  2024-11-14 23:41   ` Andrii Nakryiko
@ 2024-11-16 21:43     ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:43 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:41:08PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding data argument to uprobe_write_opcode function and passing
> > it to newly added arch overloaded functions:
> >
> >   arch_uprobe_verify_opcode
> >   arch_uprobe_is_register
> >
> > This way each architecture can provide customized verification.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  6 +++++-
> >  kernel/events/uprobes.c | 25 +++++++++++++++++++------
> >  2 files changed, 24 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 7d23a4fee6f4..be306028ed59 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -182,7 +182,7 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
> >  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
> >  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> >  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > -                              unsigned long vaddr, uprobe_opcode_t *insn, int len);
> > +                              unsigned long vaddr, uprobe_opcode_t *insn, int len, void *data);
> >  extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> >  extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> >  extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> > @@ -215,6 +215,10 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
> >  extern void *arch_uretprobe_trampoline(unsigned long *psize);
> >  extern unsigned long uprobe_get_trampoline_vaddr(void);
> >  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> > +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> > +extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                                    uprobe_opcode_t *new_opcode, void *data);
> > +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 3e275717789b..944d9df1f081 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -264,7 +264,13 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
> >         kunmap_atomic(kaddr);
> >  }
> >
> > -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> > +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data)
> > +{
> > +       return is_swbp_insn(insn);
> > +}
> > +
> > +int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                        uprobe_opcode_t *new_opcode)
> >  {
> >         uprobe_opcode_t old_opcode;
> >         bool is_swbp;
> > @@ -292,6 +298,12 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
> >         return 1;
> >  }
> >
> > +__weak int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                                    uprobe_opcode_t *new_opcode, void *data)
> 
> why wrapping lines? even original longer code was single line

hm, adding 'uprobe_opcode_t *new_opcode' would make the line over 100 chars,
but right, surrounding code is not strict ;-) ok

jirka

> 
> 
> > +{
> > +       return uprobe_verify_opcode(page, vaddr, new_opcode);
> > +}
> > +
> >  static struct delayed_uprobe *
> >  delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
> >  {
> > @@ -471,7 +483,8 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> >   * Return 0 (success) or a negative errno.
> >   */
> >  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > -                       unsigned long vaddr, uprobe_opcode_t *insn, int len)
> > +                       unsigned long vaddr, uprobe_opcode_t *insn, int len,
> > +                       void *data)
> >  {
> >         struct uprobe *uprobe;
> >         struct page *old_page, *new_page;
> > @@ -480,7 +493,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         bool orig_page_huge = false;
> >         unsigned int gup_flags = FOLL_FORCE;
> >
> > -       is_register = is_swbp_insn(insn);
> > +       is_register = arch_uprobe_is_register(insn, len, data);
> >         uprobe = container_of(auprobe, struct uprobe, arch);
> >
> >  retry:
> > @@ -491,7 +504,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         if (IS_ERR(old_page))
> >                 return PTR_ERR(old_page);
> >
> > -       ret = verify_opcode(old_page, vaddr, insn);
> > +       ret = arch_uprobe_verify_opcode(old_page, vaddr, insn, data);
> >         if (ret <= 0)
> >                 goto put_old;
> >
> > @@ -584,7 +597,7 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
> >  {
> >         uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> >
> > -       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
> > +       return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE, NULL);
> >  }
> >
> >  /**
> > @@ -600,7 +613,7 @@ int __weak
> >  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> >  {
> >         return uprobe_write_opcode(auprobe, mm, vaddr,
> > -                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
> > +                       (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> >  }
> >
> >  /* uprobe should have guaranteed positive refcount */
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-14 23:44       ` Andrii Nakryiko
@ 2024-11-16 21:44         ` Jiri Olsa
  2024-11-19  6:06           ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Peter Zijlstra, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > Adding interface to add special mapping for user space page that will be
> > > > used as place holder for uprobe trampoline in following changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds a 'callable' page or creates
> > > > a new one.  'Callable' means the page is reachable by a call instruction
> > > > (from the vaddr argument), as decided by each arch via the new
> > > > arch_uprobe_is_callable function.
> > > >
> > > > The put_tramp_area function either drops the refcount or destroys the special
> > > > mapping; all the mappings are cleaned up when the process goes down.
> > >
> > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > with more than 4G of .text. This isn't going to work for them, is it?
> > >
> >
> > not if you can't reach the trampoline from the probed address
> 
> That specific example was about 1.5GB (though we might have bigger
> .text, I didn't do exhaustive research). As Jiri said, this would be
> best effort trying to find closest free mapping to stay within +/-2GB
> offset. If that fails, we always would be falling back to slower
> int3-based uprobing, yep.
> 
> Jiri, we could also have an option to support 64-bit call, right? We'd
> need nop9 for that, but it's an option as well to future-proofing this
> approach, no?

hm, I don't think there's a call with a relative 64-bit offset

there's an indirect call through a register or through an address.. but I think
we would fit in nop10 with the indirect call through an address
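
for the record, the encoding sizes behind that (rough sketch only, not
from the patchset):

	/*
	 * ff 15 <disp32>	callq *disp32(%rip)	 6 bytes
	 *   - fits in nop10 with padding; the 8-byte target pointer
	 *     lives out of line
	 *
	 * 48 b8 <imm64>	movabs $imm64, %rax	10 bytes
	 * ff d0		callq  *%rax		 2 bytes
	 *   - 12 bytes total, too big for nop10
	 */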

> 
> Also, can we somehow use fs/gs-based indirect calls/jumps to
> guarantee that the offset is always small (<2GB away relative to
> the base stored in fs/gs)? Not sure if this is feasible, but I thought
> it would be good to bring this up so we can at least rule it out.
> 
> If segment based absolute call is somehow feasible, we can probably
> simplify a bunch of stuff by allocating it eagerly, once, and
> somewhere high up next to VDSO (or maybe even put it into VDSO, don't
> now).

yes, that would be convenient

jirka

> 
> Anyways, let's brainstorm if there are any clever alternatives here.
> 
> 
> >
> > jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-14 23:44   ` Andrii Nakryiko
@ 2024-11-16 21:44     ` Jiri Olsa
  2024-11-19  6:05       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding interface to add special mapping for user space page that will be
> > used as place holder for uprobe trampoline in following changes.
> >
> > The get_tramp_area(vaddr) function either finds a 'callable' page or creates
> > a new one.  'Callable' means the page is reachable by a call instruction
> > (from the vaddr argument), as decided by each arch via the new
> > arch_uprobe_is_callable function.
> >
> > The put_tramp_area function either drops the refcount or destroys the special
> > mapping; all the mappings are cleaned up when the process goes down.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  12 ++++
> >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/fork.c           |   2 +
> >  3 files changed, 155 insertions(+)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index be306028ed59..222d8e82cee2 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -172,6 +172,15 @@ struct xol_area;
> >
> >  struct uprobes_state {
> >         struct xol_area         *xol_area;
> > +       struct hlist_head       tramp_head;
> > +       struct mutex            tramp_mutex;
> > +};
> > +
> > +struct tramp_area {
> > +       unsigned long           vaddr;
> > +       struct page             *page;
> > +       struct hlist_node       node;
> > +       refcount_t              ref;
> 
> nit: any reason we are unnecessarily trying to save 4 bytes on
> refcount (and we don't actually, due to padding)

hum, I'm not sure what you mean.. what's the alternative?

> 
> >  };
> >
> >  extern void __init uprobes_init(void);
> > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> >                                      uprobe_opcode_t *new_opcode, void *data);
> >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> 
> uprobe_get_tramp_area() to make it clear this is uprobe specific,
> given this is exposed function?
> 
> and add that extern like we do for other functions?

ok

> 
> > +void put_tramp_area(struct tramp_area *area);
> 
> uprobe_put_tramp_area() ?

ok

> 
> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 944d9df1f081..a44305c559a4 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> >  }
> >
> > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +       return false;
> > +}
> > +
> > +static unsigned long find_nearest_page(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma, *prev;
> > +       VMA_ITERATOR(vmi, mm, 0);
> > +
> > +       prev = vma_next(&vmi);
> > +       vma = vma_next(&vmi);
> > +       while (vma) {
> > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > +                       return prev->vm_end;
> 
> shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> as two possible places

right, we should do that
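
something along these lines perhaps (rough sketch of the fixed loop,
checking both ends of each gap):

	static unsigned long find_nearest_page(unsigned long vaddr)
	{
		struct mm_struct *mm = current->mm;
		struct vm_area_struct *vma, *prev = NULL;
		VMA_ITERATOR(vmi, mm, 0);

		for_each_vma(vmi, vma) {
			if (prev && vma->vm_start - prev->vm_end >= PAGE_SIZE) {
				/* gap start: page right after the previous vma */
				if (arch_uprobe_is_callable(prev->vm_end, vaddr))
					return prev->vm_end;
				/* gap end: page right before the next vma */
				if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
					return vma->vm_start - PAGE_SIZE;
			}
			prev = vma;
		}
		return 0;
	}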

SNIP

> > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma;
> > +       struct tramp_area *area;
> > +
> > +       vaddr = find_nearest_page(vaddr);
> > +       if (!vaddr)
> > +               return NULL;
> > +
> > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +       if (unlikely(!area))
> > +               return NULL;
> > +
> > +       area->page = alloc_page(GFP_HIGHUSER);
> > +       if (!area->page)
> > +               goto free_area;
> > +
> > +       refcount_set(&area->ref, 1);
> > +       area->vaddr = vaddr;
> > +
> > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > +                               &tramp_mapping);
> > +       if (!IS_ERR(vma))
> > +               return area;
> 
> please keep a pattern, it's less surprising that way
> 
>     if (IS_ERR(vma))
>         goto free_page;
> 
>     return area;
> 
> free_page:

ok


SNIP

> > +static void destroy_tramp_area(struct tramp_area *area)
> > +{
> > +       hlist_del(&area->node);
> > +       put_page(area->page);
> > +       kfree(area);
> > +}
> > +
> > +void put_tramp_area(struct tramp_area *area)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct uprobes_state *state = &mm->uprobes_state;
> > +
> > +       if (area == NULL)
> 
> nit: !area

ok

thanks,
jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
  2024-11-14 23:44   ` Andrii Nakryiko
@ 2024-11-16 21:44     ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:44:20PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> >
> > The current uprobe execution goes through following:
> >   - installs breakpoint instruction over original instruction
> >   - exception handler hit and calls related uprobe consumers
> >   - and either simulates original instruction or does out of line single step
> >     execution of it
> >   - returns to user space
> >
> > The optimized uprobe path
> >
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline and overwrites original
> >     instruction (5-byte nop) with call to user space trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> >
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we could use nop5 as USDT probe instruction (which
> > uses single byte nop ATM) and speed up the USDT probes.
> 
> As discussed offline, it's not as simple as just replacing nop1 with
> nop5 in USDT definition due to performance considerations on old
> kernels (nop5 isn't fast as far as uprobe is concerned), but I think
> we'll be able to accommodate transparent "nop1 or nop5" behavior in
> user space, we'll just need a careful backwards compatible extension
> to USDT definition.
> 
> BTW, do you plan to send an optimization for nop5 to avoid
> single-stepping them? Ideally we'd just handle any-sized nop, so we
> don't have to do this "nop1 or nop5" acrobatics in the future.

I'll prepare that, but I'd like to check on the alternative calls
you suggested first

> 
> >
> > This patch overloads related arch functions in uprobe_write_opcode and
> > set_orig_insn so they can install call instruction if needed.
> >
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> 
> fun... ideally we wouldn't do this lazily, I just came up with another
> possible idea, but let's keep all this discussion to another thread
> with Peter
> 
> >
> > TODO release uprobe trampoline when it's no longer needed.. we might need to
> > stop all cpus to make sure no user space thread is in the trampoline.. or we
> > might just keep it, because there's just one 4GB memory region?
> 
> 4KB not 4GB, right? Yeah, probably leaving it until the process exits is
> totally fine.

yep, ok

SNIP

> > +#include <asm/nops.h>
> >
> >  /* Post-execution fixups. */
> >
> > @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
> >         .emulate  = push_emulate_op,
> >  };
> >
> > +static int is_nop5_insns(uprobe_opcode_t *insn)
> 
> insns -> insn?
> 
> > +{
> > +       return !memcmp(insn, x86_nops[5], 5);
> > +}
> > +
> > +static int is_call_insns(uprobe_opcode_t *insn)
> 
> ditto, insn, singular?

ok

SNIP

> > +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                             uprobe_opcode_t *new_opcode, void *opt)
> > +{
> > +       if (opt) {
> > +               uprobe_opcode_t old_opcode[5];
> > +               bool is_call;
> > +
> > +               uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> > +               is_call = is_call_insns((uprobe_opcode_t *) &old_opcode);
> > +
> > +               if (is_call_insns(new_opcode)) {
> > +                       if (is_call)            /* register: already installed? */
> 
> probably should check that the destination of the call instruction is
> what we expect?

ok
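
something like this perhaps (hypothetical helper, the name is made up;
compares the rel32 destination with the expected trampoline address):

	static bool is_optimized_call(uprobe_opcode_t *insn, unsigned long vaddr,
				      unsigned long vtramp)
	{
		s32 rel;

		if (!is_call_insns(insn))
			return false;
		/* rel32 is relative to the end of the 5-byte call */
		memcpy(&rel, insn + 1, sizeof(rel));
		return vaddr + 5 + rel == vtramp;
	}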

SNIP

> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +       unsigned long delta;
> > +
> > +       /* call instructions size */
> > +       vaddr += 5;
> > +       delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp;
> > +       return delta < 0xffffffff;
> 
> isn't immediate a sign extended 32-bit value (that is, int)? wouldn't
> this work and be correct:
> 
> long delta = (long)(vaddr + 5 - vtramp);
> return delta >= INT_MIN && delta <= INT_MAX;
> 
> ?

ah, right.. should be a signed value :-\ thanks
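
i.e. something like this (sketch with the signed check applied):

	bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
	{
		/* rel32 displacement is taken from the end of the 5-byte call */
		long delta = (long)vtramp - (long)(vaddr + 5);

		return delta >= INT_MIN && delta <= INT_MAX;
	}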

jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench
  2024-11-14 23:40   ` Andrii Nakryiko
@ 2024-11-16 21:45     ` Jiri Olsa
  2024-11-19  6:08       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-16 21:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Nov 14, 2024 at 03:40:53PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding a usdt trigger bench to measure optimized usdt probes.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  tools/testing/selftests/bpf/bench.c           |  2 +
> >  .../selftests/bpf/benchs/bench_trigger.c      | 45 +++++++++++++++++++
> >  .../selftests/bpf/progs/trigger_bench.c       | 10 ++++-
> >  3 files changed, 56 insertions(+), 1 deletion(-)
> >
> 
> Why not just adding uprobe-nop5 benchmark instead of going all the way
> into USDT? Seems simpler and will benchmark all the same stuff?

ok, perhaps with your new usdt library and the possible nop/nop5 tricks we
might want to have specific usdt benchmarks.. but that's for later anyway

jirka

> 
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > index 1bd403a5ef7b..dc5121e49623 100644
> > --- a/tools/testing/selftests/bpf/bench.c
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -526,6 +526,7 @@ extern const struct bench bench_trig_uprobe_multi_push;
> >  extern const struct bench bench_trig_uretprobe_multi_push;
> >  extern const struct bench bench_trig_uprobe_multi_ret;
> >  extern const struct bench bench_trig_uretprobe_multi_ret;
> > +extern const struct bench bench_trig_usdt;
> >
> >  extern const struct bench bench_rb_libbpf;
> >  extern const struct bench bench_rb_custom;
> > @@ -586,6 +587,7 @@ static const struct bench *benchs[] = {
> >         &bench_trig_uretprobe_multi_push,
> >         &bench_trig_uprobe_multi_ret,
> >         &bench_trig_uretprobe_multi_ret,
> > +       &bench_trig_usdt,
> >         /* ringbuf/perfbuf benchmarks */
> >         &bench_rb_libbpf,
> >         &bench_rb_custom,
> > diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > index 32e9f194d449..bdee8b8362d0 100644
> > --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > @@ -8,6 +8,7 @@
> >  #include "bench.h"
> >  #include "trigger_bench.skel.h"
> >  #include "trace_helpers.h"
> > +#include "../sdt.h"
> >
> >  #define MAX_TRIG_BATCH_ITERS 1000
> >
> > @@ -333,6 +334,13 @@ static void *uprobe_producer_ret(void *input)
> >         return NULL;
> >  }
> >
> > +static void *uprobe_producer_usdt(void *input)
> > +{
> > +       while (true)
> > +               STAP_PROBE(trigger, usdt);
> > +       return NULL;
> > +}
> > +
> >  static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
> >  {
> >         size_t uprobe_offset;
> > @@ -383,6 +391,37 @@ static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
> >         }
> >  }
> >
> > +static void __usdt_setup(const char *provider, const char *name)
> > +{
> > +       struct bpf_link *link;
> > +       int err;
> > +
> > +       setup_libbpf();
> > +
> > +       ctx.skel = trigger_bench__open();
> > +       if (!ctx.skel) {
> > +               fprintf(stderr, "failed to open skeleton\n");
> > +               exit(1);
> > +       }
> > +
> > +       bpf_program__set_autoload(ctx.skel->progs.bench_trigger_usdt, true);
> > +
> > +       err = trigger_bench__load(ctx.skel);
> > +       if (err) {
> > +               fprintf(stderr, "failed to load skeleton\n");
> > +               exit(1);
> > +       }
> > +
> > +       link = bpf_program__attach_usdt(ctx.skel->progs.bench_trigger_usdt,
> > +                                       -1 /* all PIDs */, "/proc/self/exe",
> > +                                       provider, name, NULL);
> > +       if (!link) {
> > +               fprintf(stderr, "failed to attach uprobe!\n");
> > +               exit(1);
> > +       }
> > +       ctx.skel->links.bench_trigger_usdt = link;
> > +}
> > +
> >  static void usermode_count_setup(void)
> >  {
> >         ctx.usermode_counters = true;
> > @@ -448,6 +487,11 @@ static void uretprobe_multi_ret_setup(void)
> >         usetup(true, true /* use_multi */, &uprobe_target_ret);
> >  }
> >
> > +static void usdt_setup(void)
> > +{
> > +       __usdt_setup("trigger", "usdt");
> > +}
> > +
> >  const struct bench bench_trig_syscall_count = {
> >         .name = "trig-syscall-count",
> >         .validate = trigger_validate,
> > @@ -506,3 +550,4 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
> >  BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
> >  BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
> >  BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
> > +BENCH_TRIG_USERMODE(usdt, usdt, "usdt");
> > diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > index 044a6d78923e..7b7d4a71e7d4 100644
> > --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> > +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > @@ -1,8 +1,9 @@
> >  // SPDX-License-Identifier: GPL-2.0
> >  // Copyright (c) 2020 Facebook
> > -#include <linux/bpf.h>
> > +#include "vmlinux.h"
> >  #include <asm/unistd.h>
> >  #include <bpf/bpf_helpers.h>
> > +#include <bpf/usdt.bpf.h>
> >  #include <bpf/bpf_tracing.h>
> >  #include "bpf_misc.h"
> >
> > @@ -138,3 +139,10 @@ int bench_trigger_rawtp(void *ctx)
> >         inc_counter();
> >         return 0;
> >  }
> > +
> > +SEC("?usdt")
> > +int bench_trigger_usdt(struct pt_regs *ctx)
> > +{
> > +       inc_counter();
> > +       return 0;
> > +}
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (10 preceding siblings ...)
  2024-11-05 13:34 ` [RFC bpf-next 11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
@ 2024-11-17 11:49 ` Peter Zijlstra
  2024-11-18  9:29   ` Jiri Olsa
  2024-11-18 10:06   ` Mark Rutland
  2024-11-18  8:04 ` Masami Hiramatsu
  12 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-17 11:49 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel, Will Deacon,
	Mark Rutland

On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 7.

So this is really about the fact that syscalls are faster than traps on
x86_64? Is there something similar on ARM64, or are they roughly the
same speed there?

That is, I don't think this scheme will work for the various RISC
architectures, given their very limited immediate range turns a typical
call into a multi-instruction trainwreck real quick.

Now, that isn't a problem if their exceptions and syscalls are of equal
speed.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (11 preceding siblings ...)
  2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
@ 2024-11-18  8:04 ` Masami Hiramatsu
  2024-11-18  9:52   ` Jiri Olsa
  12 siblings, 1 reply; 51+ messages in thread
From: Masami Hiramatsu @ 2024-11-18  8:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	Beau Belgrave

Hi Jiri,

On Tue,  5 Nov 2024 14:33:54 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 7.

This looks like a great idea!

> 
> The first benchmark shows about 68% speed up (see below). The benchmark
> triggers usdt probe in a loop and counts how many of those happened
> per second.

Hmm, interesting result. I'd like to compare it with user events,
which also use the "write" syscall to write pre-defined events
into the ftrace trace buffer.

But if the uprobe trampoline can run at a comparable speed, users may
want to use uprobes, because they are based on the widely used usdt and
avoid accessing tracefs from the application. (A user-events
application has to set up its events via the tracefs interface.)

> 
> It's still rfc state with some loose ends, but I'd be interested in
> any feedback about the direction of this.

So does this change the usdt macro? Or does it just reuse the existing
usdt macro so that user applications do not need to be recompiled?

Thank you,

> 
> It's based on tip/perf/core with bpf-next/master merged on top of
> that together with uprobe session patchset.
> 
> thanks,
> jirka
> 
> 
> current:
>         # ./bench -w2 -d5 -a  trig-usdt
>         Setting up benchmark 'trig-usdt'...
>         Benchmark 'trig-usdt' started.
>         Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
>         Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
>         Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
>         Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
>         Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
>         Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
>         Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
>         Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
>         Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s
> 
> optimized:
>         # ./bench -w2 -d5 -a  trig-usdt
>         Setting up benchmark 'trig-usdt'...
>         Benchmark 'trig-usdt' started.
>         Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
>         Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
>         Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
>         Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
>         Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
>         Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
>         Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
>         Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
>         Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s
> 
> ---
> Jiri Olsa (11):
>       uprobes: Rename arch_uretprobe_trampoline function
>       uprobes: Make copy_from_page global
>       uprobes: Add len argument to uprobe_write_opcode
>       uprobes: Add data argument to uprobe_write_opcode function
>       uprobes: Add mapping for optimized uprobe trampolines
>       uprobes: Add uprobe syscall to speed up uprobe
>       uprobes/x86: Add support to optimize uprobes
>       selftests/bpf: Use 5-byte nop for x86 usdt probes
>       selftests/bpf: Add usdt trigger bench
>       selftests/bpf: Add uprobe/usdt optimized test
>       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> 
>  arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
>  arch/x86/include/asm/uprobes.h                            |   7 +++
>  arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/syscalls.h                                  |   2 +
>  include/linux/uprobes.h                                   |  25 +++++++++-
>  kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  kernel/fork.c                                             |   2 +
>  kernel/sys_ni.c                                           |   1 +
>  tools/testing/selftests/bpf/bench.c                       |   2 +
>  tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
>  tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
>  tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
>  tools/testing/selftests/bpf/sdt.h                         |   9 +++-
>  14 files changed, 768 insertions(+), 19 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
  2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
  2024-11-14 23:44   ` Andrii Nakryiko
@ 2024-11-18  8:18   ` Masami Hiramatsu
  2024-11-18  9:39     ` Jiri Olsa
  1 sibling, 1 reply; 51+ messages in thread
From: Masami Hiramatsu @ 2024-11-18  8:18 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue,  5 Nov 2024 14:34:01 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
> 
> The current uprobe execution goes through following:
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
> 
> The optimized uprobe path
> 
>   - checks the original instruction is 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline and overwrites original
>     instruction (5-byte nop) with call to user space trampoline
>   - the user space trampoline executes uprobe syscall that calls related uprobe
>     consumers
>   - trampoline returns back to next instruction
> 
> This approach won't speed up all uprobes as it's limited to using nop5 as
> original instruction, but we could use nop5 as USDT probe instruction (which
> uses single byte nop ATM) and speed up the USDT probes.
> 
> This patch overloads related arch functions in uprobe_write_opcode and
> set_orig_insn so they can install call instruction if needed.
> 
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> first uprobe hit. I originally had it called on uprobe installation but then
> it clashed with elf loader, because the user space trampoline was added in a
> place where loader might need to put elf segments, so I decided to do it after
> first uprobe hit when loading is done.
> 
> TODO release uprobe trampoline when it's no longer needed.. we might need to
> stop all cpus to make sure no user space thread is in the trampoline.. or we
> might just keep it, because there's just one 4GB memory region?
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/include/asm/uprobes.h |   7 ++
>  arch/x86/kernel/uprobes.c      | 130 +++++++++++++++++++++++++++++++++
>  include/linux/uprobes.h        |   1 +
>  kernel/events/uprobes.c        |   3 +
>  4 files changed, 141 insertions(+)
> 
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..84a75ed748f0 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
>  #define UPROBE_SWBP_INSN		0xcc
>  #define UPROBE_SWBP_INSN_SIZE		   1
>  
> +enum {
> +	ARCH_UPROBE_FLAG_CAN_OPTIMIZE	= 0,
> +	ARCH_UPROBE_FLAG_OPTIMIZED	= 1,
> +};
> +
>  struct uprobe_xol_ops;
>  
>  struct arch_uprobe {
> @@ -45,6 +50,8 @@ struct arch_uprobe {
>  			u8	ilen;
>  		}			push;
>  	};
> +
> +	unsigned long flags;
>  };
>  
>  struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 02aa4519b677..50ccf24ff42c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
>  #include <asm/processor.h>
>  #include <asm/insn.h>
>  #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>  
>  /* Post-execution fixups. */
>  
> @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
>  	.emulate  = push_emulate_op,
>  };
>  
> +static int is_nop5_insns(uprobe_opcode_t *insn)
> +{
> +	return !memcmp(insn, x86_nops[5], 5);

Maybe better to use BYTES_NOP5 directly?

> +}
> +
> +static int is_call_insns(uprobe_opcode_t *insn)
> +{
> +	return *insn == 0xe8;

0xe8 -> CALL_INSN_OPCODE

Thank you,



-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
@ 2024-11-18  9:29   ` Jiri Olsa
  2024-11-18 10:06   ` Mark Rutland
  1 sibling, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-18  9:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel, Will Deacon,
	Mark Rutland

On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> So this is really about the fact that syscalls are faster than traps on
> x86_64? Is there something similar on ARM64, or are they roughly the
> same speed there?

yes, I recall somebody was porting the uretprobe syscall to arm, but there was
no speedup IIRC, so it looks like that's not the case on arm

I can't find the post atm, I'll keep digging

jirka

> 
> That is, I don't think this scheme will work for the various RISC
> architectures, given their very limited immediate range turns a typical
> call into a multi-instruction trainwreck real quick.
> 
> Now, that isn't a problem if their exceptions and syscalls are of equal
> speed.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
  2024-11-18  8:18   ` Masami Hiramatsu
@ 2024-11-18  9:39     ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-18  9:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Mon, Nov 18, 2024 at 05:18:08PM +0900, Masami Hiramatsu wrote:
> On Tue,  5 Nov 2024 14:34:01 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> > 
> > The current uprobe execution goes through following:
> >   - installs breakpoint instruction over original instruction
> >   - exception handler hit and calls related uprobe consumers
> >   - and either simulates original instruction or does out of line single step
> >     execution of it
> >   - returns to user space
> > 
> > The optimized uprobe path
> > 
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline and overwrites original
> >     instruction (5-byte nop) with call to user space trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> > 
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we could use nop5 as USDT probe instruction (which
> > uses single byte nop ATM) and speed up the USDT probes.
> > 
> > This patch overloads related arch functions in uprobe_write_opcode and
> > set_orig_insn so they can install call instruction if needed.
> > 
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> > 
> > TODO release uprobe trampoline when it's no longer needed.. we might need to
> > stop all cpus to make sure no user space thread is in the trampoline.. or we
> > might just keep it, because there's just one 4GB memory region?
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/include/asm/uprobes.h |   7 ++
> >  arch/x86/kernel/uprobes.c      | 130 +++++++++++++++++++++++++++++++++
> >  include/linux/uprobes.h        |   1 +
> >  kernel/events/uprobes.c        |   3 +
> >  4 files changed, 141 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index 678fb546f0a7..84a75ed748f0 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
> >  #define UPROBE_SWBP_INSN		0xcc
> >  #define UPROBE_SWBP_INSN_SIZE		   1
> >  
> > +enum {
> > +	ARCH_UPROBE_FLAG_CAN_OPTIMIZE	= 0,
> > +	ARCH_UPROBE_FLAG_OPTIMIZED	= 1,
> > +};
> > +
> >  struct uprobe_xol_ops;
> >  
> >  struct arch_uprobe {
> > @@ -45,6 +50,8 @@ struct arch_uprobe {
> >  			u8	ilen;
> >  		}			push;
> >  	};
> > +
> > +	unsigned long flags;
> >  };
> >  
> >  struct arch_uprobe_task {
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 02aa4519b677..50ccf24ff42c 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/processor.h>
> >  #include <asm/insn.h>
> >  #include <asm/mmu_context.h>
> > +#include <asm/nops.h>
> >  
> >  /* Post-execution fixups. */
> >  
> > @@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = {
> >  	.emulate  = push_emulate_op,
> >  };
> >  
> > +static int is_nop5_insns(uprobe_opcode_t *insn)
> > +{
> > +	return !memcmp(insn, x86_nops[5], 5);
> 
> Maybe better to use BYTES_NOP5 directly?

ok

> 
> > +}
> > +
> > +static int is_call_insns(uprobe_opcode_t *insn)
> > +{
> > +	return *insn == 0xe8;
> 
> 0xe8 -> CALL_INSN_OPCODE

right, thanks
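
fwiw with both suggestions applied the helpers could look like this
(sketch; assumes asm/nops.h and asm/text-patching.h for BYTES_NOP5 and
CALL_INSN_OPCODE; BYTES_NOP5 expands to a comma-separated byte list,
hence the compound literal):

	static bool is_nop5_insn(uprobe_opcode_t *insn)
	{
		return !memcmp(insn, (const u8 []){ BYTES_NOP5 }, 5);
	}

	static bool is_call_insn(uprobe_opcode_t *insn)
	{
		return *insn == CALL_INSN_OPCODE;
	}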

jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-18  8:04 ` Masami Hiramatsu
@ 2024-11-18  9:52   ` Jiri Olsa
  0 siblings, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-18  9:52 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Alan Maguire, linux-kernel, linux-trace-kernel, Beau Belgrave

On Mon, Nov 18, 2024 at 05:04:58PM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Tue,  5 Nov 2024 14:33:54 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> This looks like a great idea!
> 
> > 
> > The first benchmark shows about 68% speed up (see below). The benchmark
> > triggers usdt probe in a loop and counts how many of those happened
> > per second.
> 
> Hmm, interesting result. I'd like to compare it with user-space event,
> which is also use "write" syscall to write the pre-defined events
> in the ftrace trace buffer.
> 
> But if uprobe trampoline can run in the comparable speed, user may
> want to use uprobes because it is based on widely used usdt and
> avoid accessing tracefs from application. (The user event user
> application has to setup their events via tracefs interface)

I see, will check more details on user events

> 
> > 
> > It's still rfc state with some loose ends, but I'd be interested in
> > any feedback about the direction of this.
> 
> So does this change the usdt macro? Or it just reuse the usdt so that
> user applications does not need to be recompiled?

yes, ideally apps keep using the current usdt macro (or switch to [1]),
which would unwind to nop5 (or whatever we end up using)

there's an issue wrt old kernels running apps compiled with the new usdt macro,
but it's solvable as Andrii pointed out in the other reply [2]

thanks,
jirka

[1] https://github.com/libbpf/usdt
[2] https://lore.kernel.org/bpf/20241118170458.c825bf255c2fb93f2e6a3519@kernel.org/T/#m20da8469e210880cae07f01783f4a51817ffbe4d

> 
> Thank you,
> 
> > 
> > It's based on tip/perf/core with bpf-next/master merged on top of
> > that together with uprobe session patchset.
> > 
> > thanks,
> > jirka
> > 
> > 
> > current:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.982us): hits    4.893M/s (  4.893M/prod), drops    0.000M/s, total operations    4.893M/s
> >         Iter   1 ( -5.967us): hits    4.892M/s (  4.892M/prod), drops    0.000M/s, total operations    4.892M/s
> >         Iter   2 ( -2.771us): hits    4.899M/s (  4.899M/prod), drops    0.000M/s, total operations    4.899M/s
> >         Iter   3 (  1.286us): hits    4.889M/s (  4.889M/prod), drops    0.000M/s, total operations    4.889M/s
> >         Iter   4 ( -2.871us): hits    4.881M/s (  4.881M/prod), drops    0.000M/s, total operations    4.881M/s
> >         Iter   5 (  1.005us): hits    4.886M/s (  4.886M/prod), drops    0.000M/s, total operations    4.886M/s
> >         Iter   6 ( 11.626us): hits    4.906M/s (  4.906M/prod), drops    0.000M/s, total operations    4.906M/s
> >         Iter   7 ( -6.638us): hits    4.896M/s (  4.896M/prod), drops    0.000M/s, total operations    4.896M/s
> >         Summary: hits    4.893 +- 0.009M/s (  4.893M/prod), drops    0.000 +- 0.000M/s, total operations    4.893 +- 0.009M/s
> > 
> > optimized:
> >         # ./bench -w2 -d5 -a  trig-usdt
> >         Setting up benchmark 'trig-usdt'...
> >         Benchmark 'trig-usdt' started.
> >         Iter   0 ( 46.073us): hits    8.258M/s (  8.258M/prod), drops    0.000M/s, total operations    8.258M/s
> >         Iter   1 ( -5.752us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   2 ( -1.333us): hits    8.263M/s (  8.263M/prod), drops    0.000M/s, total operations    8.263M/s
> >         Iter   3 ( -2.996us): hits    8.265M/s (  8.265M/prod), drops    0.000M/s, total operations    8.265M/s
> >         Iter   4 ( -0.620us): hits    8.264M/s (  8.264M/prod), drops    0.000M/s, total operations    8.264M/s
> >         Iter   5 ( -2.624us): hits    8.236M/s (  8.236M/prod), drops    0.000M/s, total operations    8.236M/s
> >         Iter   6 ( -0.840us): hits    8.232M/s (  8.232M/prod), drops    0.000M/s, total operations    8.232M/s
> >         Iter   7 ( -1.783us): hits    8.235M/s (  8.235M/prod), drops    0.000M/s, total operations    8.235M/s
> >         Summary: hits    8.249 +- 0.016M/s (  8.249M/prod), drops    0.000 +- 0.000M/s, total operations    8.249 +- 0.016M/s
> > 
> > ---
> > Jiri Olsa (11):
> >       uprobes: Rename arch_uretprobe_trampoline function
> >       uprobes: Make copy_from_page global
> >       uprobes: Add len argument to uprobe_write_opcode
> >       uprobes: Add data argument to uprobe_write_opcode function
> >       uprobes: Add mapping for optimized uprobe trampolines
> >       uprobes: Add uprobe syscall to speed up uprobe
> >       uprobes/x86: Add support to optimize uprobes
> >       selftests/bpf: Use 5-byte nop for x86 usdt probes
> >       selftests/bpf: Add usdt trigger bench
> >       selftests/bpf: Add uprobe/usdt optimized test
> >       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> > 
> >  arch/x86/entry/syscalls/syscall_64.tbl                    |   1 +
> >  arch/x86/include/asm/uprobes.h                            |   7 +++
> >  arch/x86/kernel/uprobes.c                                 | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/syscalls.h                                  |   2 +
> >  include/linux/uprobes.h                                   |  25 +++++++++-
> >  kernel/events/uprobes.c                                   | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >  kernel/fork.c                                             |   2 +
> >  kernel/sys_ni.c                                           |   1 +
> >  tools/testing/selftests/bpf/bench.c                       |   2 +
> >  tools/testing/selftests/bpf/benchs/bench_trigger.c        |  45 +++++++++++++++++
> >  tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/trigger_bench.c         |  10 +++-
> >  tools/testing/selftests/bpf/progs/uprobe_optimized.c      |  29 +++++++++++
> >  tools/testing/selftests/bpf/sdt.h                         |   9 +++-
> >  14 files changed, 768 insertions(+), 19 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_optimized.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c
> 
> 
> -- 
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
  2024-11-18  9:29   ` Jiri Olsa
@ 2024-11-18 10:06   ` Mark Rutland
  2024-11-19  6:13     ` Andrii Nakryiko
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Rutland @ 2024-11-18 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	Will Deacon

On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 7.
> 
> So this is really about the fact that syscalls are faster than traps on
> x86_64? Is there something similar on ARM64, or are they roughly the
> same speed there?

From the hardware side I would expect those to be the same speed.

From the software side, there might be a difference, but in theory we
should be able to make the non-syscall case faster because we don't have
syscall tracing there.

> That is, I don't think this scheme will work for the various RISC
> architectures, given their very limited immediate range turns a typical
> call into a multi-instruction trainwreck real quick.
> 
> Now, that isn't a problem if their exceptions and syscalls are of equal
> speed.

Yep, on arm64 we definitely can't patch in branches reliably; using BRK
(as we do today) is the only reliable option, and it *shouldn't* be
slower than a syscall.

Looking around, we have a different latent issue with uprobes on arm64
in that only certain instructions can be modified while being
concurrently executed (in addition to the atomicity of updating the
bytes in memory), and for everything else we need to stop the world. We
handle that for kprobes but it looks like we don't have any
infrastructure to handle that for uprobes.

Mark.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-16 21:44     ` Jiri Olsa
@ 2024-11-19  6:05       ` Andrii Nakryiko
  2024-11-19 15:14         ` Jiri Olsa
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-19  6:05 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding interface to add special mapping for user space page that will be
> > > used as place holder for uprobe trampoline in following changes.
> > >
> > > The get_tramp_area(vaddr) function either finds a 'callable' page or creates
> > > a new one.  'Callable' means the page is reachable by a call instruction
> > > (from the vaddr argument), as decided by each arch via the new
> > > arch_uprobe_is_callable function.
> > >
> > > The put_tramp_area function either drops the refcount or destroys the special
> > > mapping; all the mappings are cleaned up when the process goes down.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  include/linux/uprobes.h |  12 ++++
> > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/fork.c           |   2 +
> > >  3 files changed, 155 insertions(+)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index be306028ed59..222d8e82cee2 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -172,6 +172,15 @@ struct xol_area;
> > >
> > >  struct uprobes_state {
> > >         struct xol_area         *xol_area;
> > > +       struct hlist_head       tramp_head;
> > > +       struct mutex            tramp_mutex;
> > > +};
> > > +
> > > +struct tramp_area {
> > > +       unsigned long           vaddr;
> > > +       struct page             *page;
> > > +       struct hlist_node       node;
> > > +       refcount_t              ref;
> >
> > nit: any reason we are unnecessarily trying to save 4 bytes on the
> > refcount (and we don't actually save any, due to padding)?
>
> hum, I'm not sure what you mean.. what's the alternative?

atomic64_t ?

>
> >
> > >  };
> > >
> > >  extern void __init uprobes_init(void);
> > > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> > >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > >                                      uprobe_opcode_t *new_opcode, void *data);
> > >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> >
> > uprobe_get_tramp_area() to make it clear this is uprobe specific,
> > given this is exposed function?
> >
> > and add that extern like we do for other functions?
>
> ok
>
> >
> > > +void put_tramp_area(struct tramp_area *area);
> >
> > uprobe_put_tramp_area() ?
>
> ok
>
> >
> > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> > >  #else /* !CONFIG_UPROBES */
> > >  struct uprobes_state {
> > >  };
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 944d9df1f081..a44305c559a4 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> > >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> > >  }
> > >
> > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > > +static unsigned long find_nearest_page(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma, *prev;
> > > +       VMA_ITERATOR(vmi, mm, 0);
> > > +
> > > +       prev = vma_next(&vmi);
> > > +       vma = vma_next(&vmi);
> > > +       while (vma) {
> > > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > > +                       return prev->vm_end;
> >
> > shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> > as two possible places
>
> right, we should do that
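
A minimal sketch of that, reusing the helpers from the patch as-is (the
loop tail is assumed here, since it was snipped below):

    static unsigned long find_nearest_page(unsigned long vaddr)
    {
            struct mm_struct *mm = current->mm;
            struct vm_area_struct *vma, *prev;
            VMA_ITERATOR(vmi, mm, 0);

            prev = vma_next(&vmi);
            vma = vma_next(&vmi);
            while (vma) {
                    if (vma->vm_start - prev->vm_end >= PAGE_SIZE) {
                            /* the low end of the gap, right above prev */
                            if (arch_uprobe_is_callable(prev->vm_end, vaddr))
                                    return prev->vm_end;
                            /* the high end of the gap, right below vma */
                            if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
                                    return vma->vm_start - PAGE_SIZE;
                    }
                    prev = vma;
                    vma = vma_next(&vmi);
            }
            return 0;
    }
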
>
> SNIP
>
> > > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma;
> > > +       struct tramp_area *area;
> > > +
> > > +       vaddr = find_nearest_page(vaddr);
> > > +       if (!vaddr)
> > > +               return NULL;
> > > +
> > > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > > +       if (unlikely(!area))
> > > +               return NULL;
> > > +
> > > +       area->page = alloc_page(GFP_HIGHUSER);
> > > +       if (!area->page)
> > > +               goto free_area;
> > > +
> > > +       refcount_set(&area->ref, 1);
> > > +       area->vaddr = vaddr;
> > > +
> > > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > > +                               &tramp_mapping);
> > > +       if (!IS_ERR(vma))
> > > +               return area;
> >
> > please keep a pattern, it's less surprising that way
> >
> >     if (IS_ERR(vma))
> >         goto free_page;
> >
> >     return area;
> >
> > free_page:
>
> ok
>
>
> SNIP
>
> > > +static void destroy_tramp_area(struct tramp_area *area)
> > > +{
> > > +       hlist_del(&area->node);
> > > +       put_page(area->page);
> > > +       kfree(area);
> > > +}
> > > +
> > > +void put_tramp_area(struct tramp_area *area)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct uprobes_state *state = &mm->uprobes_state;
> > > +
> > > +       if (area == NULL)
> >
> > nit: !area
>
> ok
>
> thanks,
> jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-16 21:44         ` Jiri Olsa
@ 2024-11-19  6:06           ` Andrii Nakryiko
  2024-11-19  9:13             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-19  6:06 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > > Adding an interface to add a special mapping for a user space page that
> > > > > will be used as a placeholder for the uprobe trampoline in following
> > > > > changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds a 'callable' page or
> > > > > creates a new one.  'Callable' means the page is reachable by a call
> > > > > instruction (from the vaddr argument), as decided by each arch via the
> > > > > new arch_uprobe_is_callable function.
> > > > >
> > > > > The put_tramp_area function either drops the refcount or destroys the
> > > > > special mapping; all the mappings are cleaned up when the process goes
> > > > > down.
> > > >
> > > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > > with more than 4G of .text. This isn't going to work for them, is it?
> > > >
> > >
> > > not if you can't reach the trampoline from the probed address
> >
> > That specific example was about 1.5GB (though we might have bigger
> > .text, I didn't do exhaustive research). As Jiri said, this would be
> > best effort trying to find closest free mapping to stay within +/-2GB
> > offset. If that fails, we always would be falling back to slower
> > int3-based uprobing, yep.
> >
> > Jiri, we could also have an option to support 64-bit call, right? We'd
> > need nop9 for that, but it's an option as well to future-proofing this
> > approach, no?
>
> hm, I don't think there's a call with a relative 64-bit offset

why do you need a relative, when you have 64 bits? ;) there is a call
to an absolute address, no?

>
> there's an indirect call through a register or address.. but I think we would
> fit in nop10 with the indirect call through an address
>
> >
> > Also, can we somehow use fs/gs-based indirect calls/jumps to guarantee
> > that the offset is always small (<2GB away relative to the base stored
> > in fs/gs)? Not sure if this is feasible, but I thought it would be good
> > to bring this up, if only to rule it out.
> >
> > If a segment-based absolute call is somehow feasible, we can probably
> > simplify a bunch of stuff by allocating it eagerly, once, somewhere high
> > up next to the VDSO (or maybe even put it into the VDSO, don't know).
>
> yes, that would be convenient
>
> jirka
>
> >
> > Anyways, let's brainstorm if there are any clever alternatives here.
> >
> >
> > >
> > > jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench
  2024-11-16 21:45     ` Jiri Olsa
@ 2024-11-19  6:08       ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-19  6:08 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Sat, Nov 16, 2024 at 1:45 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:40:53PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding usdt trigger bench to measure optimized usdt probes.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  tools/testing/selftests/bpf/bench.c           |  2 +
> > >  .../selftests/bpf/benchs/bench_trigger.c      | 45 +++++++++++++++++++
> > >  .../selftests/bpf/progs/trigger_bench.c       | 10 ++++-
> > >  3 files changed, 56 insertions(+), 1 deletion(-)
> > >
> >
> > Why not just adding uprobe-nop5 benchmark instead of going all the way
> > into USDT? Seems simpler and will benchmark all the same stuff?
>
> ok, perhaps with your new usdt library and the possible nop/nop5 tricks we
> might want to have specific usdt benchmarks.. but that's for later anyway
>

meh, maybe, don't know if necessary *for benchmark*.

But anyways, the USDT library is out, see [0], feel free to take a look and use it

  [0] https://github.com/libbpf/usdt
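
(If a uprobe-nop5 variant does get added, the producer side would be
trivial; a sketch, x86-64 only, with a made-up name -- the 5 bytes are
the same nopl 0x0(%rax,%rax,1) a USDT site carries:)

    static void *uprobe_producer_nop5(void *input)
    {
            while (true)
                    asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
            return NULL;
    }
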

> jirka
>
> >
> > > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > > index 1bd403a5ef7b..dc5121e49623 100644
> > > --- a/tools/testing/selftests/bpf/bench.c
> > > +++ b/tools/testing/selftests/bpf/bench.c
> > > @@ -526,6 +526,7 @@ extern const struct bench bench_trig_uprobe_multi_push;
> > >  extern const struct bench bench_trig_uretprobe_multi_push;
> > >  extern const struct bench bench_trig_uprobe_multi_ret;
> > >  extern const struct bench bench_trig_uretprobe_multi_ret;
> > > +extern const struct bench bench_trig_usdt;
> > >
> > >  extern const struct bench bench_rb_libbpf;
> > >  extern const struct bench bench_rb_custom;
> > > @@ -586,6 +587,7 @@ static const struct bench *benchs[] = {
> > >         &bench_trig_uretprobe_multi_push,
> > >         &bench_trig_uprobe_multi_ret,
> > >         &bench_trig_uretprobe_multi_ret,
> > > +       &bench_trig_usdt,
> > >         /* ringbuf/perfbuf benchmarks */
> > >         &bench_rb_libbpf,
> > >         &bench_rb_custom,
> > > diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > index 32e9f194d449..bdee8b8362d0 100644
> > > --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > @@ -8,6 +8,7 @@
> > >  #include "bench.h"
> > >  #include "trigger_bench.skel.h"
> > >  #include "trace_helpers.h"
> > > +#include "../sdt.h"
> > >
> > >  #define MAX_TRIG_BATCH_ITERS 1000
> > >
> > > @@ -333,6 +334,13 @@ static void *uprobe_producer_ret(void *input)
> > >         return NULL;
> > >  }
> > >
> > > +static void *uprobe_producer_usdt(void *input)
> > > +{
> > > +       while (true)
> > > +               STAP_PROBE(trigger, usdt);
> > > +       return NULL;
> > > +}
> > > +
> > >  static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
> > >  {
> > >         size_t uprobe_offset;
> > > @@ -383,6 +391,37 @@ static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
> > >         }
> > >  }
> > >
> > > +static void __usdt_setup(const char *provider, const char *name)
> > > +{
> > > +       struct bpf_link *link;
> > > +       int err;
> > > +
> > > +       setup_libbpf();
> > > +
> > > +       ctx.skel = trigger_bench__open();
> > > +       if (!ctx.skel) {
> > > +               fprintf(stderr, "failed to open skeleton\n");
> > > +               exit(1);
> > > +       }
> > > +
> > > +       bpf_program__set_autoload(ctx.skel->progs.bench_trigger_usdt, true);
> > > +
> > > +       err = trigger_bench__load(ctx.skel);
> > > +       if (err) {
> > > +               fprintf(stderr, "failed to load skeleton\n");
> > > +               exit(1);
> > > +       }
> > > +
> > > +       link = bpf_program__attach_usdt(ctx.skel->progs.bench_trigger_usdt,
> > > +                                       -1 /* all PIDs */, "/proc/self/exe",
> > > +                                       provider, name, NULL);
> > > +       if (!link) {
> > > +               fprintf(stderr, "failed to attach uprobe!\n");
> > > +               exit(1);
> > > +       }
> > > +       ctx.skel->links.bench_trigger_usdt = link;
> > > +}
> > > +
> > >  static void usermode_count_setup(void)
> > >  {
> > >         ctx.usermode_counters = true;
> > > @@ -448,6 +487,11 @@ static void uretprobe_multi_ret_setup(void)
> > >         usetup(true, true /* use_multi */, &uprobe_target_ret);
> > >  }
> > >
> > > +static void usdt_setup(void)
> > > +{
> > > +       __usdt_setup("trigger", "usdt");
> > > +}
> > > +
> > >  const struct bench bench_trig_syscall_count = {
> > >         .name = "trig-syscall-count",
> > >         .validate = trigger_validate,
> > > @@ -506,3 +550,4 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
> > >  BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
> > >  BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
> > >  BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
> > > +BENCH_TRIG_USERMODE(usdt, usdt, "usdt");
> > > diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > index 044a6d78923e..7b7d4a71e7d4 100644
> > > --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > @@ -1,8 +1,9 @@
> > >  // SPDX-License-Identifier: GPL-2.0
> > >  // Copyright (c) 2020 Facebook
> > > -#include <linux/bpf.h>
> > > +#include "vmlinux.h"
> > >  #include <asm/unistd.h>
> > >  #include <bpf/bpf_helpers.h>
> > > +#include <bpf/usdt.bpf.h>
> > >  #include <bpf/bpf_tracing.h>
> > >  #include "bpf_misc.h"
> > >
> > > @@ -138,3 +139,10 @@ int bench_trigger_rawtp(void *ctx)
> > >         inc_counter();
> > >         return 0;
> > >  }
> > > +
> > > +SEC("?usdt")
> > > +int bench_trigger_usdt(struct pt_regs *ctx)
> > > +{
> > > +       inc_counter();
> > > +       return 0;
> > > +}
> > > --
> > > 2.47.0
> > >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-18 10:06   ` Mark Rutland
@ 2024-11-19  6:13     ` Andrii Nakryiko
  2024-11-21 18:18       ` Mark Rutland
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-19  6:13 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	Will Deacon

On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sun, Nov 17, 2024 at 12:49:46PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2024 at 02:33:54PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > nop instruction.
> > >
> > > The generic approach (optimize all uprobes) is hard due to emulating
> > > possible multiple original instructions and its related issues. The
> > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > with that.
> > >
> > > The basic idea is to replace breakpoint exception with syscall which
> > > is faster on x86_64. For more details please see changelog of patch 7.
> >
> > So this is really about the fact that syscalls are faster than traps on
> > x86_64? Is there something similar on ARM64, or are they roughly the
> > same speed there?
>
> From the hardware side I would expect those to be the same speed.
>
> From the software side, there might be a difference, but in theory we
> should be able to make the non-syscall case faster because we don't have
> syscall tracing there.
>
> > That is, I don't think this scheme will work for the various RISC
> > architectures, given their very limited immediate range turns a typical
> > call into a multi-instruction trainwreck real quick.
> >
> > Now, that isn't a problem if their exceptions and syscalls are of equal
> > speed.
>
> Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> (as we do today) is the only reliable option, and it *shouldn't* be
> slower than a syscall.
>
> Looking around, we have a different latent issue with uprobes on arm64
> in that only certain instructions can be modified while being
> concurrently executed (in addition to the atomicity of updating the

What does this mean for the application in practical terms? Will it
crash? Or will there be some corruption? Just curious how this can
manifest.

> bytes in memory), and for everything else we need to stop-the-world. We
> handle that for kprobes but it looks like we don't have any
> infrastructure to handle that for uprobes.
>
> Mark.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-19  6:06           ` Andrii Nakryiko
@ 2024-11-19  9:13             ` Peter Zijlstra
  2024-11-19 15:15               ` Jiri Olsa
  2024-11-21  0:07               ` Andrii Nakryiko
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-19  9:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:

> > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > need nop9 for that, but it's an option as well to future-proofing this
> > > approach, no?
> >
> > hm, I don't think there's a call with a relative 64-bit offset
> 
> why do you need a relative, when you have 64 bits? ;) there is a call
> > to an absolute address, no?

No, there is not :/ You get to use an indirect call, which means
multiple instructions and all the speculation joy.

IFF USDT thingies have AX clobbered (I couldn't find out in a hurry) then
patching the multi-instruction thing is relatively straightforward; if
they don't, it's going to be a pain.
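
Concretely, something like this compile-only sketch (made-up names, just
to eyeball the encodings with objdump; not meant to be executed):

    void tramp(void) { }
    void (*tramp_ptr)(void) = tramp;

    void call_flavours(void)
    {
            /* 5-byte rel32 call: only reaches +/- 2G from the site */
            asm volatile ("call tramp");

            /* 64-bit target with a scratch reg: 10 + 2 bytes */
            asm volatile ("movabs $tramp, %%rax; call *%%rax" ::: "rax");

            /* 64-bit target without one: ff 15 <rel32>, 6 bytes, plus
             * the 8-byte tramp_ptr slot it loads through */
            asm volatile ("call *tramp_ptr(%rip)");
    }
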

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-19  6:05       ` Andrii Nakryiko
@ 2024-11-19 15:14         ` Jiri Olsa
  2024-11-21  0:10           ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jiri Olsa @ 2024-11-19 15:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Adding an interface to add a special mapping for a user space page that
> > > > will be used as a placeholder for the uprobe trampoline in following
> > > > changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds a 'callable' page or
> > > > creates a new one.  'Callable' means the page is reachable by a call
> > > > instruction (from the vaddr argument), as decided by each arch via the
> > > > new arch_uprobe_is_callable function.
> > > >
> > > > The put_tramp_area function either drops the refcount or destroys the
> > > > special mapping; all the mappings are cleaned up when the process goes
> > > > down.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > >  include/linux/uprobes.h |  12 ++++
> > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > >  kernel/fork.c           |   2 +
> > > >  3 files changed, 155 insertions(+)
> > > >
> > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > index be306028ed59..222d8e82cee2 100644
> > > > --- a/include/linux/uprobes.h
> > > > +++ b/include/linux/uprobes.h
> > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > >
> > > >  struct uprobes_state {
> > > >         struct xol_area         *xol_area;
> > > > +       struct hlist_head       tramp_head;
> > > > +       struct mutex            tramp_mutex;
> > > > +};
> > > > +
> > > > +struct tramp_area {
> > > > +       unsigned long           vaddr;
> > > > +       struct page             *page;
> > > > +       struct hlist_node       node;
> > > > +       refcount_t              ref;
> > >
> > > nit: any reason we are unnecessarily trying to save 4 bytes on the
> > > refcount (and we don't actually save any, due to padding)?
> >
> > hum, I'm not sure what you mean.. what's the alternative?
> 
> atomic64_t ?

hum, just because we have extra 4 bytes of padding? we use refcount_t
in other places so it seems like a better fit to me

jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-19  9:13             ` Peter Zijlstra
@ 2024-11-19 15:15               ` Jiri Olsa
  2024-11-21  0:07               ` Andrii Nakryiko
  1 sibling, 0 replies; 51+ messages in thread
From: Jiri Olsa @ 2024-11-19 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 19, 2024 at 10:13:48AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
> 
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's a call with a relative 64-bit offset
> > 
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to an absolute address, no?
> 
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.
> 
> IFF USDT thingies have AX clobbered (I couldn't find out in a hurry) then
> patching the multi-instruction thing is relatively straightforward; if
> they don't, it's going to be a pain.

I don't follow, what's the reason for that?

jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-19  9:13             ` Peter Zijlstra
  2024-11-19 15:15               ` Jiri Olsa
@ 2024-11-21  0:07               ` Andrii Nakryiko
  2024-11-21 11:53                 ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-21  0:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
>
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's a call with a relative 64-bit offset
> >
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to an absolute address, no?
>
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.

Ah, I misinterpreted `CALL m16:64` (from [0]) as an absolute address
call with 64-bit displacement encoded in the struct, my bad.

  [0] https://www.felixcloutier.com/x86/call

>
> IFF USDT thingies have AX clobbered (I couldn't find out in a hurry) then
> patching the multi-instruction thing is relatively straightforward; if
> they don't, it's going to be a pain.

USDTs are meant to be "transparent" to the surrounding code and they
don't mark any clobbered registers. Technically it could be added, but
I'm not a fan of this.

I'd stick to the 32-bit relative call for now, and fallback to int3 if
we can't reach the uprobe trampoline. So huge .text might suffer
suboptimal performance, but at least USDTs won't pessimize the
surrounding code (and kernel won't make any assumptions about
registers that are ok to use).
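
The reachability check for the rel32 case is cheap anyway; roughly (a
sketch, the patchset's arch hook may look different):

    /* rel32 displacement is relative to the end of the 5-byte call */
    static bool within_call_range(unsigned long vtramp, unsigned long vaddr)
    {
            long delta = (long)vtramp - (long)(vaddr + 5);

            return delta >= INT_MIN && delta <= INT_MAX;
    }
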

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-19 15:14         ` Jiri Olsa
@ 2024-11-21  0:10           ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-21  0:10 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Tue, Nov 19, 2024 at 7:14 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> > On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > > >
> > > > > Adding an interface to add a special mapping for a user space page that
> > > > > will be used as a placeholder for the uprobe trampoline in following
> > > > > changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds a 'callable' page or
> > > > > creates a new one.  'Callable' means the page is reachable by a call
> > > > > instruction (from the vaddr argument), as decided by each arch via the
> > > > > new arch_uprobe_is_callable function.
> > > > >
> > > > > The put_tramp_area function either drops the refcount or destroys the
> > > > > special mapping; all the mappings are cleaned up when the process goes
> > > > > down.
> > > > >
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > >  include/linux/uprobes.h |  12 ++++
> > > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > > >  kernel/fork.c           |   2 +
> > > > >  3 files changed, 155 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > > index be306028ed59..222d8e82cee2 100644
> > > > > --- a/include/linux/uprobes.h
> > > > > +++ b/include/linux/uprobes.h
> > > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > > >
> > > > >  struct uprobes_state {
> > > > >         struct xol_area         *xol_area;
> > > > > +       struct hlist_head       tramp_head;
> > > > > +       struct mutex            tramp_mutex;
> > > > > +};
> > > > > +
> > > > > +struct tramp_area {
> > > > > +       unsigned long           vaddr;
> > > > > +       struct page             *page;
> > > > > +       struct hlist_node       node;
> > > > > +       refcount_t              ref;
> > > >
> > > > nit: any reason we are unnecessarily trying to save 4 bytes on the
> > > > refcount (and we don't actually save any, due to padding)?
> > >
> > > hum, I'm not sure what you mean.. what's the alternative?
> >
> > atomic64_t ?
>
> hum, just because we have extra 4 bytes of padding? we use refcount_t
> in other places so it seems like a better fit to me

My (minor) concern was that tramp_area is a very long-lived object
that can be shared across huge numbers of uprobes, and 2 billion
uprobes is a lot, but still a number that one can, practically
speaking, reach (with enough effort). And so skimping on the refcount
to save 4 bytes for something of which we have one or two per *some*
processes seemed (and still seems) like the wrong savings to pursue.
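
(FWIW refcount_t saturates with a warning rather than overflowing, so
the failure mode would be a stuck/leaked trampoline rather than
corruption; still, the 64-bit counter sidesteps the question and, given
the padding, costs nothing -- a sketch of what I mean:)

    struct tramp_area {
            unsigned long           vaddr;
            struct page             *page;
            struct hlist_node       node;
            atomic64_t              ref;    /* no practical ceiling */
    };
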

>
> jirka

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-21  0:07               ` Andrii Nakryiko
@ 2024-11-21 11:53                 ` Peter Zijlstra
  2024-11-21 16:02                   ` Alexei Starovoitov
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-21 11:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	H. Peter Anvin

On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:

> USDTs are meant to be "transparent" to the surrounding code and they
> don't mark any clobbered registers. Technically it could be added, but
> I'm not a fan of this.

Sure. Anyway, another thing to consider is FRED, will all of this still
matter once that lands? If FRED gets us INT3 performance close to what
SYSCALL has, then all this work will go unused.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-21 11:53                 ` Peter Zijlstra
@ 2024-11-21 16:02                   ` Alexei Starovoitov
  2024-11-21 16:34                     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Alexei Starovoitov @ 2024-11-21 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, LKML, linux-trace-kernel,
	H. Peter Anvin

On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
>
> > USDTs are meant to be "transparent" to the surrounding code and they
> > don't mark any clobbered registers. Technically it could be added, but
> > I'm not a fan of this.
>
> Sure. Anyway, another thing to consider is FRED, will all of this still
> matter once that lands? If FRED gets us INT3 performance close to what
> SYSCALL has, then all this work will go unused.

afaik not a single cpu in the datacenter supports FRED while
uprobe overhead is real.
imo it's worth improving performance today for existing cpus.
I suspect arm64 might benefit too. Even if arm hw does the same
amount of work for trap vs syscall, the sw overhead of handling
a trap is different.
I suspect that equation will apply to future FRED cpus too.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-21 16:02                   ` Alexei Starovoitov
@ 2024-11-21 16:34                     ` Peter Zijlstra
  2024-11-21 16:47                       ` Alexei Starovoitov
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-21 16:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, LKML, linux-trace-kernel,
	H. Peter Anvin

On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
> >
> > > USDTs are meant to be "transparent" to the surrounding code and they
> > > don't mark any clobbered registers. Technically it could be added, but
> > > I'm not a fan of this.
> >
> > Sure. Anyway, another thing to consider is FRED, will all of this still
> > matter once that lands? If FRED gets us INT3 performance close to what
> > SYSCALL has, then all this work will go unused.
> 
> afaik not a single cpu in the datacenter supports FRED while
> uprobe overhead is real.
> imo it's worth improving performance today for existing cpus.

I understand, but OTOH adding a syscall now, one that we'll have to
maintain for years and years even though we know it'll not be used
much, is a bit annoying.

> > I suspect arm64 might benefit too. Even if arm hw does the same
> > amount of work for trap vs syscall, the sw overhead of handling
> > a trap is different.

Well, the RISC CPUs have a much harder time using this, their immediate
range is typically puny and they end up needing multiple instructions
and some register in order to set up a call.

Elsewhere in the thread Mark Rutland already noted that arm64 really
doesn't need or want this.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-21 16:34                     ` Peter Zijlstra
@ 2024-11-21 16:47                       ` Alexei Starovoitov
  2024-11-21 19:38                         ` Mark Rutland
  0 siblings, 1 reply; 51+ messages in thread
From: Alexei Starovoitov @ 2024-11-21 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrii Nakryiko, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, LKML, linux-trace-kernel,
	H. Peter Anvin

On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote:
> > On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
> > >
> > > > USDTs are meant to be "transparent" to the surrounding code and they
> > > > don't mark any clobbered registers. Technically it could be added, but
> > > > I'm not a fan of this.
> > >
> > > Sure. Anyway, another thing to consider is FRED, will all of this still
> > > matter once that lands? If FRED gets us INT3 performance close to what
> > > SYSCALL has, then all this work will go unused.
> >
> > afaik not a single cpu in the datacenter supports FRED while
> > uprobe overhead is real.
> > imo it's worth improving performance today for existing cpus.
>
> I understand, but OTOH adding a syscall now, one that we'll have to
> maintain for years and years even though we know it'll not be used
> much, is a bit annoying.

No. It _will_ be used for years.

>
> > I suspect arm64 might benefit too. Even if arm hw does the same
> > amount of work for trap vs syscall, the sw overhead of handling
> > a trap is different.
>
> Well, the RISC CPUs have a much harder time using this, their immediate
> range is typically puny and they end up needing multiple instructions
> and some register in order to set up a call.

We don't care about 32-bit archs and other exotics.
They're not a reason to leave performance on the table
on dominant archs.

> Elsewhere in the thread Mark Rutland already noted that arm64 really
> doesn't need or want this.

Doesn't look like you've read what you quoted above.
On arm64 the _HW_ cost may be the same.
The _SW_ difference in handling trap vs syscall is real.
I bet once uprobe syscall is benchmarked on arm64 there will
be a delta.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-19  6:13     ` Andrii Nakryiko
@ 2024-11-21 18:18       ` Mark Rutland
  2024-11-26 19:13         ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Mark Rutland @ 2024-11-21 18:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Peter Zijlstra, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	Will Deacon

On Mon, Nov 18, 2024 at 10:13:04PM -0800, Andrii Nakryiko wrote:
> On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> > (as we do today) is the only reliable option, and it *shouldn't* be
> > slower than a syscall.
> >
> > Looking around, we have a different latent issue with uprobes on arm64
> > in that only certain instructions can be modified while being
> > > concurrently executed (in addition to the atomicity of updating the
> 
> What does this mean for the application in practical terms? Will it
> crash? Or will there be some corruption? Just curious how this can
> manifest.

It can result in a variety of effects including crashes, corruption of
memory, registers, issuing random syscalls, etc.

The ARM ARM (ARM DDI 0487K.a [1]) says in section B2.2.5:

  Concurrent modification and execution of instructions can lead to the
  resulting instruction performing any behavior that can be achieved by
  executing any sequence of instructions that can be executed from the
  same Exception level [...]

Which is to say basically anything might happen, except that this can't
corrupt any state userspace cannot access, and cannot provide a
mechanism to escalate privilege to a higher exception level.

So that's potentially *very bad*, and we're just getting lucky that most
implementations don't happen to do that for most instructions, though
I'm fairly certain there are implementations out there which do exhibit
this behaviour (and it gets more likely as implementations get more
aggressive).

Mark.

[1] https://developer.arm.com/documentation/ddi0487/ka/?lang=en

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines
  2024-11-21 16:47                       ` Alexei Starovoitov
@ 2024-11-21 19:38                         ` Mark Rutland
  0 siblings, 0 replies; 51+ messages in thread
From: Mark Rutland @ 2024-11-21 19:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Andrii Nakryiko, Jiri Olsa, Oleg Nesterov,
	Andrii Nakryiko, bpf, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire, LKML,
	linux-trace-kernel, H. Peter Anvin

[resending as I somehow messed up the 'From' header and got a tonne of
bounces]

On Thu, Nov 21, 2024 at 08:47:56AM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > Elsewhere in the thread Mark Rutland already noted that arm64 really
> > doesn't need or want this.
> 
> Doesn't look like you've read what you quoted above.
> On arm64 the _HW_ cost may be the same.
> The _SW_ difference in handling trap vs syscall is real.
> I bet once uprobe syscall is benchmarked on arm64 there will
> be a delta.

I already pointed out in [1] that on arm64 we can make the trap case
*faster* than the syscall. If that's not already the case, there's only
a small amount of rework needed (pulling BRK handling into
entry-common.c), which we want to do for other reasons anyway.

On arm64 I do not want the syscall; the trap is faster and simpler to
maintain.

Mark

[1] https://lore.kernel.org/lkml/ZzsRfhGSYXVK0mst@J2N7QTR9R3/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64
  2024-11-21 18:18       ` Mark Rutland
@ 2024-11-26 19:13         ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2024-11-26 19:13 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel,
	Will Deacon

On Thu, Nov 21, 2024 at 10:18 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:13:04PM -0800, Andrii Nakryiko wrote:
> > On Mon, Nov 18, 2024 at 2:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > Yep, on arm64 we definitely can't patch in branches reliably; using BRK
> > > (as we do today) is the only reliable option, and it *shouldn't* be
> > > slower than a syscall.
> > >
> > > Looking around, we have a different latent issue with uprobes on arm64
> > > in that only certain instructions can be modified while being
> > > concurrently executed (in addition to the atomicity of updating the
> >
> > What does this mean for the application in practical terms? Will it
> > crash? Or will there be some corruption? Just curious how this can
> > manifest.
>
> It can result in a variety of effects including crashes, corruption of
> memory, registers, issuing random syscalls, etc.
>
> The ARM ARM (ARM DDI 0487K.a [1]) says in section B2.2.5:
>
>   Concurrent modification and execution of instructions can lead to the
>   resulting instruction performing any behavior that can be achieved by
>   executing any sequence of instructions that can be executed from the
>   same Exception level [...]
>
> Which is to say basically anything might happen, except that this can't
> corrupt any state userspace cannot access, and cannot provide a
> mechanism to escalate privilege to a higher exception level.
>
> So that's potentially *very bad*, and we're just getting lucky that most
> implementations don't happen to do that for most instructions, though
> I'm fairly certain there are implementations out there which do exhibit
> this behaviour (and it gets more likely as implementations get more
> aggressive).
>

I see. I wonder if the fact that we do __replace_page() saves us here?
Either way, if that's a problem, it would be good for someone familiar
with ARM64 to try to address it. Ideally in a way that won't ruin the
multi-uprobe attachment speeds (i.e., not doing stop-the-world for
each of many uprobe locations to be attached, but rather doing that once
for all uprobes).

> Mark.
>
> [1] https://developer.arm.com/documentation/ddi0487/ka/?lang=en

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2024-11-26 19:13 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-05 13:33 [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 01/11] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 02/11] uprobes: Make copy_from_page global Jiri Olsa
2024-11-14 23:40   ` Andrii Nakryiko
2024-11-16 21:41     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 03/11] uprobes: Add len argument to uprobe_write_opcode Jiri Olsa
2024-11-14 23:41   ` Andrii Nakryiko
2024-11-16 21:41     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 04/11] uprobes: Add data argument to uprobe_write_opcode function Jiri Olsa
2024-11-14 23:41   ` Andrii Nakryiko
2024-11-16 21:43     ` Jiri Olsa
2024-11-05 13:33 ` [RFC perf/core 05/11] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
2024-11-05 14:23   ` Peter Zijlstra
2024-11-05 16:33     ` Jiri Olsa
2024-11-14 23:44       ` Andrii Nakryiko
2024-11-16 21:44         ` Jiri Olsa
2024-11-19  6:06           ` Andrii Nakryiko
2024-11-19  9:13             ` Peter Zijlstra
2024-11-19 15:15               ` Jiri Olsa
2024-11-21  0:07               ` Andrii Nakryiko
2024-11-21 11:53                 ` Peter Zijlstra
2024-11-21 16:02                   ` Alexei Starovoitov
2024-11-21 16:34                     ` Peter Zijlstra
2024-11-21 16:47                       ` Alexei Starovoitov
2024-11-21 19:38                         ` Mark Rutland
2024-11-14 23:44   ` Andrii Nakryiko
2024-11-16 21:44     ` Jiri Olsa
2024-11-19  6:05       ` Andrii Nakryiko
2024-11-19 15:14         ` Jiri Olsa
2024-11-21  0:10           ` Andrii Nakryiko
2024-11-05 13:34 ` [RFC perf/core 06/11] uprobes: Add uprobe syscall to speed up uprobe Jiri Olsa
2024-11-05 13:34 ` [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2024-11-14 23:44   ` Andrii Nakryiko
2024-11-16 21:44     ` Jiri Olsa
2024-11-18  8:18   ` Masami Hiramatsu
2024-11-18  9:39     ` Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 08/11] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 09/11] selftests/bpf: Add usdt trigger bench Jiri Olsa
2024-11-14 23:40   ` Andrii Nakryiko
2024-11-16 21:45     ` Jiri Olsa
2024-11-19  6:08       ` Andrii Nakryiko
2024-11-05 13:34 ` [RFC bpf-next 10/11] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
2024-11-05 13:34 ` [RFC bpf-next 11/11] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2024-11-17 11:49 ` [RFC 00/11] uprobes: Add support to optimize usdt probes on x86_64 Peter Zijlstra
2024-11-18  9:29   ` Jiri Olsa
2024-11-18 10:06   ` Mark Rutland
2024-11-19  6:13     ` Andrii Nakryiko
2024-11-21 18:18       ` Mark Rutland
2024-11-26 19:13         ` Andrii Nakryiko
2024-11-18  8:04 ` Masami Hiramatsu
2024-11-18  9:52   ` Jiri Olsa
