* [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
@ 2024-12-11 13:33 Jiri Olsa
  2024-12-11 13:33 ` [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
                   ` (14 more replies)
  0 siblings, 15 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

hi,
this patchset adds support to optimize usdt probes on top of a 5-byte
nop instruction.

The generic approach (optimizing all uprobes) is hard, because it would
need to emulate multiple possible original instructions with all the
related issues. The usdt case, which stores a 5-byte nop, seems much
easier, so we are starting with that.

The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.
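
For illustration (a simplified sketch, details are in patch 8), the 5-byte
nop at the usdt probe site gets overwritten with a 5-byte call into a
trampoline page mapped into the process:

    before:  0f 1f 44 00 00      nop5
    after:   e8 xx xx xx xx      call <[uprobes-trampoline]>

The trampoline saves the syscall-clobbered registers, invokes the new
uprobe syscall (which runs the uprobe consumers), restores the registers
and returns right after the original nop5.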

The run_bench_uprobes.sh benchmark triggers a uprobe (on top of different
original instructions) in a loop and counts how many of those happened
per second (the unit below is millions of loops per second).

There's a big speedup if you compare the current usdt implementation
(uprobe-nop) with the proposed usdt one (uprobe-nop5):

  # ./benchs/run_bench_uprobes.sh 

      usermode-count :  233.831 ± 0.257M/s
      syscall-count  :   12.107 ± 0.038M/s
  --> uprobe-nop     :    3.246 ± 0.004M/s
      uprobe-push    :    3.057 ± 0.000M/s
      uprobe-ret     :    1.113 ± 0.003M/s
  --> uprobe-nop5    :    6.751 ± 0.037M/s
      uretprobe-nop  :    1.740 ± 0.015M/s
      uretprobe-push :    1.677 ± 0.018M/s
      uretprobe-ret  :    0.852 ± 0.005M/s
      uretprobe-nop5 :    6.769 ± 0.040M/s


v1 changes:
- rebased on top of bpf-next/master
- couple of function/variable renames [Andrii]
- added nop5 emulation [Andrii]
- added checks to arch_uprobe_verify_opcode [Andrii]
- fixed arch_uprobe_is_callable/find_nearest_page [Andrii]
- used CALL_INSN_OPCODE [Masami]
- added uprobe-nop5 benchmark [Andrii]
- using atomic64_t in tramp_area [Andrii]
- using single page for all uprobe trampoline mappings

thanks,
jirka


---
Jiri Olsa (13):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Add nbytes argument to uprobe_write_opcode
      uprobes: Add arch_uprobe_verify_opcode function
      uprobes: Add mapping for optimized uprobe trampolines
      uprobes/x86: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add support to emulate nop5 instruction
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Add uprobe/usdt optimized test
      selftests/bpf: Add hit/attach/detach race optimized uprobe test
      selftests/bpf: Add uprobe syscall sigill signal test
      selftests/bpf: Add 5-byte nop uprobe trigger bench

 arch/x86/entry/syscalls/syscall_64.tbl                  |   1 +
 arch/x86/include/asm/uprobes.h                          |   7 +++
 arch/x86/kernel/uprobes.c                               | 255 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                |   2 +
 include/linux/uprobes.h                                 |  25 +++++++-
 kernel/events/uprobes.c                                 | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/fork.c                                           |   1 +
 kernel/sys_ni.c                                         |   1 +
 tools/testing/selftests/bpf/bench.c                     |  12 ++++
 tools/testing/selftests/bpf/benchs/bench_trigger.c      |  42 +++++++++++++
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/uprobe_optimized.c    |  29 +++++++++
 tools/testing/selftests/bpf/sdt.h                       |   9 ++-
 14 files changed, 880 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13  0:42   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 02/13] uprobes: Make copy_from_page global Jiri Olsa
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

We are about to add the uprobe trampoline, so let's clean up the namespace.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 2 +-
 include/linux/uprobes.h   | 2 +-
 kernel/events/uprobes.c   | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 5a952c5ea66b..22a17c149a55 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -338,7 +338,7 @@ extern u8 uretprobe_trampoline_entry[];
 extern u8 uretprobe_trampoline_end[];
 extern u8 uretprobe_syscall_check[];
 
-void *arch_uprobe_trampoline(unsigned long *psize)
+void *arch_uretprobe_trampoline(unsigned long *psize)
 {
 	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
 	struct pt_regs *regs = task_pt_regs(current);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e0a4c2082245..09a298e416a8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -211,7 +211,7 @@ extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
-extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fa04b14a7d72..e0e3ebb4c0a1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1695,7 +1695,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
-void * __weak arch_uprobe_trampoline(unsigned long *psize)
+void * __weak arch_uretprobe_trampoline(unsigned long *psize)
 {
 	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
 
@@ -1727,7 +1727,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	init_waitqueue_head(&area->wq);
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
-	insns = arch_uprobe_trampoline(&insns_size);
+	insns = arch_uretprobe_trampoline(&insns_size);
 	arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 02/13] uprobes: Make copy_from_page global
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
  2024-12-11 13:33 ` [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13  0:43   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode Jiri Olsa
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Making copy_from_page global and adding the uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 16 ++++++++--------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 09a298e416a8..e24fbe496efb 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
+extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e0e3ebb4c0a1..61ec91f635dc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -249,14 +249,14 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
 	return is_swbp_insn(insn);
 }
 
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
 {
 	void *kaddr = kmap_atomic(page);
 	memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
 	kunmap_atomic(kaddr);
 }
 
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
 {
 	void *kaddr = kmap_atomic(page);
 	memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
@@ -277,7 +277,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	 * is a trap variant; uprobes always wins over any other (gdb)
 	 * breakpoint.
 	 */
-	copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
 	is_swbp = is_swbp_insn(&old_opcode);
 
 	if (is_swbp_insn(new_opcode)) {
@@ -524,7 +524,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	__SetPageUptodate(new_page);
 	copy_highpage(new_page, old_page);
-	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
 	if (!is_register) {
 		struct page *orig_page;
@@ -1026,7 +1026,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
 	if (IS_ERR(page))
 		return PTR_ERR(page);
 
-	copy_from_page(page, offset, insn, nbytes);
+	uprobe_copy_from_page(page, offset, insn, nbytes);
 	put_page(page);
 
 	return 0;
@@ -1367,7 +1367,7 @@ struct uprobe *uprobe_register(struct inode *inode,
 		return ERR_PTR(-EINVAL);
 
 	/*
-	 * This ensures that copy_from_page(), copy_to_page() and
+	 * This ensures that uprobe_copy_from_page(), uprobe_copy_to_page() and
 	 * __update_ref_ctr() can't cross page boundary.
 	 */
 	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
@@ -1856,7 +1856,7 @@ void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 				  void *src, unsigned long len)
 {
 	/* Initialize the slot */
-	copy_to_page(page, vaddr, src, len);
+	uprobe_copy_to_page(page, vaddr, src, len);
 
 	/*
 	 * We probably need flush_icache_user_page() but it needs vma.
@@ -2287,7 +2287,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	if (result < 0)
 		return result;
 
-	copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 	put_page(page);
  out:
 	/* This needs to return true for any variant of the trap insn */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
  2024-12-11 13:33 ` [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
  2024-12-11 13:33 ` [PATCH bpf-next 02/13] uprobes: Make copy_from_page global Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13  0:45   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function Jiri Olsa
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding the nbytes argument to uprobe_write_opcode as preparation
for writing longer instructions in the following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  3 ++-
 kernel/events/uprobes.c | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e24fbe496efb..cc723bc48c1d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -181,7 +181,8 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
 extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
+			       uprobe_opcode_t *insn, int nbytes);
 extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 61ec91f635dc..7c2ecf11a573 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -470,7 +470,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
-			unsigned long vaddr, uprobe_opcode_t opcode)
+			unsigned long vaddr, uprobe_opcode_t *insn, int nbytes)
 {
 	struct uprobe *uprobe;
 	struct page *old_page, *new_page;
@@ -479,7 +479,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
 
-	is_register = is_swbp_insn(&opcode);
+	is_register = is_swbp_insn(insn);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
@@ -490,7 +490,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (IS_ERR(old_page))
 		return PTR_ERR(old_page);
 
-	ret = verify_opcode(old_page, vaddr, &opcode);
+	ret = verify_opcode(old_page, vaddr, insn);
 	if (ret <= 0)
 		goto put_old;
 
@@ -524,7 +524,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	__SetPageUptodate(new_page);
 	copy_highpage(new_page, old_page);
-	uprobe_copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+	uprobe_copy_to_page(new_page, vaddr, insn, nbytes);
 
 	if (!is_register) {
 		struct page *orig_page;
@@ -581,7 +581,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
  */
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
+	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	return uprobe_write_opcode(auprobe, mm, vaddr, &insn, UPROBE_SWBP_INSN_SIZE);
 }
 
 /**
@@ -597,7 +599,7 @@ int __weak
 set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
 	return uprobe_write_opcode(auprobe, mm, vaddr,
-			*(uprobe_opcode_t *)&auprobe->insn);
+			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
 }
 
 /* uprobe should have guaranteed positive refcount */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (2 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13  0:48   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding the arch_uprobe_verify_opcode function, so we can override the
verification for each architecture in the following changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  5 +++++
 kernel/events/uprobes.c | 19 ++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index cc723bc48c1d..8843b7f99ed0 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -215,6 +215,11 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
 extern void *arch_uretprobe_trampoline(unsigned long *psize);
 extern unsigned long uprobe_get_trampoline_vaddr(void);
 extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
+extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
+extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
+				     unsigned long vaddr, uprobe_opcode_t *new_opcode,
+				     int nbytes);
+extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7c2ecf11a573..8068f91de9e3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -263,7 +263,13 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
 	kunmap_atomic(kaddr);
 }
 
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
+{
+	return is_swbp_insn(insn);
+}
+
+int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+			 uprobe_opcode_t *new_opcode)
 {
 	uprobe_opcode_t old_opcode;
 	bool is_swbp;
@@ -291,6 +297,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	return 1;
 }
 
+__weak int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
+				     unsigned long vaddr, uprobe_opcode_t *new_opcode,
+				     int nbytes)
+{
+	return uprobe_verify_opcode(page, vaddr, new_opcode);
+}
+
 static struct delayed_uprobe *
 delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
 {
@@ -479,7 +492,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	bool orig_page_huge = false;
 	unsigned int gup_flags = FOLL_FORCE;
 
-	is_register = is_swbp_insn(insn);
+	is_register = arch_uprobe_is_register(insn, nbytes);
 	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
@@ -490,7 +503,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (IS_ERR(old_page))
 		return PTR_ERR(old_page);
 
-	ret = verify_opcode(old_page, vaddr, insn);
+	ret = arch_uprobe_verify_opcode(auprobe, old_page, vaddr, insn, nbytes);
 	if (ret <= 0)
 		goto put_old;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (3 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13  1:01   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding support to create a special mapping for the user space trampoline,
with the following functions:

  uprobe_trampoline_get - find or add related uprobe_trampoline
  uprobe_trampoline_put - remove ref or destroy uprobe_trampoline

The user space trampoline is exported as an architecture-specific user
space special mapping, which is provided by the
arch_uprobe_trampoline_mapping function.

The uprobe trampoline needs to be callable/reachable from the probe
address, so while searching for an available address we use the
arch_uprobe_is_callable function to decide whether the trampoline is
reachable from the probe address.

All uprobe_trampoline objects are stored in the uprobes_state object and
are cleaned up when the process mm_struct goes down.

Locking is provided by the callers in the following changes.
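
A minimal usage sketch of the API (a hypothetical caller; the real user of
these helpers is the x86 optimization code added later in the series):

  static int try_trampoline(unsigned long vaddr)
  {
          struct uprobe_trampoline *tramp;

          /* Find a trampoline reachable from vaddr or create a new one. */
          tramp = uprobe_trampoline_get(vaddr);
          if (!tramp)
                  return -ENOMEM;

          /* ... write a call from vaddr to tramp->vaddr here ... */

          /* Drop the reference (frees the tracking object on the last put). */
          uprobe_trampoline_put(tramp);
          return 0;
  }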

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  12 +++++
 kernel/events/uprobes.c | 114 ++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |   1 +
 3 files changed, 127 insertions(+)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8843b7f99ed0..c4ee755ca2a1 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/wait.h>
 #include <linux/timer.h>
+#include <linux/mutex.h>
 
 struct uprobe;
 struct vm_area_struct;
@@ -172,6 +173,13 @@ struct xol_area;
 
 struct uprobes_state {
 	struct xol_area		*xol_area;
+	struct hlist_head	tramp_head;
+};
+
+struct uprobe_trampoline {
+	struct hlist_node	node;
+	unsigned long		vaddr;
+	atomic64_t		ref;
 };
 
 extern void __init uprobes_init(void);
@@ -220,6 +228,10 @@ extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *p
 				     unsigned long vaddr, uprobe_opcode_t *new_opcode,
 				     int nbytes);
 extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
+extern struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr);
+extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
+extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
+extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8068f91de9e3..f57918c624da 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -615,6 +615,118 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
 }
 
+bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	return false;
+}
+
+const struct vm_special_mapping * __weak arch_uprobe_trampoline_mapping(void)
+{
+	return NULL;
+}
+
+static unsigned long find_nearest_page(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	prev = vma_next(&vmi);
+	vma = vma_next(&vmi);
+	while (vma) {
+		if (vma->vm_start - prev->vm_end  >= PAGE_SIZE) {
+			if (arch_uprobe_is_callable(prev->vm_end, vaddr))
+				return prev->vm_end;
+			if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
+				return vma->vm_start - PAGE_SIZE;
+		}
+
+		prev = vma;
+		vma = vma_next(&vmi);
+	}
+
+	return 0;
+}
+
+static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
+{
+	const struct vm_special_mapping *mapping;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct uprobe_trampoline *tramp;
+
+	mapping = arch_uprobe_trampoline_mapping();
+	if (!mapping)
+		return NULL;
+
+	vaddr = find_nearest_page(vaddr);
+	if (!vaddr)
+		return NULL;
+
+	tramp = kzalloc(sizeof(*tramp), GFP_KERNEL);
+	if (unlikely(!tramp))
+		return NULL;
+
+	atomic64_set(&tramp->ref, 1);
+	tramp->vaddr = vaddr;
+
+	vma = _install_special_mapping(mm, tramp->vaddr, PAGE_SIZE,
+				VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+				mapping);
+	if (IS_ERR(vma))
+		goto free_area;
+	return tramp;
+
+ free_area:
+	kfree(tramp);
+	return NULL;
+}
+
+struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
+{
+	struct uprobes_state *state = &current->mm->uprobes_state;
+	struct uprobe_trampoline *tramp = NULL;
+
+	hlist_for_each_entry(tramp, &state->tramp_head, node) {
+		if (arch_uprobe_is_callable(tramp->vaddr, vaddr)) {
+			atomic64_inc(&tramp->ref);
+			return tramp;
+		}
+	}
+
+	tramp = create_uprobe_trampoline(vaddr);
+	if (!tramp)
+		return NULL;
+
+	hlist_add_head(&tramp->node, &state->tramp_head);
+	return tramp;
+}
+
+static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
+{
+	hlist_del(&tramp->node);
+	kfree(tramp);
+}
+
+void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
+{
+	if (tramp == NULL)
+		return;
+
+	if (atomic64_dec_and_test(&tramp->ref))
+		destroy_uprobe_trampoline(tramp);
+}
+
+static void clear_tramp_head(struct mm_struct *mm)
+{
+	struct uprobes_state *state = &mm->uprobes_state;
+	struct uprobe_trampoline *tramp;
+	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(tramp, n, &state->tramp_head, node)
+		destroy_uprobe_trampoline(tramp);
+}
+
 /* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
@@ -1787,6 +1899,8 @@ void uprobe_clear_state(struct mm_struct *mm)
 	delayed_uprobe_remove(NULL, mm);
 	mutex_unlock(&delayed_uprobe_lock);
 
+	clear_tramp_head(mm);
+
 	if (!area)
 		return;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196..b734a172fd6e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1254,6 +1254,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
 	mm->uprobes_state.xol_area = NULL;
+	INIT_HLIST_HEAD(&mm->uprobes_state.tramp_head);
 #endif
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (4 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13 13:48   ` Thomas Weißschuh
  2024-12-11 13:33 ` [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a new uprobe syscall that calls the uprobe handlers for a given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.
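
For reference, with the trampoline added below (push rcx; push r11;
push rax; syscall), the user stack at sys_uprobe entry looks like this
(a sketch, offsets relative to regs->sp):

  regs->sp + 0*8   saved rax
  regs->sp + 1*8   saved r11
  regs->sp + 2*8   saved rcx
  regs->sp + 3*8   return address of the call at the probe site,
                   i.e. the original 'breakpoint' address + 5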

Adding the arch_uprobe_trampoline_mapping function that provides the
uprobe trampoline mapping. This mapping is backed by one global page
initialized at __init time and shared by all the mapping instances.

We do not allow the uprobe syscall to be executed if the caller is not
from the uprobe trampoline mapping.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/kernel/uprobes.c              | 80 ++++++++++++++++++++++++++
 include/linux/syscalls.h               |  2 +
 include/linux/uprobes.h                |  1 +
 kernel/events/uprobes.c                | 22 +++++++
 kernel/sys_ni.c                        |  1 +
 6 files changed, 107 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5eb708bff1c7..88e388c7675b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 333	common	io_pgetevents		sys_io_pgetevents
 334	common	rseq			sys_rseq
 335	common	uretprobe		sys_uretprobe
+336	common	uprobe			sys_uprobe
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
 424	common	pidfd_send_signal	sys_pidfd_send_signal
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 22a17c149a55..23e4f2821cff 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -425,6 +425,86 @@ SYSCALL_DEFINE0(uretprobe)
 	return -1;
 }
 
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+	return -EPERM;
+}
+
+static struct vm_special_mapping tramp_mapping = {
+	.name   = "[uprobes-trampoline]",
+	.mremap = tramp_mremap,
+};
+
+SYSCALL_DEFINE0(uprobe)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	struct vm_area_struct *vma;
+	unsigned long bp_vaddr;
+	int err;
+
+	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
+	if (err) {
+		force_sig(SIGILL);
+		return -1;
+	}
+
+	/* Allow execution only from uprobe trampolines. */
+	vma = vma_lookup(current->mm, regs->ip);
+	if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
+		force_sig(SIGILL);
+		return -1;
+	}
+
+	handle_syscall_uprobe(regs, bp_vaddr - 5);
+	return 0;
+}
+
+asm (
+	".pushsection .rodata\n"
+	".global uprobe_trampoline_entry\n"
+	"uprobe_trampoline_entry:\n"
+	"endbr64\n"
+	"push %rcx\n"
+	"push %r11\n"
+	"push %rax\n"
+	"movq $" __stringify(__NR_uprobe) ", %rax\n"
+	"syscall\n"
+	"pop %rax\n"
+	"pop %r11\n"
+	"pop %rcx\n"
+	"ret\n"
+	".global uprobe_trampoline_end\n"
+	"uprobe_trampoline_end:\n"
+	".popsection\n"
+);
+
+extern __visible u8 uprobe_trampoline_entry[];
+extern __visible u8 uprobe_trampoline_end[];
+
+const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	return user_64bit_mode(regs) ? &tramp_mapping : NULL;
+}
+
+static int __init arch_uprobes_init(void)
+{
+	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
+	static struct page *pages[2];
+	struct page *page;
+
+	page = alloc_page(GFP_HIGHUSER);
+	if (!page)
+		return -ENOMEM;
+	pages[0] = page;
+	tramp_mapping.pages = (struct page **) &pages;
+	arch_uprobe_copy_ixol(page, 0, uprobe_trampoline_entry, size);
+	return 0;
+}
+
+late_initcall(arch_uprobes_init);
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c6333204d451..002f4e1debe5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -994,6 +994,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
 asmlinkage long sys_uretprobe(void);
 
+asmlinkage long sys_uprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
 				unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index c4ee755ca2a1..5e9a33bfb747 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -232,6 +232,7 @@ extern struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr);
 extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
 extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
+extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f57918c624da..52f38d1ef276 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2729,6 +2729,28 @@ static void handle_swbp(struct pt_regs *regs)
 	rcu_read_unlock_trace();
 }
 
+void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
+{
+	struct uprobe *uprobe;
+	int is_swbp;
+
+	rcu_read_lock_trace();
+	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
+	if (!uprobe)
+		goto unlock;
+
+	if (!get_utask())
+		goto unlock;
+
+	if (arch_uprobe_ignore(&uprobe->arch, regs))
+		goto unlock;
+
+	handler_chain(uprobe, regs);
+
+ unlock:
+	rcu_read_unlock_trace();
+}
+
 /*
  * Perform required fix-ups and disable singlestep.
  * Allow pending signals to take effect.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..bf5d05c635ff 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
 COND_SYSCALL(rseq);
 
 COND_SYSCALL(uretprobe);
+COND_SYSCALL(uprobe);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (5 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13 10:45   ` Peter Zijlstra
  2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding support to emulate nop5 as the original uprobe instruction.

This speeds up uprobes on top of nop5 instructions:
(results from benchs/run_bench_uprobes.sh)

current:

     uprobe-nop     :    3.252 ± 0.019M/s
     uprobe-push    :    3.097 ± 0.002M/s
     uprobe-ret     :    1.116 ± 0.001M/s
 --> uprobe-nop5    :    1.115 ± 0.001M/s
     uretprobe-nop  :    1.731 ± 0.016M/s
     uretprobe-push :    1.673 ± 0.023M/s
     uretprobe-ret  :    0.843 ± 0.009M/s
 --> uretprobe-nop5 :    1.124 ± 0.001M/s

after the change:

     uprobe-nop     :    3.281 ± 0.003M/s
     uprobe-push    :    3.085 ± 0.003M/s
     uprobe-ret     :    1.130 ± 0.000M/s
 --> uprobe-nop5    :    3.276 ± 0.007M/s
     uretprobe-nop  :    1.716 ± 0.016M/s
     uretprobe-push :    1.651 ± 0.017M/s
     uretprobe-ret  :    0.846 ± 0.006M/s
 --> uretprobe-nop5 :    3.279 ± 0.002M/s

Strangely, uretprobe-nop5 is now much faster compared to uretprobe-nop,
while the perf profiles for both are almost identical. I'm still looking
into that.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/uprobes.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 23e4f2821cff..cdea97f8cd39 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -909,6 +909,11 @@ static const struct uprobe_xol_ops push_xol_ops = {
 	.emulate  = push_emulate_op,
 };
 
+static int is_nop5_insn(uprobe_opcode_t *insn)
+{
+	return !memcmp(insn, x86_nops[5], 5);
+}
+
 /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
 static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 {
@@ -928,6 +933,8 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 		break;
 
 	case 0x0f:
+		if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn))
+			goto setup;
 		if (insn->opcode.nbytes != 2)
 			return -ENOSYS;
 		/*
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (6 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13 10:49   ` Peter Zijlstra
                     ` (2 more replies)
  2024-12-11 13:33 ` [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
                   ` (6 subsequent siblings)
  14 siblings, 3 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Putting together all the previously added pieces to support optimized
uprobes on top of the 5-byte nop instruction.

The current uprobe execution goes through the following steps:
  - installs the breakpoint instruction over the original instruction
  - the exception handler is hit and calls the related uprobe consumers
  - either simulates the original instruction or does out-of-line
    single-step execution of it
  - returns to user space

The optimized uprobe path

  - checks that the original instruction is a 5-byte nop (plus other checks)
  - adds a user space trampoline (or uses an existing one) and overwrites
    the original instruction (5-byte nop) with a call to the trampoline
  - the user space trampoline executes the uprobe syscall, which calls the
    related uprobe consumers
  - the trampoline returns back to the next instruction

This approach won't speed up all uprobes, as it's limited to using nop5 as
the original instruction, but we can use nop5 as the USDT probe instruction
(which currently uses a single-byte nop) and speed up USDT probes.
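
The probe site rewrite itself is a single 5-byte relative call (a sketch;
the actual encoding is produced by relative_call() in this patch):

  before:  0f 1f 44 00 00     nop5
  after:   e8 <rel32>         call <[uprobes-trampoline]>

  rel32 = trampoline vaddr - (probe vaddr + 5), which is why the
  trampoline must be within +-2GB of the probe site (see
  arch_uprobe_is_callable).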

This patch overrides the related arch functions used by uprobe_write_opcode
and set_orig_insn so they can install the call instruction if needed.

The arch_uprobe_optimize function triggers the uprobe optimization and is
called after the first uprobe hit. I originally had it called on uprobe
installation, but it clashed with the elf loader, because the user space
trampoline was added in a place where the loader might need to put elf
segments, so I decided to do it after the first uprobe hit, when loading
is done.

We do not unmap and release the uprobe trampoline when it's no longer
needed, because there's no easy way to make sure none of the threads is
still inside the trampoline. But we do not waste memory, because there's
just a single page backing all the uprobe trampoline mappings.

We do waste a frame on the page mapping for every 4GB by keeping the
uprobe trampoline page mapped, but that seems ok.

Attaching the speed up from benchs/run_bench_uprobes.sh script:

current:

     uprobe-nop     :    3.281 ± 0.003M/s
     uprobe-push    :    3.085 ± 0.003M/s
     uprobe-ret     :    1.130 ± 0.000M/s
 --> uprobe-nop5    :    3.276 ± 0.007M/s
     uretprobe-nop  :    1.716 ± 0.016M/s
     uretprobe-push :    1.651 ± 0.017M/s
     uretprobe-ret  :    0.846 ± 0.006M/s
 --> uretprobe-nop5 :    3.279 ± 0.002M/s

after the change:

     uprobe-nop     :    3.246 ± 0.004M/s
     uprobe-push    :    3.057 ± 0.000M/s
     uprobe-ret     :    1.113 ± 0.003M/s
 --> uprobe-nop5    :    6.751 ± 0.037M/s
     uretprobe-nop  :    1.740 ± 0.015M/s
     uretprobe-push :    1.677 ± 0.018M/s
     uretprobe-ret  :    0.852 ± 0.005M/s
 --> uretprobe-nop5 :    6.769 ± 0.040M/s

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/include/asm/uprobes.h |   7 ++
 arch/x86/kernel/uprobes.c      | 168 ++++++++++++++++++++++++++++++++-
 include/linux/uprobes.h        |   1 +
 kernel/events/uprobes.c        |   8 ++
 4 files changed, 181 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 678fb546f0a7..84a75ed748f0 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
 #define UPROBE_SWBP_INSN		0xcc
 #define UPROBE_SWBP_INSN_SIZE		   1
 
+enum {
+	ARCH_UPROBE_FLAG_CAN_OPTIMIZE	= 0,
+	ARCH_UPROBE_FLAG_OPTIMIZED	= 1,
+};
+
 struct uprobe_xol_ops;
 
 struct arch_uprobe {
@@ -45,6 +50,8 @@ struct arch_uprobe {
 			u8	ilen;
 		}			push;
 	};
+
+	unsigned long flags;
 };
 
 struct arch_uprobe_task {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index cdea97f8cd39..b2420eeee23a 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -18,6 +18,7 @@
 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/nops.h>
 
 /* Post-execution fixups. */
 
@@ -914,8 +915,37 @@ static int is_nop5_insn(uprobe_opcode_t *insn)
 	return !memcmp(insn, x86_nops[5], 5);
 }
 
+static int is_call_insn(uprobe_opcode_t *insn)
+{
+	return *insn == CALL_INSN_OPCODE;
+}
+
+static void relative_insn(void *dest, void *from, void *to, u8 op)
+{
+	struct __arch_relative_insn {
+		u8 op;
+		s32 raddr;
+	} __packed *insn;
+
+	insn = (struct __arch_relative_insn *)dest;
+	insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
+	insn->op = op;
+}
+
+static void relative_call(void *dest, void *from, void *to)
+{
+	relative_insn(dest, from, to, CALL_INSN_OPCODE);
+}
+
+static bool can_optimize_vaddr(unsigned long vaddr)
+{
+	/* We can't do cross page atomic writes yet. */
+	return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+}
+
 /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
-static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn,
+				unsigned long vaddr)
 {
 	u8 opc1 = OPCODE1(insn);
 	insn_byte_t p;
@@ -933,8 +963,11 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 		break;
 
 	case 0x0f:
-		if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn))
+		if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn)) {
+			if (can_optimize_vaddr(vaddr))
+				set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
 			goto setup;
+		}
 		if (insn->opcode.nbytes != 2)
 			return -ENOSYS;
 		/*
@@ -1065,7 +1098,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret)
 		return ret;
 
-	ret = branch_setup_xol_ops(auprobe, &insn);
+	ret = branch_setup_xol_ops(auprobe, &insn, addr);
 	if (ret != -ENOSYS)
 		return ret;
 
@@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
 	else
 		return regs->sp <= ret->stack;
 }
+
+int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
+			      unsigned long vaddr, uprobe_opcode_t *new_opcode,
+			      int nbytes)
+{
+	uprobe_opcode_t old_opcode[5];
+	bool is_call, is_swbp, is_nop5;
+
+	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
+		return uprobe_verify_opcode(page, vaddr, new_opcode);
+
+	/*
+	 * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following
+	 * 5 bytes read won't cross the page boundary.
+	 */
+	uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
+	is_call = is_call_insn((uprobe_opcode_t *) &old_opcode);
+	is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode);
+	is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode);
+
+	/*
+	 * We allow the following transitions for optimized uprobes:
+	 *
+	 *   nop5 -> swbp -> call
+	 *   ||      |       |
+	 *   |'--<---'       |
+	 *   '---<-----------'
+	 *
+	 * We return 1 to ack the write, 0 to do nothing, -1 to fail write.
+	 *
+	 * If the current opcode (old_opcode) has already desired value,
+	 * we do nothing, because we are racing with another thread doing
+	 * the update.
+	 */
+	switch (nbytes) {
+	case 5:
+		if (is_call_insn(new_opcode)) {
+			if (is_swbp)
+				return 1;
+			if (is_call && !memcmp(new_opcode, &old_opcode, 5))
+				return 0;
+		} else {
+			if (is_call || is_swbp)
+				return 1;
+			if (is_nop5)
+				return 0;
+		}
+		break;
+	case 1:
+		if (is_swbp_insn(new_opcode)) {
+			if (is_nop5)
+				return 1;
+			if (is_swbp || is_call)
+				return 0;
+		} else {
+			if (is_swbp || is_call)
+				return 1;
+			if (is_nop5)
+				return 0;
+		}
+	}
+	return -1;
+}
+
+bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
+{
+	return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn);
+}
+
+static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
+				   unsigned long vaddr)
+{
+	struct uprobe_trampoline *tramp;
+	char call[5];
+
+	tramp = uprobe_trampoline_get(vaddr);
+	if (!tramp)
+		goto fail;
+
+	relative_call(call, (void *) vaddr, (void *) tramp->vaddr);
+	if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5))
+		goto fail;
+
+	set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags);
+	return;
+
+fail:
+	/* Once we fail we never try again. */
+	clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+	uprobe_trampoline_put(tramp);
+}
+
+static bool should_optimize(struct arch_uprobe *auprobe)
+{
+	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
+		return false;
+	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
+		return false;
+	return true;
+}
+
+void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (!should_optimize(auprobe))
+		return;
+
+	mmap_write_lock(mm);
+	if (should_optimize(auprobe))
+		__arch_uprobe_optimize(auprobe, mm, vaddr);
+	mmap_write_unlock(mm);
+}
+
+int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+	uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn;
+
+	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
+		return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5);
+
+	return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE);
+}
+
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	long delta = (long)(vaddr + 5 - vtramp);
+	return delta >= INT_MIN && delta <= INT_MAX;
+}
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 5e9a33bfb747..1b14b9f2f8d0 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -233,6 +233,7 @@ extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
 extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
 extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
+extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 52f38d1ef276..a7a3eeec9e51 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2654,6 +2654,11 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 	return true;
 }
 
+void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+	return;
+}
+
 /*
  * Run handler and ask thread to singlestep.
  * Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
@@ -2718,6 +2723,9 @@ static void handle_swbp(struct pt_regs *regs)
 
 	handler_chain(uprobe, regs);
 
+	/* Try to optimize after first hit. */
+	arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
+
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
 		goto out;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (7 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-11 13:33 ` [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Using a 5-byte nop for x86 usdt probes so we can switch them to
optimized uprobes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
index ca0162b4dc57..7ac9291f45f1 100644
--- a/tools/testing/selftests/bpf/sdt.h
+++ b/tools/testing/selftests/bpf/sdt.h
@@ -234,6 +234,13 @@ __extension__ extern unsigned long long __sdt_unsp;
 #define _SDT_NOP	nop
 #endif
 
+/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
+#if defined(__x86_64__)
+# define _SDT_DEF_NOP _SDT_ASM_5(990:	.byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
+#else
+# define _SDT_DEF_NOP _SDT_ASM_1(990:	_SDT_NOP)
+#endif
+
 #define _SDT_NOTE_NAME	"stapsdt"
 #define _SDT_NOTE_TYPE	3
 
@@ -286,7 +293,7 @@ __extension__ extern unsigned long long __sdt_unsp;
 
 #define _SDT_ASM_BODY(provider, name, pack_args, args, ...)		      \
   _SDT_DEF_MACROS							      \
-  _SDT_ASM_1(990:	_SDT_NOP)					      \
+  _SDT_DEF_NOP								      \
   _SDT_ASM_3(		.pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
   _SDT_ASM_1(		.balign 4)					      \
   _SDT_ASM_3(		.4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE)	      \
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (8 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
@ 2024-12-11 13:33 ` Jiri Olsa
  2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-11 13:34 ` [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:33 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding tests for optimized uprobe/usdt probes.

Checking that we get the expected trampoline and that the attached bpf
programs get executed properly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 203 ++++++++++++++++++
 .../selftests/bpf/progs/uprobe_optimized.c    |  29 +++
 2 files changed, 232 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index c397336fe1ed..1dbc26a1130c 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -14,6 +14,8 @@
 #include <asm/prctl.h>
 #include "uprobe_syscall.skel.h"
 #include "uprobe_syscall_executed.skel.h"
+#include "uprobe_optimized.skel.h"
+#include "sdt.h"
 
 __naked unsigned long uretprobe_regs_trigger(void)
 {
@@ -350,6 +352,186 @@ static void test_uretprobe_shadow_stack(void)
 
 	ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
 }
+
+#define TRAMP "[uprobes-trampoline]"
+
+static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+
+noinline void uprobe_test(void)
+{
+	asm volatile ("					\n"
+		".global uprobe_test_nop5		\n"
+		".type uprobe_test_nop5, STT_FUNC	\n"
+		"uprobe_test_nop5:			\n"
+		".byte 0x0f, 0x1f, 0x44, 0x00, 0x00	\n"
+	);
+}
+
+extern u8 uprobe_test_nop5[];
+
+noinline void usdt_test(void)
+{
+	STAP_PROBE(optimized_uprobe, usdt);
+}
+
+static void *find_nop5(void *fn)
+{
+	int i;
+
+	for (i = 0; i < 10; i++) {
+		if (!memcmp(nop5, fn + i, 5))
+			return fn + i;
+	}
+	return NULL;
+}
+
+static int find_uprobes_trampoline(void **start, void **end)
+{
+	char line[128];
+	int ret = -1;
+	FILE *maps;
+
+	maps = fopen("/proc/self/maps", "r");
+	if (!maps) {
+		fprintf(stderr, "cannot open maps\n");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), maps)) {
+		int m = -1;
+
+		/* We care only about private r-x mappings. */
+		if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", start, end, &m) != 2)
+			continue;
+		if (m < 0)
+			continue;
+		if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1)) {
+			ret = 0;
+			break;
+		}
+	}
+
+	fclose(maps);
+	return ret;
+}
+
+static void check_attach(struct uprobe_optimized *skel, void (*trigger)(void), void *addr)
+{
+	void *tramp_start, *tramp_end;
+	struct __arch_relative_insn {
+		u8 op;
+		s32 raddr;
+	} __packed *call;
+
+	s32 delta;
+
+	/* Uprobe gets optimized after the first trigger, so trigger it twice. */
+	trigger();
+	trigger();
+
+	if (!ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline"))
+		return;
+
+	/* Make sure bpf program got executed.. */
+	ASSERT_EQ(skel->bss->executed, 2, "executed");
+
+	/* .. and check the trampoline is as expected. */
+	call = (struct __arch_relative_insn *) addr;
+	delta = (unsigned long) tramp_start - ((unsigned long) addr + 5);
+
+	ASSERT_EQ(call->op, 0xe8, "call");
+	ASSERT_EQ(call->raddr, delta, "delta");
+	ASSERT_EQ(tramp_end - tramp_start, 4096, "size");
+}
+
+static void check_detach(struct uprobe_optimized *skel, void (*trigger)(void), void *addr)
+{
+	void *tramp_start, *tramp_end;
+
+	/* [uprobes_trampoline] stays after detach */
+	ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline");
+	ASSERT_OK(memcmp(addr, nop5, 5), "nop5");
+}
+
+static void check(struct uprobe_optimized *skel, struct bpf_link *link,
+		  void (*trigger)(void), void *addr)
+{
+	check_attach(skel, trigger, addr);
+	bpf_link__destroy(link);
+	check_detach(skel, trigger, addr);
+}
+
+static void test_uprobe_legacy(void)
+{
+	struct uprobe_optimized *skel;
+	struct bpf_link *link;
+	unsigned long offset;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	offset = get_uprobe_offset(&uprobe_test_nop5);
+	if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+		goto cleanup;
+
+	link = bpf_program__attach_uprobe_opts(skel->progs.test_1,
+				0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+		goto cleanup;
+
+	check(skel, link, uprobe_test, uprobe_test_nop5);
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
+
+static void test_uprobe_multi(void)
+{
+	struct uprobe_optimized *skel;
+	struct bpf_link *link;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	link = bpf_program__attach_uprobe_multi(skel->progs.test_2,
+				0, "/proc/self/exe", "uprobe_test_nop5", NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+		goto cleanup;
+
+	check(skel, link, uprobe_test, uprobe_test_nop5);
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
+
+static void test_uprobe_usdt(void)
+{
+	struct uprobe_optimized *skel;
+	struct bpf_link *link;
+	void *addr;
+
+	errno = 0;
+	addr = find_nop5(usdt_test);
+	if (!ASSERT_OK_PTR(addr, "find_nop5"))
+		return;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		return;
+
+	link = bpf_program__attach_usdt(skel->progs.test_3,
+				-1 /* all PIDs */, "/proc/self/exe",
+				"optimized_uprobe", "usdt", NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+		goto cleanup;
+
+	check(skel, link, usdt_test, addr);
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -370,6 +552,21 @@ static void test_uretprobe_shadow_stack(void)
 {
 	test__skip();
 }
+
+static void test_uprobe_legacy(void)
+{
+	test__skip();
+}
+
+static void test_uprobe_multi(void)
+{
+	test__skip();
+}
+
+static void test_uprobe_usdt(void)
+{
+	test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -382,4 +579,10 @@ void test_uprobe_syscall(void)
 		test_uretprobe_syscall_call();
 	if (test__start_subtest("uretprobe_shadow_stack"))
 		test_uretprobe_shadow_stack();
+	if (test__start_subtest("uprobe_legacy"))
+		test_uprobe_legacy();
+	if (test__start_subtest("uprobe_multi"))
+		test_uprobe_multi();
+	if (test__start_subtest("uprobe_usdt"))
+		test_uprobe_usdt();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_optimized.c b/tools/testing/selftests/bpf/progs/uprobe_optimized.c
new file mode 100644
index 000000000000..2441d59960a6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_optimized.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/usdt.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+unsigned long executed = 0;
+
+SEC("uprobe")
+int BPF_UPROBE(test_1)
+{
+	executed++;
+	return 0;
+}
+
+SEC("uprobe.multi")
+int BPF_UPROBE(test_2)
+{
+	executed++;
+	return 0;
+}
+
+SEC("usdt")
+int test_3(struct pt_regs *ctx)
+{
+	executed++;
+	return 0;
+}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (9 preceding siblings ...)
  2024-12-11 13:33 ` [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
@ 2024-12-11 13:34 ` Jiri Olsa
  2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-11 13:34 ` [PATCH bpf-next 12/13] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Adding a test that makes sure parallel execution of the uprobe and
attach/detach of the optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 82 +++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 1dbc26a1130c..eacd14db8f8d 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -532,6 +532,81 @@ static void test_uprobe_usdt(void)
 cleanup:
 	uprobe_optimized__destroy(skel);
 }
+
+static bool race_stop;
+
+static void *worker_trigger(void *arg)
+{
+	unsigned long rounds = 0;
+
+	while (!race_stop) {
+		uprobe_test();
+		rounds++;
+	}
+
+	printf("tid %d trigger rounds: %lu\n", gettid(), rounds);
+	return NULL;
+}
+
+static void *worker_attach(void *arg)
+{
+	struct uprobe_optimized *skel;
+	unsigned long rounds = 0;
+
+	skel = uprobe_optimized__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
+		goto cleanup;
+
+	while (!race_stop) {
+		skel->links.test_2 = bpf_program__attach_uprobe_multi(skel->progs.test_2, -1,
+						"/proc/self/exe", "uprobe_test_nop5", NULL);
+		if (!ASSERT_OK_PTR(skel->links.test_2, "bpf_program__attach_uprobe_multi"))
+			break;
+		bpf_link__destroy(skel->links.test_2);
+		skel->links.test_2 = NULL;
+		rounds++;
+	}
+
+	printf("tid %d attach rounds: %lu hits: %lu\n", gettid(), rounds, skel->bss->executed);
+
+cleanup:
+	uprobe_optimized__destroy(skel);
+	return NULL;
+}
+
+static void test_uprobe_race(void)
+{
+	int err, i, nr_cpus, nr;
+	pthread_t *threads;
+
+	nr_cpus = libbpf_num_possible_cpus();
+	if (!ASSERT_GE(nr_cpus, 0, "nr_cpus"))
+		return;
+
+	nr = nr_cpus * 2;
+	threads = malloc(sizeof(*threads) * nr);
+	if (!ASSERT_OK_PTR(threads, "malloc"))
+		return;
+
+	for (i = 0; i < nr_cpus; i++) {
+		err = pthread_create(&threads[i], NULL, worker_trigger, NULL);
+		if (!ASSERT_OK(err, "pthread_create"))
+			goto cleanup;
+	}
+
+	for (; i < nr; i++) {
+		err = pthread_create(&threads[i], NULL, worker_attach, NULL);
+		if (!ASSERT_OK(err, "pthread_create"))
+			goto cleanup;
+	}
+
+	sleep(4);
+
+cleanup:
+	race_stop = true;
+	for (i = 0; i < nr; i++)
+		pthread_join(threads[i], NULL);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -567,6 +642,11 @@ static void test_uprobe_usdt(void)
 {
 	test__skip();
 }
+
+static void test_uprobe_race(void)
+{
+	test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -585,4 +665,6 @@ void test_uprobe_syscall(void)
 		test_uprobe_multi();
 	if (test__start_subtest("uprobe_usdt"))
 		test_uprobe_usdt();
+	if (test__start_subtest("uprobe_race"))
+		test_uprobe_race();
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 12/13] selftests/bpf: Add uprobe syscall sigill signal test
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (10 preceding siblings ...)
  2024-12-11 13:34 ` [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
@ 2024-12-11 13:34 ` Jiri Olsa
  2024-12-11 13:34 ` [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Make sure that calling the uprobe syscall from outside the uprobe trampoline
results in a SIGILL signal.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index eacd14db8f8d..12631abc7017 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -607,6 +607,40 @@ static void test_uprobe_race(void)
 	for (i = 0; i < nr; i++)
 		pthread_join(threads[i], NULL);
 }
+
+#ifndef __NR_uprobe
+#define __NR_uprobe 336
+#endif
+
+static void test_uprobe_sigill(void)
+{
+	int status, err, pid;
+
+	pid = fork();
+	if (!ASSERT_GE(pid, 0, "fork"))
+		return;
+	/* child */
+	if (pid == 0) {
+		asm volatile (
+			"pushq %rax\n"
+			"pushq %rcx\n"
+			"pushq %r11\n"
+			"movq $" __stringify(__NR_uprobe) ", %rax\n"
+			"syscall\n"
+			"popq %r11\n"
+			"popq %rcx\n"
+			"retq\n"
+		);
+		exit(0);
+	}
+
+	err = waitpid(pid, &status, 0);
+	ASSERT_EQ(err, pid, "waitpid");
+
+	/* verify the child got killed with SIGILL */
+	ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED");
+	ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG");
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -647,6 +681,11 @@ static void test_uprobe_race(void)
 {
 	test__skip();
 }
+
+static void test_uprobe_sigill(void)
+{
+	test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -667,4 +706,6 @@ void test_uprobe_syscall(void)
 		test_uprobe_usdt();
 	if (test__start_subtest("uprobe_race"))
 		test_uprobe_race();
+	if (test__start_subtest("uprobe_sigill"))
+		test_uprobe_sigill();
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (11 preceding siblings ...)
  2024-12-11 13:34 ` [PATCH bpf-next 12/13] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
@ 2024-12-11 13:34 ` Jiri Olsa
  2024-12-13 21:57   ` Andrii Nakryiko
  2024-12-13  0:43 ` [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Andrii Nakryiko
  2024-12-13 10:51 ` Peter Zijlstra
  14 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-11 13:34 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
  Cc: bpf, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
	Steven Rostedt, Masami Hiramatsu, Alan Maguire, linux-kernel,
	linux-trace-kernel

Add a 5-byte nop uprobe trigger bench (x86_64 specific) to measure
uprobes/uretprobes on top of the nop5 instruction.

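The individual benchmarks can be run the same way the run_bench_uprobes.sh
script invokes them, for example:

  # ./bench -w2 -d5 -a trig-uprobe-nop5
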
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/testing/selftests/bpf/bench.c           | 12 ++++++
 .../selftests/bpf/benchs/bench_trigger.c      | 42 +++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_uprobes.sh |  2 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b..0fd8c9b0d38f 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -526,6 +526,12 @@ extern const struct bench bench_trig_uprobe_multi_push;
 extern const struct bench bench_trig_uretprobe_multi_push;
 extern const struct bench bench_trig_uprobe_multi_ret;
 extern const struct bench bench_trig_uretprobe_multi_ret;
+#ifdef __x86_64__
+extern const struct bench bench_trig_uprobe_nop5;
+extern const struct bench bench_trig_uretprobe_nop5;
+extern const struct bench bench_trig_uprobe_multi_nop5;
+extern const struct bench bench_trig_uretprobe_multi_nop5;
+#endif
 
 extern const struct bench bench_rb_libbpf;
 extern const struct bench bench_rb_custom;
@@ -586,6 +592,12 @@ static const struct bench *benchs[] = {
 	&bench_trig_uretprobe_multi_push,
 	&bench_trig_uprobe_multi_ret,
 	&bench_trig_uretprobe_multi_ret,
+#ifdef __x86_64__
+	&bench_trig_uprobe_nop5,
+	&bench_trig_uretprobe_nop5,
+	&bench_trig_uprobe_multi_nop5,
+	&bench_trig_uretprobe_multi_nop5,
+#endif
 	/* ringbuf/perfbuf benchmarks */
 	&bench_rb_libbpf,
 	&bench_rb_custom,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 32e9f194d449..e74289a3270b 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -333,6 +333,20 @@ static void *uprobe_producer_ret(void *input)
 	return NULL;
 }
 
+#ifdef __x86_64__
+__nocf_check __weak void uprobe_target_nop5(void)
+{
+	asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
+}
+
+static void *uprobe_producer_nop5(void *input)
+{
+	while (true)
+		uprobe_target_nop5();
+	return NULL;
+}
+#endif
+
 static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
 {
 	size_t uprobe_offset;
@@ -448,6 +462,28 @@ static void uretprobe_multi_ret_setup(void)
 	usetup(true, true /* use_multi */, &uprobe_target_ret);
 }
 
+#ifdef __x86_64__
+static void uprobe_nop5_setup(void)
+{
+	usetup(false, false /* !use_multi */, &uprobe_target_nop5);
+}
+
+static void uretprobe_nop5_setup(void)
+{
+	usetup(false, false /* !use_multi */, &uprobe_target_nop5);
+}
+
+static void uprobe_multi_nop5_setup(void)
+{
+	usetup(false, true /* use_multi */, &uprobe_target_nop5);
+}
+
+static void uretprobe_multi_nop5_setup(void)
+{
+	usetup(false, true /* use_multi */, &uprobe_target_nop5);
+}
+#endif
+
 const struct bench bench_trig_syscall_count = {
 	.name = "trig-syscall-count",
 	.validate = trigger_validate,
@@ -506,3 +542,9 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
 BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
 BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
 BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
+#ifdef __x86_64__
+BENCH_TRIG_USERMODE(uprobe_nop5, nop5, "uprobe-nop5");
+BENCH_TRIG_USERMODE(uretprobe_nop5, nop5, "uretprobe-nop5");
+BENCH_TRIG_USERMODE(uprobe_multi_nop5, nop5, "uprobe-multi-nop5");
+BENCH_TRIG_USERMODE(uretprobe_multi_nop5, nop5, "uretprobe-multi-nop5");
+#endif
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
index af169f831f2f..03f55405484b 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
@@ -2,7 +2,7 @@
 
 set -eufo pipefail
 
-for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret}
+for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop5}
 do
 	summary=$(sudo ./bench -w2 -d5 -a trig-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
 	printf "%-15s: %s\n" $i "$summary"
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function
  2024-12-11 13:33 ` [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
@ 2024-12-13  0:42   ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  0:42 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> We are about to add uprobe trampoline, so cleaning up the namespace.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 2 +-
>  include/linux/uprobes.h   | 2 +-
>  kernel/events/uprobes.c   | 4 ++--
>  3 files changed, 4 insertions(+), 4 deletions(-)
>

LGTM

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 5a952c5ea66b..22a17c149a55 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -338,7 +338,7 @@ extern u8 uretprobe_trampoline_entry[];
>  extern u8 uretprobe_trampoline_end[];
>  extern u8 uretprobe_syscall_check[];
>
> -void *arch_uprobe_trampoline(unsigned long *psize)
> +void *arch_uretprobe_trampoline(unsigned long *psize)
>  {
>         static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
>         struct pt_regs *regs = task_pt_regs(current);
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index e0a4c2082245..09a298e416a8 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -211,7 +211,7 @@ extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
>  extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>                                          void *src, unsigned long len);
>  extern void uprobe_handle_trampoline(struct pt_regs *regs);
> -extern void *arch_uprobe_trampoline(unsigned long *psize);
> +extern void *arch_uretprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index fa04b14a7d72..e0e3ebb4c0a1 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1695,7 +1695,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
>         return ret;
>  }
>
> -void * __weak arch_uprobe_trampoline(unsigned long *psize)
> +void * __weak arch_uretprobe_trampoline(unsigned long *psize)
>  {
>         static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
>
> @@ -1727,7 +1727,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
>         init_waitqueue_head(&area->wq);
>         /* Reserve the 1st slot for get_trampoline_vaddr() */
>         set_bit(0, area->bitmap);
> -       insns = arch_uprobe_trampoline(&insns_size);
> +       insns = arch_uretprobe_trampoline(&insns_size);
>         arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
>
>         if (!xol_add_vma(mm, area))
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (12 preceding siblings ...)
  2024-12-11 13:34 ` [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
@ 2024-12-13  0:43 ` Andrii Nakryiko
  2024-12-13  9:46   ` Jiri Olsa
  2024-12-13 10:51 ` Peter Zijlstra
  14 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  0:43 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
>
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
>
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.
>
> The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> original instructions) in a loop and counts how many of those happened
> per second (the unit below is million loops).
>
> There's big speed up if you consider current usdt implementation
> (uprobe-nop) compared to proposed usdt (uprobe-nop5):
>
>   # ./benchs/run_bench_uprobes.sh
>
>       usermode-count :  233.831 ± 0.257M/s
>       syscall-count  :   12.107 ± 0.038M/s
>   --> uprobe-nop     :    3.246 ± 0.004M/s
>       uprobe-push    :    3.057 ± 0.000M/s
>       uprobe-ret     :    1.113 ± 0.003M/s
>   --> uprobe-nop5    :    6.751 ± 0.037M/s
>       uretprobe-nop  :    1.740 ± 0.015M/s
>       uretprobe-push :    1.677 ± 0.018M/s
>       uretprobe-ret  :    0.852 ± 0.005M/s
>       uretprobe-nop5 :    6.769 ± 0.040M/s

uretprobe-nop5 throughput is the same as uprobe-nop5?..


>
>
> v1 changes:
> - rebased on top of bpf-next/master
> - couple of function/variable renames [Andrii]
> - added nop5 emulation [Andrii]
> - added checks to arch_uprobe_verify_opcode [Andrii]
> - fixed arch_uprobe_is_callable/find_nearest_page [Andrii]
> - used CALL_INSN_OPCODE [Masami]
> - added uprobe-nop5 benchmark [Andrii]
> - using atomic64_t in tramp_area [Andri]
> - using single page for all uprobe trampoline mappings
>
> thanks,
> jirka
>
>
> ---
> Jiri Olsa (13):
>       uprobes: Rename arch_uretprobe_trampoline function
>       uprobes: Make copy_from_page global
>       uprobes: Add nbytes argument to uprobe_write_opcode
>       uprobes: Add arch_uprobe_verify_opcode function
>       uprobes: Add mapping for optimized uprobe trampolines
>       uprobes/x86: Add uprobe syscall to speed up uprobe
>       uprobes/x86: Add support to emulate nop5 instruction
>       uprobes/x86: Add support to optimize uprobes
>       selftests/bpf: Use 5-byte nop for x86 usdt probes
>       selftests/bpf: Add uprobe/usdt optimized test
>       selftests/bpf: Add hit/attach/detach race optimized uprobe test
>       selftests/bpf: Add uprobe syscall sigill signal test
>       selftests/bpf: Add 5-byte nop uprobe trigger bench
>
>  arch/x86/entry/syscalls/syscall_64.tbl                  |   1 +
>  arch/x86/include/asm/uprobes.h                          |   7 +++
>  arch/x86/kernel/uprobes.c                               | 255 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/syscalls.h                                |   2 +
>  include/linux/uprobes.h                                 |  25 +++++++-
>  kernel/events/uprobes.c                                 | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  kernel/fork.c                                           |   1 +
>  kernel/sys_ni.c                                         |   1 +
>  tools/testing/selftests/bpf/bench.c                     |  12 ++++
>  tools/testing/selftests/bpf/benchs/bench_trigger.c      |  42 +++++++++++++
>  tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
>  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/uprobe_optimized.c    |  29 +++++++++
>  tools/testing/selftests/bpf/sdt.h                       |   9 ++-
>  14 files changed, 880 insertions(+), 23 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 02/13] uprobes: Make copy_from_page global
  2024-12-11 13:33 ` [PATCH bpf-next 02/13] uprobes: Make copy_from_page global Jiri Olsa
@ 2024-12-13  0:43   ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  0:43 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Making copy_from_page global and adding uprobe prefix.
> Adding the uprobe prefix to copy_to_page as well for symmetry.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  1 +
>  kernel/events/uprobes.c | 16 ++++++++--------
>  2 files changed, 9 insertions(+), 8 deletions(-)
>

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 09a298e416a8..e24fbe496efb 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -213,6 +213,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>  extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uretprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
> +extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index e0e3ebb4c0a1..61ec91f635dc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -249,14 +249,14 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
>         return is_swbp_insn(insn);
>  }
>
> -static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
> +void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
>  {
>         void *kaddr = kmap_atomic(page);
>         memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
>         kunmap_atomic(kaddr);
>  }
>
> -static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
> +static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
>  {
>         void *kaddr = kmap_atomic(page);
>         memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
> @@ -277,7 +277,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>          * is a trap variant; uprobes always wins over any other (gdb)
>          * breakpoint.
>          */
> -       copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
> +       uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
>         is_swbp = is_swbp_insn(&old_opcode);
>
>         if (is_swbp_insn(new_opcode)) {
> @@ -524,7 +524,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>
>         __SetPageUptodate(new_page);
>         copy_highpage(new_page, old_page);
> -       copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> +       uprobe_copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>
>         if (!is_register) {
>                 struct page *orig_page;
> @@ -1026,7 +1026,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
>         if (IS_ERR(page))
>                 return PTR_ERR(page);
>
> -       copy_from_page(page, offset, insn, nbytes);
> +       uprobe_copy_from_page(page, offset, insn, nbytes);
>         put_page(page);
>
>         return 0;
> @@ -1367,7 +1367,7 @@ struct uprobe *uprobe_register(struct inode *inode,
>                 return ERR_PTR(-EINVAL);
>
>         /*
> -        * This ensures that copy_from_page(), copy_to_page() and
> +        * This ensures that uprobe_copy_from_page(), uprobe_copy_to_page() and
>          * __update_ref_ctr() can't cross page boundary.
>          */
>         if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
> @@ -1856,7 +1856,7 @@ void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>                                   void *src, unsigned long len)
>  {
>         /* Initialize the slot */
> -       copy_to_page(page, vaddr, src, len);
> +       uprobe_copy_to_page(page, vaddr, src, len);
>
>         /*
>          * We probably need flush_icache_user_page() but it needs vma.
> @@ -2287,7 +2287,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
>         if (result < 0)
>                 return result;
>
> -       copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> +       uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
>         put_page(page);
>   out:
>         /* This needs to return true for any variant of the trap insn */
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode
  2024-12-11 13:33 ` [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode Jiri Olsa
@ 2024-12-13  0:45   ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  0:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding nbytes argument to uprobe_write_opcode as preparation
> fo writing longer instructions in following changes.

typo: for

lgtm overall

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  3 ++-
>  kernel/events/uprobes.c | 14 ++++++++------
>  2 files changed, 10 insertions(+), 7 deletions(-)
>

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function
  2024-12-11 13:33 ` [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function Jiri Olsa
@ 2024-12-13  0:48   ` Andrii Nakryiko
  2024-12-13 13:21     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  0:48 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding arch_uprobe_verify_opcode function, so we can overload
> verification for each architecture in following changes.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  5 +++++
>  kernel/events/uprobes.c | 19 ++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index cc723bc48c1d..8843b7f99ed0 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -215,6 +215,11 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
>  extern void *arch_uretprobe_trampoline(unsigned long *psize);
>  extern unsigned long uprobe_get_trampoline_vaddr(void);
>  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> +extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> +                                    int nbytes);
> +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 7c2ecf11a573..8068f91de9e3 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -263,7 +263,13 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
>         kunmap_atomic(kaddr);
>  }
>
> -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> +{
> +       return is_swbp_insn(insn);

a bit weird that we ignore nbytes here... should we have an nbytes ==
UPROBE_SWBP_INSN_SIZE check somewhere here, or inside is_swbp_insn()?
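e.g. something like (just a sketch):

	return nbytes == UPROBE_SWBP_INSN_SIZE && is_swbp_insn(insn);
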

> +}
> +
> +int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> +                        uprobe_opcode_t *new_opcode)
>  {
>         uprobe_opcode_t old_opcode;
>         bool is_swbp;
> @@ -291,6 +297,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>         return 1;
>  }
>
> +__weak int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> +                                    int nbytes)
> +{
> +       return uprobe_verify_opcode(page, vaddr, new_opcode);

again, dropping nbytes on the floor here

> +}
> +
>  static struct delayed_uprobe *
>  delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
>  {
> @@ -479,7 +492,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         bool orig_page_huge = false;
>         unsigned int gup_flags = FOLL_FORCE;
>
> -       is_register = is_swbp_insn(insn);
> +       is_register = arch_uprobe_is_register(insn, nbytes);
>         uprobe = container_of(auprobe, struct uprobe, arch);
>
>  retry:
> @@ -490,7 +503,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>         if (IS_ERR(old_page))
>                 return PTR_ERR(old_page);
>
> -       ret = verify_opcode(old_page, vaddr, insn);
> +       ret = arch_uprobe_verify_opcode(auprobe, old_page, vaddr, insn, nbytes);
>         if (ret <= 0)
>                 goto put_old;
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines
  2024-12-11 13:33 ` [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2024-12-13  1:01   ` Andrii Nakryiko
  2024-12-13 13:42     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13  1:01 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding support to add special mapping for for user space trampoline

typo: for for

> with following functions:
>
>   uprobe_trampoline_get - find or add related uprobe_trampoline
>   uprobe_trampoline_put - remove ref or destroy uprobe_trampoline
>
> The user space trampoline is exported as architecture specific user space
> special mapping, which is provided by arch_uprobe_trampoline_mapping
> function.
>
> The uprobe trampoline needs to be callable/reachable from the probe address,
> so while searching for available address we use arch_uprobe_is_callable
> function to decide if the uprobe trampoline is callable from the probe address.
>
> All uprobe_trampoline objects are stored in uprobes_state object and
> are cleaned up when the process mm_struct goes down.
>
> Locking is provided by callers in following changes.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  12 +++++
>  kernel/events/uprobes.c | 114 ++++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c           |   1 +
>  3 files changed, 127 insertions(+)
>

Ran out of time for today, will continue tomorrow for the rest of
patches. Some comments below.

The numbers are really encouraging, though!

> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 8843b7f99ed0..c4ee755ca2a1 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -16,6 +16,7 @@
>  #include <linux/types.h>
>  #include <linux/wait.h>
>  #include <linux/timer.h>
> +#include <linux/mutex.h>
>
>  struct uprobe;
>  struct vm_area_struct;
> @@ -172,6 +173,13 @@ struct xol_area;
>
>  struct uprobes_state {
>         struct xol_area         *xol_area;
> +       struct hlist_head       tramp_head;
> +};
> +

should we make uprobes_state be linked by a pointer from mm_struct
instead of increasing mm for each added field? right now it's
embedded; I don't think it's problematic to allocate it on demand and
keep it until mm_struct is freed
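
i.e. something like this (sketch):

	struct mm_struct {
		...
	#ifdef CONFIG_UPROBES
		struct uprobes_state	*uprobes_state;	/* allocated on demand */
	#endif
		...
	};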

> +struct uprobe_trampoline {
> +       struct hlist_node       node;
> +       unsigned long           vaddr;
> +       atomic64_t              ref;
>  };
>
>  extern void __init uprobes_init(void);
> @@ -220,6 +228,10 @@ extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *p
>                                      unsigned long vaddr, uprobe_opcode_t *new_opcode,
>                                      int nbytes);
>  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> +extern struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr);
> +extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
> +extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> +extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8068f91de9e3..f57918c624da 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -615,6 +615,118 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
>  }
>
> +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)

bikeshedding some more, I still find "is_callable" confusing. How
about "is_reachable_by_call"? slightly verbose, but probably more
meaningful?

> +{
> +       return false;
> +}
> +
> +const struct vm_special_mapping * __weak arch_uprobe_trampoline_mapping(void)
> +{
> +       return NULL;
> +}
> +
> +static unsigned long find_nearest_page(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma, *prev;
> +       VMA_ITERATOR(vmi, mm, 0);
> +
> +       prev = vma_next(&vmi);

minor: we are missing an opportunity to add something between
[PAGE_SIZE, <first_vma_start>). Probably fine, but why not?
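e.g. something like this right after the first vma_next() (sketch):

	/* also consider the gap below the first vma */
	if (prev->vm_start >= 2 * PAGE_SIZE &&
	    arch_uprobe_is_callable(prev->vm_start - PAGE_SIZE, vaddr))
		return prev->vm_start - PAGE_SIZE;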

> +       vma = vma_next(&vmi);
> +       while (vma) {
> +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE) {
> +                       if (arch_uprobe_is_callable(prev->vm_end, vaddr))
> +                               return prev->vm_end;
> +                       if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
> +                               return vma->vm_start - PAGE_SIZE;
> +               }
> +
> +               prev = vma;
> +               vma = vma_next(&vmi);
> +       }
> +
> +       return 0;
> +}
> +

[...]

> +struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> +{
> +       struct uprobes_state *state = &current->mm->uprobes_state;
> +       struct uprobe_trampoline *tramp = NULL;
> +
> +       hlist_for_each_entry(tramp, &state->tramp_head, node) {
> +               if (arch_uprobe_is_callable(tramp->vaddr, vaddr)) {
> +                       atomic64_inc(&tramp->ref);
> +                       return tramp;
> +               }
> +       }
> +
> +       tramp = create_uprobe_trampoline(vaddr);
> +       if (!tramp)
> +               return NULL;
> +
> +       hlist_add_head(&tramp->node, &state->tramp_head);
> +       return tramp;
> +}
> +
> +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> +{
> +       hlist_del(&tramp->node);
> +       kfree(tramp);

hmm... shouldn't this be RCU-delayed (RCU Tasks Trace for uprobes),
otherwise we might have some CPU executing code in that trampoline,
no?

> +}
> +

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13  0:43 ` [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Andrii Nakryiko
@ 2024-12-13  9:46   ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13  9:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Dec 12, 2024 at 04:43:02PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> >
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> >
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 8.
> >
> > The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> > original instructions) in a loop and counts how many of those happened
> > per second (the unit below is million loops).
> >
> > There's big speed up if you consider current usdt implementation
> > (uprobe-nop) compared to proposed usdt (uprobe-nop5):
> >
> >   # ./benchs/run_bench_uprobes.sh
> >
> >       usermode-count :  233.831 ± 0.257M/s
> >       syscall-count  :   12.107 ± 0.038M/s
> >   --> uprobe-nop     :    3.246 ± 0.004M/s
> >       uprobe-push    :    3.057 ± 0.000M/s
> >       uprobe-ret     :    1.113 ± 0.003M/s
> >   --> uprobe-nop5    :    6.751 ± 0.037M/s
> >       uretprobe-nop  :    1.740 ± 0.015M/s
> >       uretprobe-push :    1.677 ± 0.018M/s
> >       uretprobe-ret  :    0.852 ± 0.005M/s
> >       uretprobe-nop5 :    6.769 ± 0.040M/s
> 
> uretprobe-nop5 throughput is the same as uprobe-nop5?..

ok, there's a bug in the uretprobe bench setup, the number is wrong, sorry
will send new numbers

jirka

> 
> 
> >
> >
> > v1 changes:
> > - rebased on top of bpf-next/master
> > - couple of function/variable renames [Andrii]
> > - added nop5 emulation [Andrii]
> > - added checks to arch_uprobe_verify_opcode [Andrii]
> > - fixed arch_uprobe_is_callable/find_nearest_page [Andrii]
> > - used CALL_INSN_OPCODE [Masami]
> > - added uprobe-nop5 benchmark [Andrii]
> > - using atomic64_t in tramp_area [Andri]
> > - using single page for all uprobe trampoline mappings
> >
> > thanks,
> > jirka
> >
> >
> > ---
> > Jiri Olsa (13):
> >       uprobes: Rename arch_uretprobe_trampoline function
> >       uprobes: Make copy_from_page global
> >       uprobes: Add nbytes argument to uprobe_write_opcode
> >       uprobes: Add arch_uprobe_verify_opcode function
> >       uprobes: Add mapping for optimized uprobe trampolines
> >       uprobes/x86: Add uprobe syscall to speed up uprobe
> >       uprobes/x86: Add support to emulate nop5 instruction
> >       uprobes/x86: Add support to optimize uprobes
> >       selftests/bpf: Use 5-byte nop for x86 usdt probes
> >       selftests/bpf: Add uprobe/usdt optimized test
> >       selftests/bpf: Add hit/attach/detach race optimized uprobe test
> >       selftests/bpf: Add uprobe syscall sigill signal test
> >       selftests/bpf: Add 5-byte nop uprobe trigger bench
> >
> >  arch/x86/entry/syscalls/syscall_64.tbl                  |   1 +
> >  arch/x86/include/asm/uprobes.h                          |   7 +++
> >  arch/x86/kernel/uprobes.c                               | 255 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/syscalls.h                                |   2 +
> >  include/linux/uprobes.h                                 |  25 +++++++-
> >  kernel/events/uprobes.c                                 | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
> >  kernel/fork.c                                           |   1 +
> >  kernel/sys_ni.c                                         |   1 +
> >  tools/testing/selftests/bpf/bench.c                     |  12 ++++
> >  tools/testing/selftests/bpf/benchs/bench_trigger.c      |  42 +++++++++++++
> >  tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
> >  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/testing/selftests/bpf/progs/uprobe_optimized.c    |  29 +++++++++
> >  tools/testing/selftests/bpf/sdt.h                       |   9 ++-
> >  14 files changed, 880 insertions(+), 23 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction
  2024-12-11 13:33 ` [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
@ 2024-12-13 10:45   ` Peter Zijlstra
  2024-12-13 13:02     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2024-12-13 10:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 02:33:56PM +0100, Jiri Olsa wrote:
> Adding support to emulate nop5 as the original uprobe instruction.
> 
> This speeds up uprobes on top of nop5 instructions:
> (results from benchs/run_bench_uprobes.sh)
> 
> current:
> 
>      uprobe-nop     :    3.252 ± 0.019M/s
>      uprobe-push    :    3.097 ± 0.002M/s
>      uprobe-ret     :    1.116 ± 0.001M/s
>  --> uprobe-nop5    :    1.115 ± 0.001M/s
>      uretprobe-nop  :    1.731 ± 0.016M/s
>      uretprobe-push :    1.673 ± 0.023M/s
>      uretprobe-ret  :    0.843 ± 0.009M/s
>  --> uretprobe-nop5 :    1.124 ± 0.001M/s
> 
> after the change:
> 
>      uprobe-nop     :    3.281 ± 0.003M/s
>      uprobe-push    :    3.085 ± 0.003M/s
>      uprobe-ret     :    1.130 ± 0.000M/s
>  --> uprobe-nop5    :    3.276 ± 0.007M/s
>      uretprobe-nop  :    1.716 ± 0.016M/s
>      uretprobe-push :    1.651 ± 0.017M/s
>      uretprobe-ret  :    0.846 ± 0.006M/s
>  --> uretprobe-nop5 :    3.279 ± 0.002M/s
> 
> Strangely I can see uretprobe-nop5 is now much faster compared to
> uretprobe-nop, while perf profiles for both are almost identical.
> I'm still checking on that.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 23e4f2821cff..cdea97f8cd39 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -909,6 +909,11 @@ static const struct uprobe_xol_ops push_xol_ops = {
>  	.emulate  = push_emulate_op,
>  };
>  
> +static int is_nop5_insn(uprobe_opcode_t *insn)
> +{
> +	return !memcmp(insn, x86_nops[5], 5);
> +}
> +
>  /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
>  static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
>  {
> @@ -928,6 +933,8 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
>  		break;
>  
>  	case 0x0f:
> +		if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn))
> +			goto setup;

This isn't right; this is not x86_64-specific code, and there's a bunch
of 32-bit 5-byte nops that do not start with 0f.

Also, since you already have the insn decoded, I would suggest you
simply check OPCODE2(insn) == 0x1f /* NOPL */ and length == 5.
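
Something like this under the 0x0f case, I suppose (untested sketch,
reusing the already decoded insn):

	/* NOPL with a 5 byte encoding, e.g. 0f 1f 44 00 00 */
	if (OPCODE2(insn) == 0x1f && insn->length == 5)
		goto setup;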

>  		if (insn->opcode.nbytes != 2)
>  			return -ENOSYS;
>  		/*
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2024-12-13 10:49   ` Peter Zijlstra
  2024-12-13 13:06     ` Jiri Olsa
  2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-15 12:06   ` David Laight
  2 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2024-12-13 10:49 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 02:33:57PM +0100, Jiri Olsa wrote:
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index cdea97f8cd39..b2420eeee23a 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c

> @@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
>  	else
>  		return regs->sp <= ret->stack;
>  }
> +
> +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> +			      unsigned long vaddr, uprobe_opcode_t *new_opcode,
> +			      int nbytes)
> +{
> +	uprobe_opcode_t old_opcode[5];
> +	bool is_call, is_swbp, is_nop5;
> +
> +	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
> +		return uprobe_verify_opcode(page, vaddr, new_opcode);
> +
> +	/*
> +	 * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following
> +	 * 5 bytes read won't cross the page boundary.
> +	 */
> +	uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> +	is_call = is_call_insn((uprobe_opcode_t *) &old_opcode);
> +	is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode);
> +	is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode);
> +
> +	/*
> > +	 * We allow the following transitions for optimized uprobes:
> +	 *
> +	 *   nop5 -> swbp -> call
> +	 *   ||      |       |
> +	 *   |'--<---'       |
> +	 *   '---<-----------'
> +	 *
> +	 * We return 1 to ack the write, 0 to do nothing, -1 to fail write.
> +	 *
> > +	 * If the current opcode (old_opcode) already has the desired value,
> +	 * we do nothing, because we are racing with another thread doing
> +	 * the update.
> +	 */
> +	switch (nbytes) {
> +	case 5:
> +		if (is_call_insn(new_opcode)) {
> +			if (is_swbp)
> +				return 1;
> +			if (is_call && !memcmp(new_opcode, &old_opcode, 5))
> +				return 0;
> +		} else {
> +			if (is_call || is_swbp)
> +				return 1;
> +			if (is_nop5)
> +				return 0;
> +		}
> +		break;
> +	case 1:
> +		if (is_swbp_insn(new_opcode)) {
> +			if (is_nop5)
> +				return 1;
> +			if (is_swbp || is_call)
> +				return 0;
> +		} else {
> +			if (is_swbp || is_call)
> +				return 1;
> +			if (is_nop5)
> +				return 0;
> +		}
> +	}
> +	return -1;
> +}
> +
> +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> +{
> +	return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn);
> +}
> +
> +static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
> +				   unsigned long vaddr)
> +{
> +	struct uprobe_trampoline *tramp;
> +	char call[5];
> +
> +	tramp = uprobe_trampoline_get(vaddr);
> +	if (!tramp)
> +		goto fail;
> +
> +	relative_call(call, (void *) vaddr, (void *) tramp->vaddr);
> +	if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5))
> +		goto fail;
> +
> +	set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags);
> +	return;
> +
> +fail:
> +	/* Once we fail we never try again. */
> +	clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +	uprobe_trampoline_put(tramp);
> +}
> +
> +static bool should_optimize(struct arch_uprobe *auprobe)
> +{
> +	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
> +		return false;
> +	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
> +		return false;
> +	return true;
> +}
> +
> +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> +	struct mm_struct *mm = current->mm;
> +
> +	if (!should_optimize(auprobe))
> +		return;
> +
> +	mmap_write_lock(mm);
> +	if (should_optimize(auprobe))
> +		__arch_uprobe_optimize(auprobe, mm, vaddr);
> +	mmap_write_unlock(mm);
> +}
> +
> +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> +{
> +	uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn;
> +
> +	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
> +		return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5);
> +
> +	return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE);
> +}
> +
> +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> +{
> +	long delta = (long)(vaddr + 5 - vtramp);
> +	return delta >= INT_MIN && delta <= INT_MAX;
> +}

All this code is useless on 32bit, right?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
                   ` (13 preceding siblings ...)
  2024-12-13  0:43 ` [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Andrii Nakryiko
@ 2024-12-13 10:51 ` Peter Zijlstra
  2024-12-13 13:07   ` Jiri Olsa
  14 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2024-12-13 10:51 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.

So ideally we'd put a check in the syscall which verifies it comes from
one of our trampolines and rejects any and all other usage.

The reason to do this is that we can then delete all this code the
moment it becomes irrelevant, without having to worry that userspace
might be 'creative' somewhere.
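
Roughly something like this at the top of the uprobe syscall (sketch,
the trampoline lookup helper is made up and would need to walk the
per-mm trampoline list):

	struct pt_regs *regs = task_pt_regs(current);

	/* must be called from one of our uprobe trampolines */
	if (!in_uprobe_trampoline(current->mm, instruction_pointer(regs))) {
		force_sig(SIGILL);
		return -1;
	}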

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction
  2024-12-13 10:45   ` Peter Zijlstra
@ 2024-12-13 13:02     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 11:45:36AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 11, 2024 at 02:33:56PM +0100, Jiri Olsa wrote:
> > Adding support to emulate nop5 as the original uprobe instruction.
> > 
> > This speeds up uprobes on top of nop5 instructions:
> > (results from benchs/run_bench_uprobes.sh)
> > 
> > current:
> > 
> >      uprobe-nop     :    3.252 ± 0.019M/s
> >      uprobe-push    :    3.097 ± 0.002M/s
> >      uprobe-ret     :    1.116 ± 0.001M/s
> >  --> uprobe-nop5    :    1.115 ± 0.001M/s
> >      uretprobe-nop  :    1.731 ± 0.016M/s
> >      uretprobe-push :    1.673 ± 0.023M/s
> >      uretprobe-ret  :    0.843 ± 0.009M/s
> >  --> uretprobe-nop5 :    1.124 ± 0.001M/s
> > 
> > after the change:
> > 
> >      uprobe-nop     :    3.281 ± 0.003M/s
> >      uprobe-push    :    3.085 ± 0.003M/s
> >      uprobe-ret     :    1.130 ± 0.000M/s
> >  --> uprobe-nop5    :    3.276 ± 0.007M/s
> >      uretprobe-nop  :    1.716 ± 0.016M/s
> >      uretprobe-push :    1.651 ± 0.017M/s
> >      uretprobe-ret  :    0.846 ± 0.006M/s
> >  --> uretprobe-nop5 :    3.279 ± 0.002M/s
> > 
> > Strangely I can see uretprobe-nop5 is now much faster compared to
> > uretprobe-nop, while perf profiles for both are almost identical.
> > I'm still checking on that.
> > 
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index 23e4f2821cff..cdea97f8cd39 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -909,6 +909,11 @@ static const struct uprobe_xol_ops push_xol_ops = {
> >  	.emulate  = push_emulate_op,
> >  };
> >  
> > +static int is_nop5_insn(uprobe_opcode_t *insn)
> > +{
> > +	return !memcmp(insn, x86_nops[5], 5);
> > +}
> > +
> >  /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
> >  static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
> >  {
> > @@ -928,6 +933,8 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
> >  		break;
> >  
> >  	case 0x0f:
> > +		if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn))
> > +			goto setup;
> 
> This isn't right, this is not x86_64 specific code, and there's a bunch
> of 32bit 5 byte nops that do not start with 0f.
> 
> Also, since you already have the insn decoded, I would suggest you
> simply check OPCODE2(insn) == 0x1f /* NOPL */ and length == 5.

ah right.. ok will change, thanks

jirka

> 
> >  		if (insn->opcode.nbytes != 2)
> >  			return -ENOSYS;
> >  		/*
> > -- 
> > 2.47.0
> > 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-13 10:49   ` Peter Zijlstra
@ 2024-12-13 13:06     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 11:49:07AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 11, 2024 at 02:33:57PM +0100, Jiri Olsa wrote:
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index cdea97f8cd39..b2420eeee23a 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> 
> > @@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
> >  	else
> >  		return regs->sp <= ret->stack;
> >  }
> > +
> > +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > +			      unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > +			      int nbytes)
> > +{
> > +	uprobe_opcode_t old_opcode[5];
> > +	bool is_call, is_swbp, is_nop5;
> > +
> > +	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
> > +		return uprobe_verify_opcode(page, vaddr, new_opcode);
> > +
> > +	/*
> > +	 * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following
> > +	 * 5 bytes read won't cross the page boundary.
> > +	 */
> > +	uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> > +	is_call = is_call_insn((uprobe_opcode_t *) &old_opcode);
> > +	is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode);
> > +	is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode);
> > +
> > +	/*
> > +	 * We allow the following transitions for optimized uprobes:
> > +	 *
> > +	 *   nop5 -> swbp -> call
> > +	 *   ||      |       |
> > +	 *   |'--<---'       |
> > +	 *   '---<-----------'
> > +	 *
> > +	 * We return 1 to ack the write, 0 to do nothing, -1 to fail write.
> > +	 *
> > +	 * If the current opcode (old_opcode) already has the desired value,
> > +	 * we do nothing, because we are racing with another thread doing
> > +	 * the update.
> > +	 */
> > +	switch (nbytes) {
> > +	case 5:
> > +		if (is_call_insn(new_opcode)) {
> > +			if (is_swbp)
> > +				return 1;
> > +			if (is_call && !memcmp(new_opcode, &old_opcode, 5))
> > +				return 0;
> > +		} else {
> > +			if (is_call || is_swbp)
> > +				return 1;
> > +			if (is_nop5)
> > +				return 0;
> > +		}
> > +		break;
> > +	case 1:
> > +		if (is_swbp_insn(new_opcode)) {
> > +			if (is_nop5)
> > +				return 1;
> > +			if (is_swbp || is_call)
> > +				return 0;
> > +		} else {
> > +			if (is_swbp || is_call)
> > +				return 1;
> > +			if (is_nop5)
> > +				return 0;
> > +		}
> > +	}
> > +	return -1;
> > +}
> > +
> > +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> > +{
> > +	return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn);
> > +}
> > +
> > +static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > +				   unsigned long vaddr)
> > +{
> > +	struct uprobe_trampoline *tramp;
> > +	char call[5];
> > +
> > +	tramp = uprobe_trampoline_get(vaddr);
> > +	if (!tramp)
> > +		goto fail;
> > +
> > +	relative_call(call, (void *) vaddr, (void *) tramp->vaddr);
> > +	if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5))
> > +		goto fail;
> > +
> > +	set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags);
> > +	return;
> > +
> > +fail:
> > +	/* Once we fail we never try again. */
> > +	clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> > +	uprobe_trampoline_put(tramp);
> > +}
> > +
> > +static bool should_optimize(struct arch_uprobe *auprobe)
> > +{
> > +	if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
> > +		return false;
> > +	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
> > +		return false;
> > +	return true;
> > +}
> > +
> > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +
> > +	if (!should_optimize(auprobe))
> > +		return;
> > +
> > +	mmap_write_lock(mm);
> > +	if (should_optimize(auprobe))
> > +		__arch_uprobe_optimize(auprobe, mm, vaddr);
> > +	mmap_write_unlock(mm);
> > +}
> > +
> > +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
> > +{
> > +	uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn;
> > +
> > +	if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags))
> > +		return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5);
> > +
> > +	return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE);
> > +}
> > +
> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +	long delta = (long)(vaddr + 5 - vtramp);
> > +	return delta >= INT_MIN && delta <= INT_MAX;
> > +}
> 
> All this code is useless on 32bit, right?

yes, there's a user_64bit_mode check in arch_uprobe_trampoline_mapping
when getting the trampoline, so the above won't get executed in practice..

but I think we should make it more obvious and put the check directly
into arch_uprobe_optimize
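
something like this perhaps (just a sketch, the exact placement of the
check is my guess):

void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
{
	struct mm_struct *mm = current->mm;

	/* the optimization is x86_64 only, make that explicit here */
	if (!user_64bit_mode(task_pt_regs(current)))
		return;

	if (!should_optimize(auprobe))
		return;

	mmap_write_lock(mm);
	if (should_optimize(auprobe))
		__arch_uprobe_optimize(auprobe, mm, vaddr);
	mmap_write_unlock(mm);
}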

jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 10:51 ` Peter Zijlstra
@ 2024-12-13 13:07   ` Jiri Olsa
  2024-12-13 13:54     ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 8.
> 
> So ideally we'd put a check in the syscall, which verifies it comes from
> one of our trampolines and reject any and all other usage.
> 
> The reason to do this is that we can then delete all this code the
> moment it becomes irrelevant without having to worry userspace might be
> 'creative' somewhere.

yes, we do that already in SYSCALL_DEFINE0(uprobe):

        /* Allow execution only from uprobe trampolines. */
        vma = vma_lookup(current->mm, regs->ip);
        if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
                force_sig(SIGILL);
                return -1;
        }

jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function
  2024-12-13  0:48   ` Andrii Nakryiko
@ 2024-12-13 13:21     ` Jiri Olsa
  2024-12-13 21:11       ` Andrii Nakryiko
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 13:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Dec 12, 2024 at 04:48:05PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding arch_uprobe_verify_opcode function, so we can overload
> > verification for each architecture in following changes.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  5 +++++
> >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> >  2 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index cc723bc48c1d..8843b7f99ed0 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -215,6 +215,11 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
> >  extern void *arch_uretprobe_trampoline(unsigned long *psize);
> >  extern unsigned long uprobe_get_trampoline_vaddr(void);
> >  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> > +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> > +extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > +                                    int nbytes);
> > +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 7c2ecf11a573..8068f91de9e3 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -263,7 +263,13 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
> >         kunmap_atomic(kaddr);
> >  }
> >
> > -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> > +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> > +{
> > +       return is_swbp_insn(insn);
> 
> a bit weird that we ignore nbytes here... should we have nbytes ==
> UPROBE_SWBP_INSN_SIZE check somewhere here or inside is_swbp_insn()?

the original is_swbp_insn function does not need that, and we need
nbytes in the overloaded arch_uprobe_is_register to distinguish
between the 1-byte and 5-byte updates..

jirka

> 
> > +}
> > +
> > +int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > +                        uprobe_opcode_t *new_opcode)
> >  {
> >         uprobe_opcode_t old_opcode;
> >         bool is_swbp;
> > @@ -291,6 +297,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
> >         return 1;
> >  }
> >
> > +__weak int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > +                                    int nbytes)
> > +{
> > +       return uprobe_verify_opcode(page, vaddr, new_opcode);
> 
> again, dropping nbytes on the floor here
> 
> > +}
> > +
> >  static struct delayed_uprobe *
> >  delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
> >  {
> > @@ -479,7 +492,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         bool orig_page_huge = false;
> >         unsigned int gup_flags = FOLL_FORCE;
> >
> > -       is_register = is_swbp_insn(insn);
> > +       is_register = arch_uprobe_is_register(insn, nbytes);
> >         uprobe = container_of(auprobe, struct uprobe, arch);
> >
> >  retry:
> > @@ -490,7 +503,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> >         if (IS_ERR(old_page))
> >                 return PTR_ERR(old_page);
> >
> > -       ret = verify_opcode(old_page, vaddr, insn);
> > +       ret = arch_uprobe_verify_opcode(auprobe, old_page, vaddr, insn, nbytes);
> >         if (ret <= 0)
> >                 goto put_old;
> >
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines
  2024-12-13  1:01   ` Andrii Nakryiko
@ 2024-12-13 13:42     ` Jiri Olsa
  2024-12-13 21:58       ` Andrii Nakryiko
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 13:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Thu, Dec 12, 2024 at 05:01:52PM -0800, Andrii Nakryiko wrote:

SNIP

> > ---
> >  include/linux/uprobes.h |  12 +++++
> >  kernel/events/uprobes.c | 114 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/fork.c           |   1 +
> >  3 files changed, 127 insertions(+)
> >
> 
> Ran out of time for today, will continue tomorrow for the rest of
> patches. Some comments below.

thanks!

> 
> The numbers are really encouraging, though!
> 
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 8843b7f99ed0..c4ee755ca2a1 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -16,6 +16,7 @@
> >  #include <linux/types.h>
> >  #include <linux/wait.h>
> >  #include <linux/timer.h>
> > +#include <linux/mutex.h>
> >
> >  struct uprobe;
> >  struct vm_area_struct;
> > @@ -172,6 +173,13 @@ struct xol_area;
> >
> >  struct uprobes_state {
> >         struct xol_area         *xol_area;
> > +       struct hlist_head       tramp_head;
> > +};
> > +
> 
> should we make uprobe_state be linked by a pointer from mm_struct
> instead of increasing mm for each added field? right now it's
> embedded, I don't think it's problematic to allocate it on demand and
> keep it until mm_struct is freed

seems like a good idea, I'll check on that

> 
> > +struct uprobe_trampoline {
> > +       struct hlist_node       node;
> > +       unsigned long           vaddr;
> > +       atomic64_t              ref;
> >  };
> >
> >  extern void __init uprobes_init(void);
> > @@ -220,6 +228,10 @@ extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *p
> >                                      unsigned long vaddr, uprobe_opcode_t *new_opcode,
> >                                      int nbytes);
> >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> > +extern struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr);
> > +extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
> > +extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> > +extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 8068f91de9e3..f57918c624da 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -615,6 +615,118 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
> >  }
> >
> > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> 
> bikeshedding some more, I still find "is_callable" confusing. How
> about "is_reachable_by_call"? slightly verbose, but probably more
> meaningful?

yep, more precise, will change

> 
> > +{
> > +       return false;
> > +}
> > +
> > +const struct vm_special_mapping * __weak arch_uprobe_trampoline_mapping(void)
> > +{
> > +       return NULL;
> > +}
> > +
> > +static unsigned long find_nearest_page(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma, *prev;
> > +       VMA_ITERATOR(vmi, mm, 0);
> > +
> > +       prev = vma_next(&vmi);
> 
> minor: we are missing an opportunity to add something between
> [PAGE_SIZE, <first_vma_start>). Probably fine, but why not?

true, will add that check
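
something along these lines maybe (untested sketch):

static unsigned long find_nearest_page(unsigned long vaddr)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma, *prev;
	VMA_ITERATOR(vmi, mm, 0);

	prev = vma_next(&vmi);

	/* also consider the gap between PAGE_SIZE and the first mapping */
	if (prev && prev->vm_start >= 2 * PAGE_SIZE &&
	    arch_uprobe_is_callable(prev->vm_start - PAGE_SIZE, vaddr))
		return prev->vm_start - PAGE_SIZE;

	vma = vma_next(&vmi);
	while (vma) {
		if (vma->vm_start - prev->vm_end >= PAGE_SIZE) {
			if (arch_uprobe_is_callable(prev->vm_end, vaddr))
				return prev->vm_end;
			if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
				return vma->vm_start - PAGE_SIZE;
		}

		prev = vma;
		vma = vma_next(&vmi);
	}

	return 0;
}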

> 
> > +       vma = vma_next(&vmi);
> > +       while (vma) {
> > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE) {
> > +                       if (arch_uprobe_is_callable(prev->vm_end, vaddr))
> > +                               return prev->vm_end;
> > +                       if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
> > +                               return vma->vm_start - PAGE_SIZE;
> > +               }
> > +
> > +               prev = vma;
> > +               vma = vma_next(&vmi);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> 
> [...]
> 
> > +struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > +{
> > +       struct uprobes_state *state = &current->mm->uprobes_state;
> > +       struct uprobe_trampoline *tramp = NULL;
> > +
> > +       hlist_for_each_entry(tramp, &state->tramp_head, node) {
> > +               if (arch_uprobe_is_callable(tramp->vaddr, vaddr)) {
> > +                       atomic64_inc(&tramp->ref);
> > +                       return tramp;
> > +               }
> > +       }
> > +
> > +       tramp = create_uprobe_trampoline(vaddr);
> > +       if (!tramp)
> > +               return NULL;
> > +
> > +       hlist_add_head(&tramp->node, &state->tramp_head);
> > +       return tramp;
> > +}
> > +
> > +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > +{
> > +       hlist_del(&tramp->node);
> > +       kfree(tramp);
> 
> hmm... shouldn't this be RCU-delayed (RCU Tasks Trace for uprobes),
> otherwise we might have some CPU executing code in that trampoline,
> no?

so we call destroy_uprobe_trampoline in 2 scenarios:

  - from uprobe_trampoline_put (in __arch_uprobe_optimize) when we failed
    to optimize the uprobe, so no task can execute it at that point

  - from clear_tramp_head as part of the uprobe trampolines cleanup
    (__mmput -> uprobe_clear_state) at which point the task should be dead

jirka

> 
> > +}
> > +
> 
> [...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-11 13:33 ` [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2024-12-13 13:48   ` Thomas Weißschuh
  2024-12-13 14:51     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Weißschuh @ 2024-12-13 13:48 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On 2024-12-11 14:33:55+0100, Jiri Olsa wrote:
> Adding new uprobe syscall that calls uprobe handlers for given
> 'breakpoint' address.
> 
> The idea is that the 'breakpoint' address calls the user space
> trampoline which executes the uprobe syscall.
> 
> The syscall handler reads the return address of the initial call
> to retrieve the original 'breakpoint' address. With this address
> we find the related uprobe object and call its consumers.
> 
> Adding the arch_uprobe_trampoline_mapping function that provides
> uprobe trampoline mapping. This mapping is backed by one global
> page initialized at __init time and shared by all the mapping
> instances.
> 
> We do not allow the uprobe syscall to be executed if the caller is not
> from the uprobe trampoline mapping.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  arch/x86/kernel/uprobes.c              | 80 ++++++++++++++++++++++++++
>  include/linux/syscalls.h               |  2 +
>  include/linux/uprobes.h                |  1 +
>  kernel/events/uprobes.c                | 22 +++++++
>  kernel/sys_ni.c                        |  1 +
>  6 files changed, 107 insertions(+)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5eb708bff1c7..88e388c7675b 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -345,6 +345,7 @@
>  333	common	io_pgetevents		sys_io_pgetevents
>  334	common	rseq			sys_rseq
>  335	common	uretprobe		sys_uretprobe
> +336	common	uprobe			sys_uprobe
>  # don't use numbers 387 through 423, add new calls after the last
>  # 'common' entry
>  424	common	pidfd_send_signal	sys_pidfd_send_signal
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 22a17c149a55..23e4f2821cff 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -425,6 +425,86 @@ SYSCALL_DEFINE0(uretprobe)
>  	return -1;
>  }
>  
> +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> +{
> +	return -EPERM;
> +}
> +
> +static struct vm_special_mapping tramp_mapping = {
> +	.name   = "[uprobes-trampoline]",
> +	.mremap = tramp_mremap,
> +};
> +
> +SYSCALL_DEFINE0(uprobe)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +	struct vm_area_struct *vma;
> +	unsigned long bp_vaddr;
> +	int err;
> +
> +	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));

A #define for the magic values would be nice.

> +	if (err) {
> +		force_sig(SIGILL);
> +		return -1;
> +	}
> +
> +	/* Allow execution only from uprobe trampolines. */
> +	vma = vma_lookup(current->mm, regs->ip);
> +	if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {

vma_is_special_mapping()

> +		force_sig(SIGILL);
> +		return -1;
> +	}
> +
> +	handle_syscall_uprobe(regs, bp_vaddr - 5);
> +	return 0;
> +}
> +
> +asm (
> +	".pushsection .rodata\n"
> +	".global uprobe_trampoline_entry\n"
> +	"uprobe_trampoline_entry:\n"
> +	"endbr64\n"
> +	"push %rcx\n"
> +	"push %r11\n"
> +	"push %rax\n"
> +	"movq $" __stringify(__NR_uprobe) ", %rax\n"
> +	"syscall\n"
> +	"pop %rax\n"
> +	"pop %r11\n"
> +	"pop %rcx\n"
> +	"ret\n"
> +	".global uprobe_trampoline_end\n"
> +	"uprobe_trampoline_end:\n"
> +	".popsection\n"
> +);
> +
> +extern __visible u8 uprobe_trampoline_entry[];
> +extern __visible u8 uprobe_trampoline_end[];
> +
> +const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	return user_64bit_mode(regs) ? &tramp_mapping : NULL;
> +}
> +
> +static int __init arch_uprobes_init(void)
> +{
> +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> +	static struct page *pages[2];
> +	struct page *page;
> +
> +	page = alloc_page(GFP_HIGHUSER);

That page could be in static memory, removing the need for the explicit
allocation. It could also be __ro_after_init.
Then tramp_mapping itself can be const.

Also this seems to waste the page on 32bit kernels.

> +	if (!page)
> +		return -ENOMEM;
> +	pages[0] = page;
> +	tramp_mapping.pages = (struct page **) &pages;

tramp_mapping.pages = pages; ?

> +	arch_uprobe_copy_ixol(page, 0, uprobe_trampoline_entry, size);
> +	return 0;
> +}
> +
> +late_initcall(arch_uprobes_init);
> +
>  /*
>   * If arch_uprobe->insn doesn't use rip-relative addressing, return
>   * immediately.  Otherwise, rewrite the instruction so that it accesses

[..]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 13:07   ` Jiri Olsa
@ 2024-12-13 13:54     ` Peter Zijlstra
  2024-12-13 14:05       ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2024-12-13 13:54 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > nop instruction.
> > > 
> > > The generic approach (optimize all uprobes) is hard due to emulating
> > > possible multiple original instructions and its related issues. The
> > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > with that.
> > > 
> > > The basic idea is to replace breakpoint exception with syscall which
> > > is faster on x86_64. For more details please see changelog of patch 8.
> > 
> > So ideally we'd put a check in the syscall, which verifies it comes from
> > one of our trampolines and reject any and all other usage.
> > 
> > The reason to do this is that we can then delete all this code the
> > moment it becomes irrelevant without having to worry userspace might be
> > 'creative' somewhere.
> 
> yes, we do that already in SYSCALL_DEFINE0(uprobe):
> 
>         /* Allow execution only from uprobe trampolines. */
>         vma = vma_lookup(current->mm, regs->ip);
>         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
>                 force_sig(SIGILL);
>                 return -1;
>         }

Ah, right I missed that. Doesn't that need more locking through? The
moment vma_lookup() returns that vma can go bad.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 13:54     ` Peter Zijlstra
@ 2024-12-13 14:05       ` Jiri Olsa
  2024-12-13 18:39         ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > nop instruction.
> > > > 
> > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > possible multiple original instructions and its related issues. The
> > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > with that.
> > > > 
> > > > The basic idea is to replace breakpoint exception with syscall which
> > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > 
> > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > one of our trampolines and reject any and all other usage.
> > > 
> > > The reason to do this is that we can then delete all this code the
> > > moment it becomes irrelevant without having to worry userspace might be
> > > 'creative' somewhere.
> > 
> > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > 
> >         /* Allow execution only from uprobe trampolines. */
> >         vma = vma_lookup(current->mm, regs->ip);
> >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> >                 force_sig(SIGILL);
> >                 return -1;
> >         }
> 
> Ah, right I missed that. Doesn't that need more locking through? The
> moment vma_lookup() returns that vma can go bad.

ugh yes.. I guess mmap_read_lock(current->mm) should do, will check
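
something like this maybe (just a sketch):

	/* Allow execution only from uprobe trampolines. */
	mmap_read_lock(current->mm);
	vma = vma_lookup(current->mm, regs->ip);
	if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
		mmap_read_unlock(current->mm);
		force_sig(SIGILL);
		return -1;
	}
	mmap_read_unlock(current->mm);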

thanks,
jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-13 13:48   ` Thomas Weißschuh
@ 2024-12-13 14:51     ` Jiri Olsa
  2024-12-13 15:12       ` Thomas Weißschuh
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 14:51 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 02:48:00PM +0100, Thomas Weißschuh wrote:

SNIP

> > +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> > +{
> > +	return -EPERM;
> > +}
> > +
> > +static struct vm_special_mapping tramp_mapping = {
> > +	.name   = "[uprobes-trampoline]",
> > +	.mremap = tramp_mremap,
> > +};
> > +
> > +SYSCALL_DEFINE0(uprobe)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +	struct vm_area_struct *vma;
> > +	unsigned long bp_vaddr;
> > +	int err;
> > +
> > +	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
> 
> A #define for the magic values would be nice.

the 3*8 is to skip the 3 values pushed on the stack and get the return ip
value, I'd prefer to keep 3*8 but it's definitely missing an explanatory
comment above, wdyt?
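
something along these lines perhaps (just a suggestion for the wording):

	/*
	 * The trampoline pushes rcx, r11 and rax (3 * 8 bytes) before doing
	 * the syscall, so the return address of the original call sits at
	 * sp + 3*8.
	 */
	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));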

> 
> > +	if (err) {
> > +		force_sig(SIGILL);
> > +		return -1;
> > +	}
> > +
> > +	/* Allow execution only from uprobe trampolines. */
> > +	vma = vma_lookup(current->mm, regs->ip);
> > +	if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> 
> vma_is_special_mapping()

did not know about this function, thanks

> 
> > +		force_sig(SIGILL);
> > +		return -1;
> > +	}
> > +
> > +	handle_syscall_uprobe(regs, bp_vaddr - 5);
> > +	return 0;
> > +}
> > +
> > +asm (
> > +	".pushsection .rodata\n"
> > +	".global uprobe_trampoline_entry\n"
> > +	"uprobe_trampoline_entry:\n"
> > +	"endbr64\n"
> > +	"push %rcx\n"
> > +	"push %r11\n"
> > +	"push %rax\n"
> > +	"movq $" __stringify(__NR_uprobe) ", %rax\n"
> > +	"syscall\n"
> > +	"pop %rax\n"
> > +	"pop %r11\n"
> > +	"pop %rcx\n"
> > +	"ret\n"
> > +	".global uprobe_trampoline_end\n"
> > +	"uprobe_trampoline_end:\n"
> > +	".popsection\n"
> > +);
> > +
> > +extern __visible u8 uprobe_trampoline_entry[];
> > +extern __visible u8 uprobe_trampoline_end[];
> > +
> > +const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +
> > +	return user_64bit_mode(regs) ? &tramp_mapping : NULL;
> > +}
> > +
> > +static int __init arch_uprobes_init(void)
> > +{
> > +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> > +	static struct page *pages[2];
> > +	struct page *page;
> > +
> > +	page = alloc_page(GFP_HIGHUSER);
> 
> That page could be in static memory, removing the need for the explicit
> allocation. It could also be __ro_after_init.
> Then tramp_mapping itself can be const.

hum, what would that look like? I think that to get a proper page object
you have to call alloc_page or some other page allocation function..
what am I missing?

> 
> Also this seems to waste the page on 32bit kernels.

it's inside CONFIG_X86_64 ifdef

> 
> > +	if (!page)
> > +		return -ENOMEM;
> > +	pages[0] = page;
> > +	tramp_mapping.pages = (struct page **) &pages;
> 
> tramp_mapping.pages = pages; ?

I think the compiler will cry about *pages[2] vs **pages types mismatch,
but I'll double check that

thanks,
jirka

> 
> > +	arch_uprobe_copy_ixol(page, 0, uprobe_trampoline_entry, size);
> > +	return 0;
> > +}
> > +
> > +late_initcall(arch_uprobes_init);
> > +
> >  /*
> >   * If arch_uprobe->insn doesn't use rip-relative addressing, return
> >   * immediately.  Otherwise, rewrite the instruction so that it accesses
> 
> [..]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-13 14:51     ` Jiri Olsa
@ 2024-12-13 15:12       ` Thomas Weißschuh
  2024-12-13 21:52         ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Weißschuh @ 2024-12-13 15:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On 2024-12-13 15:51:44+0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 02:48:00PM +0100, Thomas Weißschuh wrote:
> 
> SNIP
> 
> > > +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> > > +{
> > > +	return -EPERM;
> > > +}
> > > +
> > > +static struct vm_special_mapping tramp_mapping = {
> > > +	.name   = "[uprobes-trampoline]",
> > > +	.mremap = tramp_mremap,
> > > +};
> > > +
> > > +SYSCALL_DEFINE0(uprobe)
> > > +{
> > > +	struct pt_regs *regs = task_pt_regs(current);
> > > +	struct vm_area_struct *vma;
> > > +	unsigned long bp_vaddr;
> > > +	int err;
> > > +
> > > +	err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
> > 
> > A #define for the magic values would be nice.
> 
> the 3*8 is to skip the 3 values pushed on the stack and get the return ip
> value, I'd prefer to keep 3*8 but it's definitely missing an explanatory
> comment above, wdyt?

A comment sounds good.

> > > +	if (err) {
> > > +		force_sig(SIGILL);
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* Allow execution only from uprobe trampolines. */
> > > +	vma = vma_lookup(current->mm, regs->ip);
> > > +	if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > 
> > vma_is_special_mapping()
> 
> did not know about this function, thanks
> 
> > 
> > > +		force_sig(SIGILL);
> > > +		return -1;
> > > +	}
> > > +
> > > +	handle_syscall_uprobe(regs, bp_vaddr - 5);
> > > +	return 0;
> > > +}
> > > +
> > > +asm (
> > > +	".pushsection .rodata\n"
> > > +	".global uprobe_trampoline_entry\n"
> > > +	"uprobe_trampoline_entry:\n"
> > > +	"endbr64\n"
> > > +	"push %rcx\n"
> > > +	"push %r11\n"
> > > +	"push %rax\n"
> > > +	"movq $" __stringify(__NR_uprobe) ", %rax\n"
> > > +	"syscall\n"
> > > +	"pop %rax\n"
> > > +	"pop %r11\n"
> > > +	"pop %rcx\n"
> > > +	"ret\n"
> > > +	".global uprobe_trampoline_end\n"
> > > +	"uprobe_trampoline_end:\n"
> > > +	".popsection\n"
> > > +);
> > > +
> > > +extern __visible u8 uprobe_trampoline_entry[];
> > > +extern __visible u8 uprobe_trampoline_end[];
> > > +
> > > +const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void)
> > > +{
> > > +	struct pt_regs *regs = task_pt_regs(current);
> > > +
> > > +	return user_64bit_mode(regs) ? &tramp_mapping : NULL;
> > > +}
> > > +
> > > +static int __init arch_uprobes_init(void)
> > > +{
> > > +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> > > +	static struct page *pages[2];
> > > +	struct page *page;
> > > +
> > > +	page = alloc_page(GFP_HIGHUSER);
> > 
> > That page could be in static memory, removing the need for the explicit
> > allocation. It could also be __ro_after_init.
> > Then tramp_mapping itself can be const.
> 
> hum, what would that look like? I think that to get a proper page object
> you have to call alloc_page or some other page allocation function..
> what am I missing?

static u8 trampoline_page[PAGE_SIZE] __ro_after_init __aligned(PAGE_SIZE);
static struct page *tramp_mapping_pages[2] __ro_after_init;

static const struct vm_special_mapping tramp_mapping = {
	.name   = "[uprobes-trampoline]",
	.pages  = tramp_mapping_pages,
	.mremap = tramp_mremap,
};

static int __init arch_uprobes_init(void)
{
	...
	tramp_mapping_pages[0] = virt_to_page(trampoline_page);
	...
}

Untested, but it's similar to the stuff the vDSO implementations are
doing which I am working with at the moment.

> > 
> > Also this seems to waste the page on 32bit kernels.
> 
> it's inside CONFIG_X86_64 ifdef
> 
> > 
> > > +	if (!page)
> > > +		return -ENOMEM;
> > > +	pages[0] = page;
> > > +	tramp_mapping.pages = (struct page **) &pages;
> > 
> > tramp_mapping.pages = pages; ?
> 
> I think the compiler will cry about *pages[2] vs **pages types mismatch,
> but I'll double check that

It compiles for me.

> thanks,
> jirka
> 
> > 
> > > +	arch_uprobe_copy_ixol(page, 0, uprobe_trampoline_entry, size);
> > > +	return 0;
> > > +}
> > > +
> > > +late_initcall(arch_uprobes_init);
> > > +
> > >  /*
> > >   * If arch_uprobe->insn doesn't use rip-relative addressing, return
> > >   * immediately.  Otherwise, rewrite the instruction so that it accesses
> > 
> > [..]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 14:05       ` Jiri Olsa
@ 2024-12-13 18:39         ` Peter Zijlstra
  2024-12-13 21:52           ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2024-12-13 18:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > hi,
> > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > nop instruction.
> > > > > 
> > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > possible multiple original instructions and its related issues. The
> > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > with that.
> > > > > 
> > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > 
> > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > one of our trampolines and reject any and all other usage.
> > > > 
> > > > The reason to do this is that we can then delete all this code the
> > > > moment it becomes irrelevant without having to worry userspace might be
> > > > 'creative' somewhere.
> > > 
> > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > 
> > >         /* Allow execution only from uprobe trampolines. */
> > >         vma = vma_lookup(current->mm, regs->ip);
> > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > >                 force_sig(SIGILL);
> > >                 return -1;
> > >         }
> > 
> > Ah, right I missed that. Doesn't that need more locking through? The
> > moment vma_lookup() returns that vma can go bad.
> 
> ugh yes.. I guess mmap_read_lock(current->mm) should do, will check

If you check
tip/perf/core:kernel/events/uprobe.c:find_active_uprobe_speculative()
you'll find means of doing it locklessly using RCU.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function
  2024-12-13 13:21     ` Jiri Olsa
@ 2024-12-13 21:11       ` Andrii Nakryiko
  2024-12-13 21:52         ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 5:21 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 04:48:05PM -0800, Andrii Nakryiko wrote:
> > On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding arch_uprobe_verify_opcode function, so we can overload
> > > verification for each architecture in following changes.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  include/linux/uprobes.h |  5 +++++
> > >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> > >  2 files changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index cc723bc48c1d..8843b7f99ed0 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -215,6 +215,11 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
> > >  extern void *arch_uretprobe_trampoline(unsigned long *psize);
> > >  extern unsigned long uprobe_get_trampoline_vaddr(void);
> > >  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> > > +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> > > +extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > > +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > > +                                    int nbytes);
> > > +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> > >  #else /* !CONFIG_UPROBES */
> > >  struct uprobes_state {
> > >  };
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 7c2ecf11a573..8068f91de9e3 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -263,7 +263,13 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
> > >         kunmap_atomic(kaddr);
> > >  }
> > >
> > > -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> > > +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> > > +{
> > > +       return is_swbp_insn(insn);
> >
> > a bit weird that we ignore nbytes here... should we have nbytes ==
> > UPROBE_SWBP_INSN_SIZE check somewhere here or inside is_swbp_insn()?
>
> the original is_swbp_insn function does not need that and we need
> nbytes in the overloaded arch_uprobe_is_register to distinguish
> between 1 byte and 5 byte update..
>

and that's my point, if some architecture forgot to override it for
nop5 (or similar stuff), this default implementation should reject
an instruction that's not an original nop, no?


> jirka
>
> >
> > > +}
> > > +
> > > +int uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > > +                        uprobe_opcode_t *new_opcode)
> > >  {
> > >         uprobe_opcode_t old_opcode;
> > >         bool is_swbp;
> > > @@ -291,6 +297,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
> > >         return 1;
> > >  }
> > >
> > > +__weak int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > > +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > > +                                    int nbytes)
> > > +{
> > > +       return uprobe_verify_opcode(page, vaddr, new_opcode);
> >
> > again, dropping nbytes on the floor here
> >
> > > +}
> > > +
> > >  static struct delayed_uprobe *
> > >  delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
> > >  {
> > > @@ -479,7 +492,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > >         bool orig_page_huge = false;
> > >         unsigned int gup_flags = FOLL_FORCE;
> > >
> > > -       is_register = is_swbp_insn(insn);
> > > +       is_register = arch_uprobe_is_register(insn, nbytes);
> > >         uprobe = container_of(auprobe, struct uprobe, arch);
> > >
> > >  retry:
> > > @@ -490,7 +503,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > >         if (IS_ERR(old_page))
> > >                 return PTR_ERR(old_page);
> > >
> > > -       ret = verify_opcode(old_page, vaddr, insn);
> > > +       ret = arch_uprobe_verify_opcode(auprobe, old_page, vaddr, insn, nbytes);
> > >         if (ret <= 0)
> > >                 goto put_old;
> > >
> > > --
> > > 2.47.0
> > >

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function
  2024-12-13 21:11       ` Andrii Nakryiko
@ 2024-12-13 21:52         ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 21:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 01:11:30PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 13, 2024 at 5:21 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2024 at 04:48:05PM -0800, Andrii Nakryiko wrote:
> > > On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Adding arch_uprobe_verify_opcode function, so we can overload
> > > > verification for each architecture in following changes.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > >  include/linux/uprobes.h |  5 +++++
> > > >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> > > >  2 files changed, 21 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > index cc723bc48c1d..8843b7f99ed0 100644
> > > > --- a/include/linux/uprobes.h
> > > > +++ b/include/linux/uprobes.h
> > > > @@ -215,6 +215,11 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
> > > >  extern void *arch_uretprobe_trampoline(unsigned long *psize);
> > > >  extern unsigned long uprobe_get_trampoline_vaddr(void);
> > > >  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> > > > +extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode);
> > > > +extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> > > > +                                    unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > > > +                                    int nbytes);
> > > > +extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> > > >  #else /* !CONFIG_UPROBES */
> > > >  struct uprobes_state {
> > > >  };
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 7c2ecf11a573..8068f91de9e3 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -263,7 +263,13 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
> > > >         kunmap_atomic(kaddr);
> > > >  }
> > > >
> > > > -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> > > > +__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> > > > +{
> > > > +       return is_swbp_insn(insn);
> > >
> > > a bit weird that we ignore nbytes here... should we have nbytes ==
> > > UPROBE_SWBP_INSN_SIZE check somewhere here or inside is_swbp_insn()?
> >
> > the original is_swbp_insn function does not need that, and we need
> > nbytes in the overloaded arch_uprobe_is_register to distinguish
> > between the 1-byte and 5-byte updates..
> >
> 
> and that's my point, if some architecture forgot to override it for
> nop5 (or similar stuff), this default implementation should reject
> an instruction that's not an original nop, no?

ok, makes sense, will add that
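
e.g. something like this (sketch):

__weak bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
{
	/* generic code only understands the 1-byte breakpoint update */
	return nbytes == UPROBE_SWBP_INSN_SIZE && is_swbp_insn(insn);
}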

thanks,
jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-13 15:12       ` Thomas Weißschuh
@ 2024-12-13 21:52         ` Jiri Olsa
  2024-12-14 13:21           ` Thomas Weißschuh
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 21:52 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 04:12:46PM +0100, Thomas Weißschuh wrote:

SNIP

> > > > +static int __init arch_uprobes_init(void)
> > > > +{
> > > > +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> > > > +	static struct page *pages[2];
> > > > +	struct page *page;
> > > > +
> > > > +	page = alloc_page(GFP_HIGHUSER);
> > > 
> > > That page could be in static memory, removing the need for the explicit
> > > allocation. It could also be __ro_after_init.
> > > Then tramp_mapping itself can be const.
> > 
> > hum, what would that look like? I think that to get a proper page object
> > you have to call alloc_page or some other page allocation function..
> > what am I missing?
> 
> static u8 trampoline_page[PAGE_SIZE] __ro_after_init __aligned(PAGE_SIZE);
> static struct page *tramp_mapping_pages[2] __ro_after_init;
> 
> static const struct vm_special_mapping tramp_mapping = {
> 	.name   = "[uprobes-trampoline]",
> 	.pages  = tramp_mapping_pages,
> 	.mremap = tramp_mremap,
> };
> 
> static int __init arch_uprobes_init(void)
> {
> 	...
> 	trampoline_pages[0] = virt_to_page(trampoline_page);
> 	...
> }
> 
> Untested, but it's similar to the stuff the vDSO implementations are
> doing which I am working with at the moment.

nice idea, better than allocating the page, will do that
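
probably something like this on top of your snippet (untested sketch):

static int __init arch_uprobes_init(void)
{
	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;

	/* the page is static now, so a plain copy is enough */
	memcpy(trampoline_page, uprobe_trampoline_entry, size);
	tramp_mapping_pages[0] = virt_to_page(trampoline_page);
	return 0;
}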

> 
> > > 
> > > Also this seems to waste the page on 32bit kernels.
> > 
> > it's inside CONFIG_X86_64 ifdef
> > 
> > > 
> > > > +	if (!page)
> > > > +		return -ENOMEM;
> > > > +	pages[0] = page;
> > > > +	tramp_mapping.pages = (struct page **) &pages;
> > > 
> > > tramp_mapping.pages = pages; ?
> > 
> > I think the compiler will cry about *pages[2] vs **pages types mismatch,
> > but I'll double check that
> 
> It compiles for me.

ok

thanks,
jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 18:39         ` Peter Zijlstra
@ 2024-12-13 21:52           ` Jiri Olsa
  2024-12-13 21:59             ` Andrii Nakryiko
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-13 21:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 07:39:54PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> > On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > > hi,
> > > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > > nop instruction.
> > > > > > 
> > > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > > possible multiple original instructions and its related issues. The
> > > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > > with that.
> > > > > > 
> > > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > > 
> > > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > > one of our trampolines and reject any and all other usage.
> > > > > 
> > > > > The reason to do this is that we can then delete all this code the
> > > > > moment it becomes irrelevant without having to worry userspace might be
> > > > > 'creative' somewhere.
> > > > 
> > > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > > 
> > > >         /* Allow execution only from uprobe trampolines. */
> > > >         vma = vma_lookup(current->mm, regs->ip);
> > > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > > >                 force_sig(SIGILL);
> > > >                 return -1;
> > > >         }
> > > 
> > > Ah, right I missed that. Doesn't that need more locking through? The
> > > moment vma_lookup() returns that vma can go bad.
> > 
> > ugh yes.. I guess mmap_read_lock(current->mm) should do, will check
> 
> If you check
> tip/perf/core:kernel/events/uprobe.c:find_active_uprobe_speculative()
> you'll find means of doing it locklessly using RCU.

right, will use that

thanks,
jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench
  2024-12-11 13:34 ` [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
@ 2024-12-13 21:57   ` Andrii Nakryiko
  2024-12-16  7:56     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:57 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:36 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Add 5-byte nop uprobe trigger bench (x86_64 specific) to measure
> uprobes/uretprobes on top of nop5 instruction.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/testing/selftests/bpf/bench.c           | 12 ++++++
>  .../selftests/bpf/benchs/bench_trigger.c      | 42 +++++++++++++++++++
>  .../selftests/bpf/benchs/run_bench_uprobes.sh |  2 +-
>  3 files changed, 55 insertions(+), 1 deletion(-)
>

[...]

>  static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
>  {
>         size_t uprobe_offset;
> @@ -448,6 +462,28 @@ static void uretprobe_multi_ret_setup(void)
>         usetup(true, true /* use_multi */, &uprobe_target_ret);
>  }
>
> +#ifdef __x86_64__
> +static void uprobe_nop5_setup(void)
> +{
> +       usetup(false, false /* !use_multi */, &uprobe_target_nop5);
> +}
> +
> +static void uretprobe_nop5_setup(void)
> +{
> +       usetup(false, false /* !use_multi */, &uprobe_target_nop5);
> +}

true /* use_retprobe */

that's the problem with bench setup, right?
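
i.e. something like this is probably what was intended (sketch; if so it
might also explain the suspiciously fast uretprobe-nop5 numbers):

static void uretprobe_nop5_setup(void)
{
	usetup(true /* use_retprobe */, false /* !use_multi */, &uprobe_target_nop5);
}

static void uretprobe_multi_nop5_setup(void)
{
	usetup(true /* use_retprobe */, true /* use_multi */, &uprobe_target_nop5);
}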

> +
> +static void uprobe_multi_nop5_setup(void)
> +{
> +       usetup(false, true /* use_multi */, &uprobe_target_nop5);
> +}
> +
> +static void uretprobe_multi_nop5_setup(void)
> +{
> +       usetup(false, true /* use_multi */, &uprobe_target_nop5);
> +}
> +#endif
> +
>  const struct bench bench_trig_syscall_count = {
>         .name = "trig-syscall-count",
>         .validate = trigger_validate,

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines
  2024-12-13 13:42     ` Jiri Olsa
@ 2024-12-13 21:58       ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 5:42 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 05:01:52PM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> > > ---
> > >  include/linux/uprobes.h |  12 +++++
> > >  kernel/events/uprobes.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/fork.c           |   1 +
> > >  3 files changed, 127 insertions(+)
> > >
> >
> > Ran out of time for today, will continue tomorrow for the rest of
> > patches. Some comments below.
>
> thanks!
>
> >
> > The numbers are really encouraging, though!
> >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index 8843b7f99ed0..c4ee755ca2a1 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -16,6 +16,7 @@
> > >  #include <linux/types.h>
> > >  #include <linux/wait.h>
> > >  #include <linux/timer.h>
> > > +#include <linux/mutex.h>
> > >
> > >  struct uprobe;
> > >  struct vm_area_struct;
> > > @@ -172,6 +173,13 @@ struct xol_area;
> > >
> > >  struct uprobes_state {
> > >         struct xol_area         *xol_area;
> > > +       struct hlist_head       tramp_head;
> > > +};
> > > +
> >
> > should we make uprobe_state be linked by a pointer from mm_struct
> > instead of increasing mm for each added field? right now it's
> > embedded, I don't think it's problematic to allocate it on demand and
> > keep it until mm_struct is freed
>
> seems like a good idea, I'll check on that
>
> >
> > > +struct uprobe_trampoline {
> > > +       struct hlist_node       node;
> > > +       unsigned long           vaddr;
> > > +       atomic64_t              ref;
> > >  };
> > >
> > >  extern void __init uprobes_init(void);
> > > @@ -220,6 +228,10 @@ extern int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *p
> > >                                      unsigned long vaddr, uprobe_opcode_t *new_opcode,
> > >                                      int nbytes);
> > >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes);
> > > +extern struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr);
> > > +extern void uprobe_trampoline_put(struct uprobe_trampoline *area);
> > > +extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> > > +extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void);
> > >  #else /* !CONFIG_UPROBES */
> > >  struct uprobes_state {
> > >  };
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 8068f91de9e3..f57918c624da 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -615,6 +615,118 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> > >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE);
> > >  }
> > >
> > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> >
> > bikeshedding some more, I still find "is_callable" confusing. How
> > about "is_reachable_by_call"? slightly verbose, but probably more
> > meaningful?
>
> yep, more precise, will change
>
> >
> > > +{
> > > +       return false;
> > > +}
> > > +
> > > +const struct vm_special_mapping * __weak arch_uprobe_trampoline_mapping(void)
> > > +{
> > > +       return NULL;
> > > +}
> > > +
> > > +static unsigned long find_nearest_page(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma, *prev;
> > > +       VMA_ITERATOR(vmi, mm, 0);
> > > +
> > > +       prev = vma_next(&vmi);
> >
> > minor: we are missing an opportunity to add something between
> > [PAGE_SIZE, <first_vma_start>). Probably fine, but why not?
>
> true, will add that check
>
> >
> > > +       vma = vma_next(&vmi);
> > > +       while (vma) {
> > > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE) {
> > > +                       if (arch_uprobe_is_callable(prev->vm_end, vaddr))
> > > +                               return prev->vm_end;
> > > +                       if (arch_uprobe_is_callable(vma->vm_start - PAGE_SIZE, vaddr))
> > > +                               return vma->vm_start - PAGE_SIZE;
> > > +               }
> > > +
> > > +               prev = vma;
> > > +               vma = vma_next(&vmi);
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> >
> > [...]
> >
> > > +struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > +{
> > > +       struct uprobes_state *state = &current->mm->uprobes_state;
> > > +       struct uprobe_trampoline *tramp = NULL;
> > > +
> > > +       hlist_for_each_entry(tramp, &state->tramp_head, node) {
> > > +               if (arch_uprobe_is_callable(tramp->vaddr, vaddr)) {
> > > +                       atomic64_inc(&tramp->ref);
> > > +                       return tramp;
> > > +               }
> > > +       }
> > > +
> > > +       tramp = create_uprobe_trampoline(vaddr);
> > > +       if (!tramp)
> > > +               return NULL;
> > > +
> > > +       hlist_add_head(&tramp->node, &state->tramp_head);
> > > +       return tramp;
> > > +}
> > > +
> > > +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > > +{
> > > +       hlist_del(&tramp->node);
> > > +       kfree(tramp);
> >
> > hmm... shouldn't this be RCU-delayed (RCU Tasks Trace for uprobes),
> > otherwise we might have some CPU executing code in that trampoline,
> > no?
>
> so we call destroy_uprobe_trampoline in 2 scenarios:
>
>   - from uprobe_trampoline_put (in __arch_uprobe_optimize) when we failed
>     to optimize the uprobe, so no task can execute it at that point
>
>   - from clear_tramp_head as part of the uprobe trampolines cleanup
>     (__mmput -> uprobe_clear_state) at which point the task should be dead

makes sense, I've been overcautious

>
> jirka
>
> >
> > > +}
> > > +
> >
> > [...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
  2024-12-13 10:49   ` Peter Zijlstra
@ 2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-15 12:06   ` David Laight
  2 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
>
> The current uprobe execution goes through following:
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
>
> The optimized uprobe path
>
>   - checks the original instruction is 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline and overwrites original
>     instruction (5-byte nop) with call to user space trampoline
>   - the user space trampoline executes uprobe syscall that calls related uprobe
>     consumers
>   - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> original instruction, but we could use nop5 as USDT probe instruction (which
> uses single byte nop ATM) and speed up the USDT probes.
>
> This patch overloads related arch functions in uprobe_write_opcode and
> set_orig_insn so they can install call instruction if needed.
>
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> first uprobe hit. I originally had it called on uprobe installation but then
> it clashed with elf loader, because the user space trampoline was added in a
> place where loader might need to put elf segments, so I decided to do it after
> first uprobe hit when loading is done.
>
> We do not unmap and release uprobe trampoline when it's no longer needed,
> because there's no easy way to make sure none of the threads is still
> inside the trampoline. But we do not waste memory, because there's just
> single page for all the uprobe trampoline mappings.
>
> We do waste a frame on page mapping for every 4GB by keeping the uprobe
> trampoline page mapped, but that seems ok.
>
> Attaching the speed up from benchs/run_bench_uprobes.sh script:
>
> current:
>
>      uprobe-nop     :    3.281 ± 0.003M/s
>      uprobe-push    :    3.085 ± 0.003M/s
>      uprobe-ret     :    1.130 ± 0.000M/s
>  --> uprobe-nop5    :    3.276 ± 0.007M/s
>      uretprobe-nop  :    1.716 ± 0.016M/s
>      uretprobe-push :    1.651 ± 0.017M/s
>      uretprobe-ret  :    0.846 ± 0.006M/s
>  --> uretprobe-nop5 :    3.279 ± 0.002M/s
>
> after the change:
>
>      uprobe-nop     :    3.246 ± 0.004M/s
>      uprobe-push    :    3.057 ± 0.000M/s
>      uprobe-ret     :    1.113 ± 0.003M/s
>  --> uprobe-nop5    :    6.751 ± 0.037M/s
>      uretprobe-nop  :    1.740 ± 0.015M/s
>      uretprobe-push :    1.677 ± 0.018M/s
>      uretprobe-ret  :    0.852 ± 0.005M/s
>  --> uretprobe-nop5 :    6.769 ± 0.040M/s
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/include/asm/uprobes.h |   7 ++
>  arch/x86/kernel/uprobes.c      | 168 ++++++++++++++++++++++++++++++++-
>  include/linux/uprobes.h        |   1 +
>  kernel/events/uprobes.c        |   8 ++
>  4 files changed, 181 insertions(+), 3 deletions(-)
>

[...]

> +
> +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page,
> +                             unsigned long vaddr, uprobe_opcode_t *new_opcode,
> +                             int nbytes)
> +{
> +       uprobe_opcode_t old_opcode[5];
> +       bool is_call, is_swbp, is_nop5;
> +
> +       if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
> +               return uprobe_verify_opcode(page, vaddr, new_opcode);
> +
> +       /*
> +        * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following
> +        * 5 bytes read won't cross the page boundary.
> +        */
> +       uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5);
> +       is_call = is_call_insn((uprobe_opcode_t *) &old_opcode);
> +       is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode);
> +       is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode);
> +
> +       /*
> +        * We allow the following transitions for optimized uprobes:
> +        *
> +        *   nop5 -> swbp -> call
> +        *   ||      |       |
> +        *   |'--<---'       |
> +        *   '---<-----------'
> +        *
> +        * We return 1 to ack the write, 0 to do nothing, -1 to fail write.
> +        *
> +        * If the current opcode (old_opcode) already has the desired value,
> +        * we do nothing, because we are racing with another thread doing
> +        * the update.
> +        */
> +       switch (nbytes) {
> +       case 5:
> +               if (is_call_insn(new_opcode)) {
> +                       if (is_swbp)
> +                               return 1;
> +                       if (is_call && !memcmp(new_opcode, &old_opcode, 5))
> +                               return 0;
> +               } else {
> +                       if (is_call || is_swbp)
> +                               return 1;
> +                       if (is_nop5)
> +                               return 0;
> +               }
> +               break;
> +       case 1:
> +               if (is_swbp_insn(new_opcode)) {
> +                       if (is_nop5)
> +                               return 1;
> +                       if (is_swbp || is_call)
> +                               return 0;
> +               } else {
> +                       if (is_swbp || is_call)
> +                               return 1;
> +                       if (is_nop5)
> +                               return 0;
> +               }
> +       }
> +       return -1;

nit: -EINVAL?

> +}
> +
> +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes)
> +{
> +       return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn);
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes
  2024-12-11 13:33 ` [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
@ 2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-16  8:32     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Using 5-byte nop for x86 usdt probes so we can switch
> them to optimized uprobes.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>

This change will make it impossible to run latest selftests on older
kernels. Let's do what we did with -DENABLE_ATOMICS_TESTS and allow to
disable this through Makefile, ok?


> diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
> index ca0162b4dc57..7ac9291f45f1 100644
> --- a/tools/testing/selftests/bpf/sdt.h
> +++ b/tools/testing/selftests/bpf/sdt.h
> @@ -234,6 +234,13 @@ __extension__ extern unsigned long long __sdt_unsp;
>  #define _SDT_NOP       nop
>  #endif
>
> +/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
> +#if defined(__x86_64__)
> +# define _SDT_DEF_NOP _SDT_ASM_5(990:  .byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
> +#else
> +# define _SDT_DEF_NOP _SDT_ASM_1(990:  _SDT_NOP)
> +#endif
> +
>  #define _SDT_NOTE_NAME "stapsdt"
>  #define _SDT_NOTE_TYPE 3
>
> @@ -286,7 +293,7 @@ __extension__ extern unsigned long long __sdt_unsp;
>
>  #define _SDT_ASM_BODY(provider, name, pack_args, args, ...)                  \
>    _SDT_DEF_MACROS                                                            \
> -  _SDT_ASM_1(990:      _SDT_NOP)                                             \
> +  _SDT_DEF_NOP                                                               \
>    _SDT_ASM_3(          .pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
>    _SDT_ASM_1(          .balign 4)                                            \
>    _SDT_ASM_3(          .4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE)          \
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test
  2024-12-11 13:33 ` [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
@ 2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-16  7:58     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding tests for optimized uprobe/usdt probes.
>
> Checking that we get the expected trampoline and that the attached bpf
> programs get executed properly.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 203 ++++++++++++++++++
>  .../selftests/bpf/progs/uprobe_optimized.c    |  29 +++
>  2 files changed, 232 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index c397336fe1ed..1dbc26a1130c 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -14,6 +14,8 @@
>  #include <asm/prctl.h>
>  #include "uprobe_syscall.skel.h"
>  #include "uprobe_syscall_executed.skel.h"
> +#include "uprobe_optimized.skel.h"
> +#include "sdt.h"
>
>  __naked unsigned long uretprobe_regs_trigger(void)
>  {
> @@ -350,6 +352,186 @@ static void test_uretprobe_shadow_stack(void)
>
>         ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
>  }
> +
> +#define TRAMP "[uprobes-trampoline]"
> +
> +static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
> +
> +noinline void uprobe_test(void)
> +{
> +       asm volatile ("                                 \n"
> +               ".global uprobe_test_nop5               \n"
> +               ".type uprobe_test_nop5, STT_FUNC       \n"
> +               "uprobe_test_nop5:                      \n"
> +               ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00     \n"
> +       );
> +}
> +
> +extern u8 uprobe_test_nop5[];
> +
> +noinline void usdt_test(void)
> +{
> +       STAP_PROBE(optimized_uprobe, usdt);
> +}
> +
> +static void *find_nop5(void *fn)
> +{
> +       int i;
> +
> +       for (i = 0; i < 10; i++) {
> +               if (!memcmp(nop5, fn + i, 5))
> +                       return fn + i;
> +       }
> +       return NULL;
> +}
> +
> +static int find_uprobes_trampoline(void **start, void **end)
> +{
> +       char line[128];
> +       int ret = -1;
> +       FILE *maps;
> +
> +       maps = fopen("/proc/self/maps", "r");
> +       if (!maps) {
> +               fprintf(stderr, "cannot open maps\n");
> +               return -1;
> +       }
> +
> +       while (fgets(line, sizeof(line), maps)) {
> +               int m = -1;
> +
> +               /* We care only about private r-x mappings. */
> +               if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", start, end, &m) != 2)
> +                       continue;
> +               if (m < 0)
> +                       continue;
> +               if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1)) {
> +                       ret = 0;
> +                       break;
> +               }
> +       }

you could have used PROCMAP_QUERY ;)

> +
> +       fclose(maps);
> +       return ret;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test
  2024-12-11 13:34 ` [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
@ 2024-12-13 21:58   ` Andrii Nakryiko
  2024-12-16  7:59     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Wed, Dec 11, 2024 at 5:36 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding test that makes sure parallel execution of the uprobe and
> attach/detach of optimized uprobe on it works properly.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 82 +++++++++++++++++++
>  1 file changed, 82 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 1dbc26a1130c..eacd14db8f8d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -532,6 +532,81 @@ static void test_uprobe_usdt(void)
>  cleanup:
>         uprobe_optimized__destroy(skel);
>  }
> +
> +static bool race_stop;

volatile?

> +
> +static void *worker_trigger(void *arg)
> +{
> +       unsigned long rounds = 0;
> +
> +       while (!race_stop) {
> +               uprobe_test();
> +               rounds++;
> +       }
> +
> +       printf("tid %d trigger rounds: %lu\n", gettid(), rounds);
> +       return NULL;
> +}
> +
> +static void *worker_attach(void *arg)
> +{
> +       struct uprobe_optimized *skel;
> +       unsigned long rounds = 0;
> +
> +       skel = uprobe_optimized__open_and_load();
> +       if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
> +               goto cleanup;
> +
> +       while (!race_stop) {
> +               skel->links.test_2 = bpf_program__attach_uprobe_multi(skel->progs.test_2, -1,
> +                                               "/proc/self/exe", "uprobe_test_nop5", NULL);
> +               if (!ASSERT_OK_PTR(skel->links.test_2, "bpf_program__attach_uprobe_multi"))
> +                       break;
> +               bpf_link__destroy(skel->links.test_2);
> +               skel->links.test_2 = NULL;
> +               rounds++;
> +       }
> +
> +       printf("tid %d attach rounds: %lu hits: %lu\n", gettid(), rounds, skel->bss->executed);
> +
> +cleanup:
> +       uprobe_optimized__destroy(skel);
> +       return NULL;
> +}
> +
> +static void test_uprobe_race(void)
> +{
> +       int err, i, nr_cpus, nr;
> +       pthread_t *threads;
> +
> +        nr_cpus = libbpf_num_possible_cpus();

check whitespaces

> +       if (!ASSERT_GE(nr_cpus, 0, "nr_cpus"))
> +               return;
> +
> +       nr = nr_cpus * 2;
> +       threads = malloc(sizeof(*threads) * nr);
> +       if (!ASSERT_OK_PTR(threads, "malloc"))
> +               return;
> +
> +       for (i = 0; i < nr_cpus; i++) {
> +               err = pthread_create(&threads[i], NULL, worker_trigger, NULL);
> +               if (!ASSERT_OK(err, "pthread_create"))
> +                       goto cleanup;
> +       }
> +
> +       for (; i < nr; i++) {
> +               err = pthread_create(&threads[i], NULL, worker_attach, NULL);
> +               if (!ASSERT_OK(err, "pthread_create"))
> +                       goto cleanup;
> +       }
> +
> +       sleep(4);
> +
> +cleanup:
> +       race_stop = true;
> +       for (i = 0; i < nr; i++)
> +               pthread_join(threads[i], NULL);

what happens with pthread_join() when called with uninitialized
threads[i] (e.g., when error happens in the middle of creating
threads)?


> +}
>  #else
>  static void test_uretprobe_regs_equal(void)
>  {
> @@ -567,6 +642,11 @@ static void test_uprobe_usdt(void)
>  {
>         test__skip();
>  }
> +
> +static void test_uprobe_race(void)
> +{
> +       test__skip();
> +}
>  #endif
>
>  void test_uprobe_syscall(void)
> @@ -585,4 +665,6 @@ void test_uprobe_syscall(void)
>                 test_uprobe_multi();
>         if (test__start_subtest("uprobe_usdt"))
>                 test_uprobe_usdt();
> +       if (test__start_subtest("uprobe_race"))
> +               test_uprobe_race();
>  }
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64
  2024-12-13 21:52           ` Jiri Olsa
@ 2024-12-13 21:59             ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-13 21:59 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Oleg Nesterov, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 1:52 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 07:39:54PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> > > On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > > > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > > > hi,
> > > > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > > > nop instruction.
> > > > > > >
> > > > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > > > possible multiple original instructions and its related issues. The
> > > > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > > > with that.
> > > > > > >
> > > > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > > >
> > > > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > > > one of our trampolines and reject any and all other usage.
> > > > > >
> > > > > > The reason to do this is that we can then delete all this code the
> > > > > > moment it becomes irrelevant without having to worry userspace might be
> > > > > > 'creative' somewhere.
> > > > >
> > > > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > > >
> > > > >         /* Allow execution only from uprobe trampolines. */
> > > > >         vma = vma_lookup(current->mm, regs->ip);
> > > > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > > > >                 force_sig(SIGILL);
> > > > >                 return -1;
> > > > >         }
> > > >
> > > > Ah, right I missed that. Doesn't that need more locking through? The
> > > > moment vma_lookup() returns that vma can go bad.
> > >
> > > ugh yes.. I guess mmap_read_lock(current->mm) should do, will check
> >
> > If you check
> > tip/perf/core:kernel/events/uprobe.c:find_active_uprobe_speculative()
> > you'll find means of doing it locklessly using RCU.
>
> right, will use that

phew, yep, came here to ask not to add mmap_read_lock() into the hot
path again :)
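
fwiw, the check itself would then look roughly like below (just a sketch:
uprobe_syscall_from_trampoline() is a made-up name, and the mmap_lock
sequence-count begin/retry validation that find_active_uprobe_speculative()
relies on is elided here, it would still be needed to make the lockless
vma_lookup() safe):

	static bool uprobe_syscall_from_trampoline(struct pt_regs *regs)
	{
		struct vm_area_struct *vma;

		guard(rcu)();

		/* incomplete without the mm seqcount validation, see above */
		vma = vma_lookup(current->mm, regs->ip);
		return vma && vma->vm_private_data == (void *)&tramp_mapping;
	}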

>
> thanks,
> jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-13 21:52         ` Jiri Olsa
@ 2024-12-14 13:21           ` Thomas Weißschuh
  2024-12-16  8:03             ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Thomas Weißschuh @ 2024-12-14 13:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On 2024-12-13 22:52:15+0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 04:12:46PM +0100, Thomas Weißschuh wrote:
> 
> SNIP
> 
> > > > > +static int __init arch_uprobes_init(void)
> > > > > +{
> > > > > +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> > > > > +	static struct page *pages[2];
> > > > > +	struct page *page;
> > > > > +
> > > > > +	page = alloc_page(GFP_HIGHUSER);
> > > > 
> > > > That page could be in static memory, removing the need for the explicit
> > > > allocation. It could also be __ro_after_init.
> > > > Then tramp_mapping itself can be const.
> > > 
> > > hum, how would that look like? I think that to get proper page object
> > > you have to call alloc_page or some other page alloc family function..
> > > what do I miss?
> > 
> > static u8 trampoline_page[PAGE_SIZE] __ro_after_init __aligned(PAGE_SIZE);
> > static struct page *tramp_mapping_pages[2] __ro_after_init;
> > 
> > static const struct vm_special_mapping tramp_mapping = {
> > 	.name   = "[uprobes-trampoline]",
> > 	.pages  = tramp_mapping_pages,
> > 	.mremap = tramp_mremap,
> > };
> > 
> > static int __init arch_uprobes_init(void)
> > {
> > 	...
> > 	tramp_mapping_pages[0] = virt_to_page(trampoline_page);
> > 	...
> > }
> > 
> > Untested, but it's similar to the stuff the vDSO implementations are
> > doing which I am working with at the moment.
> 
> nice idea, better than allocating the page, will do that

Or even better yet, just allocate the whole page already in the inline
asm and avoid the copying, too:

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index b2420eeee23a..c5e6ca7f998a 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -462,7 +462,7 @@ SYSCALL_DEFINE0(uprobe)

 asm (
        ".pushsection .rodata\n"
-       ".global uprobe_trampoline_entry\n"
+       ".balign " __stringify(PAGE_SIZE) "\n"
        "uprobe_trampoline_entry:\n"
        "endbr64\n"
        "push %rcx\n"
@@ -474,13 +474,11 @@ asm (
        "pop %r11\n"
        "pop %rcx\n"
        "ret\n"
-       ".global uprobe_trampoline_end\n"
-       "uprobe_trampoline_end:\n"
+       ".balign " __stringify(PAGE_SIZE) "\n"
        ".popsection\n"
 );

-extern __visible u8 uprobe_trampoline_entry[];
-extern __visible u8 uprobe_trampoline_end[];
+extern u8 uprobe_trampoline_entry[];


If you want to keep the copying for some reason, the asm code should be
in the section ".init.rodata" as it's not used afterwards.

(A bit bikesheddy, I admit)


Thomas

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* RE: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
  2024-12-13 10:49   ` Peter Zijlstra
  2024-12-13 21:58   ` Andrii Nakryiko
@ 2024-12-15 12:06   ` David Laight
  2024-12-15 14:14     ` Oleg Nesterov
  2 siblings, 1 reply; 63+ messages in thread
From: David Laight @ 2024-12-15 12:06 UTC (permalink / raw)
  To: 'Jiri Olsa', Oleg Nesterov, Peter Zijlstra,
	Andrii Nakryiko
  Cc: bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

From: Jiri Olsa
> Sent: 11 December 2024 13:34
> 
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
> 
> The current uprobe execution goes through following:
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
> 
> The optimized uprobe path
> 
>   - checks the original instruction is 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline and overwrites original
>     instruction (5-byte nop) with call to user space trampoline
>   - the user space trampoline executes uprobe syscall that calls related uprobe
>     consumers
>   - trampoline returns back to next instruction
...

How on earth can you safely overwrite a randomly aligned 5 byte instruction
that might be being prefetched and executed by another thread of the
same process.

If the instruction doesn't cross a cache line boundary then you might
manage to convince people that an 8-byte write will always be atomic
wrt other cpu reading instructions.
But you can't guarantee the alignment.

You might manage with the 7 byte sequence:
	br .+7; call addr
and then update 'addr' before changing the branch offset from 05 to 00.
But even that may not be safe if 'addr' crosses a cache line boundary.

You could replace a one byte nop (0x90) with a breakpoint (0xcc) and
then return to the instruction after the breakpoint.
That would save having to emulate or single step the overwritten
instruction.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-15 12:06   ` David Laight
@ 2024-12-15 14:14     ` Oleg Nesterov
  2024-12-16  8:08       ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2024-12-15 14:14 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jiri Olsa', Peter Zijlstra, Andrii Nakryiko,
	bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

On 12/15, David Laight wrote:
>
> From: Jiri Olsa
> > The optimized uprobe path
> >
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline and overwrites original
> >     instruction (5-byte nop) with call to user space trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> ...
>
> How on earth can you safely overwrite a randomly aligned 5 byte instruction
> that might be being prefetched and executed by another thread of the
> same process.

uprobe_write_opcode() doesn't overwrite the instruction in place.

It creates the new page with the same content, overwrites the probed insn in
that page, then calls __replace_page().
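
Roughly, as a simplified sketch (helper names are approximate, error
handling, refcounting and the opcode verification are dropped):

	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);

	copy_highpage(new_page, old_page);            /* same content as the old page...  */
	copy_to_page(new_page, vaddr, insn, nbytes);  /* ...except the probed instruction */

	__replace_page(vma, vaddr, old_page, new_page); /* swap the PTE and flush the TLB */

So the mapped text is never patched in place, the old page is replaced
with a fully written copy.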

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench
  2024-12-13 21:57   ` Andrii Nakryiko
@ 2024-12-16  7:56     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  7:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 01:57:56PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:36 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Add 5-byte nop uprobe trigger bench (x86_64 specific) to measure
> > uprobes/uretprobes on top of nop5 instruction.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  tools/testing/selftests/bpf/bench.c           | 12 ++++++
> >  .../selftests/bpf/benchs/bench_trigger.c      | 42 +++++++++++++++++++
> >  .../selftests/bpf/benchs/run_bench_uprobes.sh |  2 +-
> >  3 files changed, 55 insertions(+), 1 deletion(-)
> >
> 
> [...]
> 
> >  static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
> >  {
> >         size_t uprobe_offset;
> > @@ -448,6 +462,28 @@ static void uretprobe_multi_ret_setup(void)
> >         usetup(true, true /* use_multi */, &uprobe_target_ret);
> >  }
> >
> > +#ifdef __x86_64__
> > +static void uprobe_nop5_setup(void)
> > +{
> > +       usetup(false, false /* !use_multi */, &uprobe_target_nop5);
> > +}
> > +
> > +static void uretprobe_nop5_setup(void)
> > +{
> > +       usetup(false, false /* !use_multi */, &uprobe_target_nop5);
> > +}
> 
> true /* use_retprobe */
> 
> that's the problem with bench setup, right?

yes, but there's more ;-)

we also need a change in arch_uretprobe_hijack_return_addr to skip
the extra 3 values (pushed on the stack by the uprobe trampoline) when
hijacking the return address, I'll send a new version
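
fwiw the setup part of the fix is just flipping the retprobe flag for
the two uretprobe nop5 variants, e.g.:

	static void uretprobe_nop5_setup(void)
	{
		usetup(true, false /* !use_multi */, &uprobe_target_nop5);
	}

	static void uretprobe_multi_nop5_setup(void)
	{
		usetup(true, true /* use_multi */, &uprobe_target_nop5);
	}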

jirka

> 
> > +
> > +static void uprobe_multi_nop5_setup(void)
> > +{
> > +       usetup(false, true /* use_multi */, &uprobe_target_nop5);
> > +}
> > +
> > +static void uretprobe_multi_nop5_setup(void)
> > +{
> > +       usetup(false, true /* use_multi */, &uprobe_target_nop5);
> > +}
> > +#endif
> > +
> >  const struct bench bench_trig_syscall_count = {
> >         .name = "trig-syscall-count",
> >         .validate = trigger_validate,
> 
> [...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test
  2024-12-13 21:58   ` Andrii Nakryiko
@ 2024-12-16  7:58     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  7:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 01:58:38PM -0800, Andrii Nakryiko wrote:

SNIP

> > +static int find_uprobes_trampoline(void **start, void **end)
> > +{
> > +       char line[128];
> > +       int ret = -1;
> > +       FILE *maps;
> > +
> > +       maps = fopen("/proc/self/maps", "r");
> > +       if (!maps) {
> > +               fprintf(stderr, "cannot open maps\n");
> > +               return -1;
> > +       }
> > +
> > +       while (fgets(line, sizeof(line), maps)) {
> > +               int m = -1;
> > +
> > +               /* We care only about private r-x mappings. */
> > +               if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", start, end, &m) != 2)
> > +                       continue;
> > +               if (m < 0)
> > +                       continue;
> > +               if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1)) {
> > +                       ret = 0;
> > +                       break;
> > +               }
> > +       }
> 
> you could have used PROCMAP_QUERY ;)

true ;-) will check on that in new version
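
a rough sketch of that, assuming the PROCMAP_QUERY ioctl and struct
procmap_query from linux/fs.h (available since 6.11), could look like:

	/* needs <linux/fs.h>, <sys/ioctl.h>, <fcntl.h>, <string.h> */
	static int find_uprobes_trampoline(void **start, void **end)
	{
		char name[64];
		struct procmap_query q = {
			.size          = sizeof(q),
			.query_flags   = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
					 PROCMAP_QUERY_VMA_EXECUTABLE,
			.vma_name_addr = (__u64)(unsigned long)name,
		};
		int fd, err = -1;

		fd = open("/proc/self/maps", O_RDONLY);
		if (fd < 0)
			return -1;

		/* walk executable vmas and look for the trampoline by name */
		for (q.query_addr = 0; ; q.query_addr = q.vma_end) {
			q.vma_name_size = sizeof(name);
			if (ioctl(fd, PROCMAP_QUERY, &q))
				break;
			if (q.vma_name_size && !strncmp(name, TRAMP, sizeof(TRAMP) - 1)) {
				*start = (void *)(unsigned long)q.vma_start;
				*end   = (void *)(unsigned long)q.vma_end;
				err = 0;
				break;
			}
		}

		close(fd);
		return err;
	}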

thanks,
jirka

> 
> > +
> > +       fclose(maps);
> > +       return ret;
> > +}
> > +
> 
> [...]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test
  2024-12-13 21:58   ` Andrii Nakryiko
@ 2024-12-16  7:59     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  7:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 01:58:43PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:36 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding test that makes sure parallel execution of the uprobe and
> > attach/detach of optimized uprobe on it works properly.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 82 +++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 1dbc26a1130c..eacd14db8f8d 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -532,6 +532,81 @@ static void test_uprobe_usdt(void)
> >  cleanup:
> >         uprobe_optimized__destroy(skel);
> >  }
> > +
> > +static bool race_stop;
> 
> volatile?

ok

> 
> > +
> > +static void *worker_trigger(void *arg)
> > +{
> > +       unsigned long rounds = 0;
> > +
> > +       while (!race_stop) {
> > +               uprobe_test();
> > +               rounds++;
> > +       }
> > +
> > +       printf("tid %d trigger rounds: %lu\n", gettid(), rounds);
> > +       return NULL;
> > +}
> > +
> > +static void *worker_attach(void *arg)
> > +{
> > +       struct uprobe_optimized *skel;
> > +       unsigned long rounds = 0;
> > +
> > +       skel = uprobe_optimized__open_and_load();
> > +       if (!ASSERT_OK_PTR(skel, "uprobe_optimized__open_and_load"))
> > +               goto cleanup;
> > +
> > +       while (!race_stop) {
> > +               skel->links.test_2 = bpf_program__attach_uprobe_multi(skel->progs.test_2, -1,
> > +                                               "/proc/self/exe", "uprobe_test_nop5", NULL);
> > +               if (!ASSERT_OK_PTR(skel->links.test_2, "bpf_program__attach_uprobe_multi"))
> > +                       break;
> > +               bpf_link__destroy(skel->links.test_2);
> > +               skel->links.test_2 = NULL;
> > +               rounds++;
> > +       }
> > +
> > +       printf("tid %d attach rounds: %lu hits: %lu\n", gettid(), rounds, skel->bss->executed);
> > +
> > +cleanup:
> > +       uprobe_optimized__destroy(skel);
> > +       return NULL;
> > +}
> > +
> > +static void test_uprobe_race(void)
> > +{
> > +       int err, i, nr_cpus, nr;
> > +       pthread_t *threads;
> > +
> > +        nr_cpus = libbpf_num_possible_cpus();
> 
> check whitespaces

ok

> 
> > +       if (!ASSERT_GE(nr_cpus, 0, "nr_cpus"))
> > +               return;
> > +
> > +       nr = nr_cpus * 2;
> > +       threads = malloc(sizeof(*threads) * nr);
> > +       if (!ASSERT_OK_PTR(threads, "malloc"))
> > +               return;
> > +
> > +       for (i = 0; i < nr_cpus; i++) {
> > +               err = pthread_create(&threads[i], NULL, worker_trigger, NULL);
> > +               if (!ASSERT_OK(err, "pthread_create"))
> > +                       goto cleanup;
> > +       }
> > +
> > +       for (; i < nr; i++) {
> > +               err = pthread_create(&threads[i], NULL, worker_attach, NULL);
> > +               if (!ASSERT_OK(err, "pthread_create"))
> > +                       goto cleanup;
> > +       }
> > +
> > +       sleep(4);
> > +
> > +cleanup:
> > +       race_stop = true;
> > +       for (i = 0; i < nr; i++)
> > +               pthread_join(threads[i], NULL);
> 
> what happens with pthread_join() when called with uninitialized
> threads[i] (e.g., when error happens in the middle of creating
> threads)?

yes, I'll do the proper cleanup in new version
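
something like keeping track of how many threads actually got created
and joining only those, e.g.:

	int created = 0;

	for (i = 0; i < nr; i++) {
		err = pthread_create(&threads[i], NULL,
				     i < nr_cpus ? worker_trigger : worker_attach, NULL);
		if (!ASSERT_OK(err, "pthread_create"))
			break;
		created++;
	}

	if (created == nr)
		sleep(4);

	race_stop = true;
	for (i = 0; i < created; i++)
		pthread_join(threads[i], NULL);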

thanks,
jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe
  2024-12-14 13:21           ` Thomas Weißschuh
@ 2024-12-16  8:03             ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  8:03 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf,
	Song Liu, Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Sat, Dec 14, 2024 at 02:21:43PM +0100, Thomas Weißschuh wrote:
> On 2024-12-13 22:52:15+0100, Jiri Olsa wrote:
> > On Fri, Dec 13, 2024 at 04:12:46PM +0100, Thomas Weißschuh wrote:
> > 
> > SNIP
> > 
> > > > > > +static int __init arch_uprobes_init(void)
> > > > > > +{
> > > > > > +	unsigned long size = uprobe_trampoline_end - uprobe_trampoline_entry;
> > > > > > +	static struct page *pages[2];
> > > > > > +	struct page *page;
> > > > > > +
> > > > > > +	page = alloc_page(GFP_HIGHUSER);
> > > > > 
> > > > > That page could be in static memory, removing the need for the explicit
> > > > > allocation. It could also be __ro_after_init.
> > > > > Then tramp_mapping itself can be const.
> > > > 
> > > > hum, how would that look like? I think that to get proper page object
> > > > you have to call alloc_page or some other page alloc family function..
> > > > what do I miss?
> > > 
> > > static u8 trampoline_page[PAGE_SIZE] __ro_after_init __aligned(PAGE_SIZE);
> > > static struct page *tramp_mapping_pages[2] __ro_after_init;
> > > 
> > > static const struct vm_special_mapping tramp_mapping = {
> > > 	.name   = "[uprobes-trampoline]",
> > > 	.pages  = tramp_mapping_pages,
> > > 	.mremap = tramp_mremap,
> > > };
> > > 
> > > static int __init arch_uprobes_init(void)
> > > {
> > > 	...
> > > 	tramp_mapping_pages[0] = virt_to_page(trampoline_page);
> > > 	...
> > > }
> > > 
> > > Untested, but it's similar to the stuff the vDSO implementations are
> > > doing which I am working with at the moment.
> > 
> > nice idea, better than allocating the page, will do that
> 
> Or even better yet, just allocate the whole page already in the inline
> asm and avoid the copying, too:
> 
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index b2420eeee23a..c5e6ca7f998a 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -462,7 +462,7 @@ SYSCALL_DEFINE0(uprobe)
> 
>  asm (
>         ".pushsection .rodata\n"
> -       ".global uprobe_trampoline_entry\n"
> +       ".balign " __stringify(PAGE_SIZE) "\n"
>         "uprobe_trampoline_entry:\n"
>         "endbr64\n"
>         "push %rcx\n"
> @@ -474,13 +474,11 @@ asm (
>         "pop %r11\n"
>         "pop %rcx\n"
>         "ret\n"
> -       ".global uprobe_trampoline_end\n"
> -       "uprobe_trampoline_end:\n"
> +       ".balign " __stringify(PAGE_SIZE) "\n"
>         ".popsection\n"
>  );
> 
> -extern __visible u8 uprobe_trampoline_entry[];
> -extern __visible u8 uprobe_trampoline_end[];
> +extern u8 uprobe_trampoline_entry[];
> 
> 
> If you want to keep the copying for some reason, the asm code should be
> in the section ".init.rodata" as it's not used afterwards.

perfect, no need for copy, I'll do what you propose above

> 
> (A bit bikesheddy, I admit)

thanks for the review,

jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-15 14:14     ` Oleg Nesterov
@ 2024-12-16  8:08       ` Jiri Olsa
  2024-12-16  9:18         ` David Laight
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  8:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Laight, Peter Zijlstra, Andrii Nakryiko,
	bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

On Sun, Dec 15, 2024 at 03:14:13PM +0100, Oleg Nesterov wrote:
> On 12/15, David Laight wrote:
> >
> > From: Jiri Olsa
> > > The optimized uprobe path
> > >
> > >   - checks the original instruction is 5-byte nop (plus other checks)
> > >   - adds (or uses existing) user space trampoline and overwrites original
> > >     instruction (5-byte nop) with call to user space trampoline
> > >   - the user space trampoline executes uprobe syscall that calls related uprobe
> > >     consumers
> > >   - trampoline returns back to next instruction
> > ...
> >
> > How on earth can you safely overwrite a randomly aligned 5 byte instruction
> > that might be being prefetched and executed by another thread of the
> > same process.
> 
> uprobe_write_opcode() doesn't overwrite the instruction in place.
> 
> It creates the new page with the same content, overwrites the probed insn in
> that page, then calls __replace_page().

tbh I wasn't completely sure about that as well, I added selftest
in patch #11 trying to hit the issue you described and it seems to
work ok

jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes
  2024-12-13 21:58   ` Andrii Nakryiko
@ 2024-12-16  8:32     ` Jiri Olsa
  2024-12-16 23:06       ` Andrii Nakryiko
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16  8:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Fri, Dec 13, 2024 at 01:58:33PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Using 5-byte nop for x86 usdt probes so we can switch
> > them to optimized uprobes.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> 
> This change will make it impossible to run latest selftests on older
> kernels. Let's do what we did with -DENABLE_ATOMICS_TESTS and allow to
> disable this through Makefile, ok?

ok, I wanted to start addressing this after this version

so the problem is using this macro with nop5 in an application running
on older kernels that do not have nop5 emulation and the uprobe syscall
optimization, because an uprobe/usdt on top of nop5 (single-stepped)
will be slower than on top of the current nop1 (emulated)

AFAICS selftests should still work, just a bit slower because nop5 is
single-stepped there instead of emulated

one part of the solution would be to backport [1] to stable kernels
which is an easy fix (even though it needs changes now)

if that's not enough we'd need to come up with that nop1/nop5 macro
solution, where tooling (libbpf with extra data in usdt note) would
install uprobe on top of nop1 on older kernels and on top of nop5 on
new ones.. but that'd need more work of course
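
as for the build time knob itself, that part is trivial, something along
these lines on top of the sdt.h change (the SDT_DISABLE_NOP5 name is made
up here, it'd be passed from the Makefile the same way as
ENABLE_ATOMICS_TESTS):

	/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
	#if defined(__x86_64__) && !defined(SDT_DISABLE_NOP5)
	# define _SDT_DEF_NOP _SDT_ASM_5(990:	.byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
	#else
	# define _SDT_DEF_NOP _SDT_ASM_1(990:	_SDT_NOP)
	#endif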

jirka


[1] patch#7 - uprobes/x86: Add support to emulate nop5 instruction


> 
> 
> > diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
> > index ca0162b4dc57..7ac9291f45f1 100644
> > --- a/tools/testing/selftests/bpf/sdt.h
> > +++ b/tools/testing/selftests/bpf/sdt.h
> > @@ -234,6 +234,13 @@ __extension__ extern unsigned long long __sdt_unsp;
> >  #define _SDT_NOP       nop
> >  #endif
> >
> > +/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
> > +#if defined(__x86_64__)
> > +# define _SDT_DEF_NOP _SDT_ASM_5(990:  .byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
> > +#else
> > +# define _SDT_DEF_NOP _SDT_ASM_1(990:  _SDT_NOP)
> > +#endif
> > +
> >  #define _SDT_NOTE_NAME "stapsdt"
> >  #define _SDT_NOTE_TYPE 3
> >
> > @@ -286,7 +293,7 @@ __extension__ extern unsigned long long __sdt_unsp;
> >
> >  #define _SDT_ASM_BODY(provider, name, pack_args, args, ...)                  \
> >    _SDT_DEF_MACROS                                                            \
> > -  _SDT_ASM_1(990:      _SDT_NOP)                                             \
> > +  _SDT_DEF_NOP                                                               \
> >    _SDT_ASM_3(          .pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
> >    _SDT_ASM_1(          .balign 4)                                            \
> >    _SDT_ASM_3(          .4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE)          \
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16  8:08       ` Jiri Olsa
@ 2024-12-16  9:18         ` David Laight
  2024-12-16 10:12           ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: David Laight @ 2024-12-16  9:18 UTC (permalink / raw)
  To: 'Jiri Olsa', Oleg Nesterov
  Cc: Peter Zijlstra, Andrii Nakryiko, bpf@vger.kernel.org, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org

From: Jiri Olsa
> Sent: 16 December 2024 08:09
> 
> On Sun, Dec 15, 2024 at 03:14:13PM +0100, Oleg Nesterov wrote:
> > On 12/15, David Laight wrote:
> > >
> > > From: Jiri Olsa
> > > > The optimized uprobe path
> > > >
> > > >   - checks the original instruction is 5-byte nop (plus other checks)
> > > >   - adds (or uses existing) user space trampoline and overwrites original
> > > >     instruction (5-byte nop) with call to user space trampoline
> > > >   - the user space trampoline executes uprobe syscall that calls related uprobe
> > > >     consumers
> > > >   - trampoline returns back to next instruction
> > > ...
> > >
> > > How on earth can you safely overwrite a randomly aligned 5 byte instruction
> > > that might be being prefetched and executed by another thread of the
> > > same process.
> >
> > uprobe_write_opcode() doesn't overwrite the instruction in place.
> >
> > It creates the new page with the same content, overwrites the probed insn in
> > that page, then calls __replace_page().
> 
> tbh I wasn't completely sure about that as well, I added selftest
> in patch #11 trying to hit the issue you described and it seems to
> work ok

Actually hitting the timing window is hard.
So 'seems to work ok' doesn't really mean much :-)
It all depends on how hard __replace_page() tries to be atomic.
The page has to change from one backed by the executable to a private
one backed by swap - otherwise you can't write to it.

But the problems arise when the instruction prefetch unit has read
part of the 5-byte instruction (it might even only read half a cache
line at a time).
I'm not sure how long the pipeline can sit in that state - but I
can do a memory read of a PCIe address that takes ~3000 clocks.
(And a misaligned AVX-512 read is probably eight 8-byte transfers.)

So I think you need to force an interrupt while the PTE is invalid.
And that need to be simultaneous on all cpu running that process.
Stopping the process using ptrace would do it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16  9:18         ` David Laight
@ 2024-12-16 10:12           ` Oleg Nesterov
  2024-12-16 11:10             ` David Laight
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2024-12-16 10:12 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jiri Olsa', Peter Zijlstra, Andrii Nakryiko,
	bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

David,

let me say first that my understanding of this magic is very limited,
please correct me.

On 12/16, David Laight wrote:
>
> It all depends on how hard __replace_page() tries to be atomic.
> The page has to change from one backed by the executable to a private
> one backed by swap - otherwise you can't write to it.

This is what uprobe_write_opcode() does,

> But the problems arise when the instruction prefetch unit has read
> part of the 5-byte instruction (it might even only read half a cache
> line at a time).
> I'm not sure how long the pipeline can sit in that state - but I
> can do a memory read of a PCIe address that takes ~3000 clocks.
> (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
>
> So I think you need to force an interrupt while the PTE is invalid.
> And that need to be simultaneous on all cpu running that process.

__replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().

That's not enough?

> Stopping the process using ptrace would do it.

Not an option :/

Oleg.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16 10:12           ` Oleg Nesterov
@ 2024-12-16 11:10             ` David Laight
  2024-12-16 12:22               ` Oleg Nesterov
  0 siblings, 1 reply; 63+ messages in thread
From: David Laight @ 2024-12-16 11:10 UTC (permalink / raw)
  To: 'Oleg Nesterov', linux-mm@kvack.org
  Cc: 'Jiri Olsa', Peter Zijlstra, Andrii Nakryiko,
	bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

From: Oleg Nesterov
> Sent: 16 December 2024 10:13
> 
> David,
> 
> let me say first that my understanding of this magic is very limited,
> please correct me.

I only (half) understand what the 'magic' has to accomplish and
some of the pitfalls.

I've copied linux-mm - someone there might know more.

> On 12/16, David Laight wrote:
> >
> > It all depends on how hard __replace_page() tries to be atomic.
> > The page has to change from one backed by the executable to a private
> > one backed by swap - otherwise you can't write to it.
> 
> This is what uprobe_write_opcode() does,

And will be enough for single byte changes - they'll be picked up
at some point after the change.

> > But the problems arise when the instruction prefetch unit has read
> > part of the 5-byte instruction (it might even only read half a cache
> > line at a time).
> > I'm not sure how long the pipeline can sit in that state - but I
> > can do a memory read of a PCIe address that takes ~3000 clocks.
> > (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
> >
> > So I think you need to force an interrupt while the PTE is invalid.
> > And that need to be simultaneous on all cpu running that process.
> 
> __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().
> 
> That's not enough?

I doubt it. As I understand it.
The hardware page tables will be shared by all the threads of a process.
So unless you hard synchronise all the cpu (and flush the TLB) while the
PTE is being changed there is always the possibility of a cpu picking up
the new PTE before the IPI that (I presume) flush_tlb_page() generates
is processed.
If that happens when the instruction you are patching is part-read into
the instruction decode buffer then you'll execute a mismatch of the two
instructions.

I can't remember the outcome of discussions about live-patching kernel
code - and I'm sure that was aligned 32bit writes.

> 
> > Stopping the process using ptrace would do it.
> 
> Not an option :/

Thought you'd say that.

	David

> 
> Oleg.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16 11:10             ` David Laight
@ 2024-12-16 12:22               ` Oleg Nesterov
  2024-12-16 12:50                 ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: Oleg Nesterov @ 2024-12-16 12:22 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mm@kvack.org, 'Jiri Olsa', Peter Zijlstra,
	Andrii Nakryiko, bpf@vger.kernel.org, Song Liu, Yonghong Song,
	John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
	Alan Maguire, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org

OK, thanks, I am starting to share your concerns...

Oleg.

On 12/16, David Laight wrote:
>
> From: Oleg Nesterov
> > Sent: 16 December 2024 10:13
> >
> > David,
> >
> > let me say first that my understanding of this magic is very limited,
> > please correct me.
>
> I only (half) understand what the 'magic' has to accomplish and
> some of the pitfalls.
>
> I've copied linux-mm - someone there might know more.
>
> > On 12/16, David Laight wrote:
> > >
> > > It all depends on how hard __replace_page() tries to be atomic.
> > > The page has to change from one backed by the executable to a private
> > > one backed by swap - otherwise you can't write to it.
> >
> > This is what uprobe_write_opcode() does,
>
> And will be enough for single byte changes - they'll be picked up
> at some point after the change.
>
> > > But the problems arise when the instruction prefetch unit has read
> > > part of the 5-byte instruction (it might even only read half a cache
> > > line at a time).
> > > I'm not sure how long the pipeline can sit in that state - but I
> > > can do a memory read of a PCIe address that takes ~3000 clocks.
> > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
> > >
> > > So I think you need to force an interrupt while the PTE is invalid.
> > > And that need to be simultaneous on all cpu running that process.
> >
> > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().
> >
> > That's not enough?
>
> I doubt it. As I understand it.
> The hardware page tables will be shared by all the threads of a process.
> So unless you hard synchronise all the cpu (and flush the TLB) while the
> PTE is being changed there is always the possibility of a cpu picking up
> the new PTE before the IPI that (I presume) flush_tlb_page() generates
> is processed.
> If that happens when the instruction you are patching is part-read into
> the instruction decode buffer then you'll execute a mismatch of the two
> instructions.
>
> I can't remember the outcome of discussions about live-patching kernel
> code - and I'm sure that was aligned 32bit writes.
>
> >
> > > Stopping the process using ptrace would do it.
> >
> > Not an option :/
>
> Thought you'd say that.
>
> 	David
>
> >
> > Oleg.
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16 12:22               ` Oleg Nesterov
@ 2024-12-16 12:50                 ` Jiri Olsa
  2024-12-16 15:08                   ` David Laight
  0 siblings, 1 reply; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16 12:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Laight, linux-mm@kvack.org, 'Jiri Olsa',
	Peter Zijlstra, Andrii Nakryiko, bpf@vger.kernel.org, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org

On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote:
> OK, thanks, I am starting to share your concerns...
> 
> Oleg.
> 
> On 12/16, David Laight wrote:
> >
> > From: Oleg Nesterov
> > > Sent: 16 December 2024 10:13
> > >
> > > David,
> > >
> > > let me say first that my understanding of this magic is very limited,
> > > please correct me.
> >
> > I only (half) understand what the 'magic' has to accomplish and
> > some of the pitfalls.
> >
> > I've copied linux-mm - someone there might know more.
> >
> > > On 12/16, David Laight wrote:
> > > >
> > > > It all depends on how hard __replace_page() tries to be atomic.
> > > > The page has to change from one backed by the executable to a private
> > > > one backed by swap - otherwise you can't write to it.
> > >
> > > This is what uprobe_write_opcode() does,
> >
> > And will be enough for single byte changes - they'll be picked up
> > at some point after the change.
> >
> > > > But the problems arise when the instruction prefetch unit has read
> > > > part of the 5-byte instruction (it might even only read half a cache
> > > > line at a time).
> > > > I'm not sure how long the pipeline can sit in that state - but I
> > > > can do a memory read of a PCIe address that takes ~3000 clocks.
> > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
> > > >
> > > > So I think you need to force an interrupt while the PTE is invalid.
> > > > And that need to be simultaneous on all cpu running that process.
> > >
> > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().
> > >
> > > That's not enough?
> >
> > I doubt it. As I understand it.
> > The hardware page tables will be shared by all the threads of a process.
> > So unless you hard synchronise all the cpu (and flush the TLB) while the
> > PTE is being changed there is always the possibility of a cpu picking up
> > the new PTE before the IPI that (I presume) flush_tlb_page() generates
> > is processed.
> > If that happens when the instruction you are patching is part-read into
> > the instruction decode buffer then you'll execute a mismatch of the two
> > instructions.

if the 5-byte update turns out to be a problem, I guess we could work around
it with partial updates using int3, like we do in text_poke_bp_batch
(rough sketch below)?

  to change the nop5 instruction to 'call xxx':
  - write int3 to the first byte of the nop5 instruction
  - have poke_int3_handler emulate the nop5 if the int3 is triggered
  - write the rest of the call instruction over the last 4 bytes of the nop5
  - overwrite the first byte of the nop5 with the call opcode

a similar staged update would handle the 'call xxx' -> 'nop5' direction
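
the sketch below is only meant to illustrate the staging; write_probe_bytes()
and sync_cores() are made-up placeholders, not existing uprobe APIs (the real
thing would have to go through uprobe_write_opcode()/__replace_page() and
whatever CPU serialization we end up needing):

  /*
   * Sketch only, modeled on text_poke_bp_batch(); helper names are
   * placeholders for whatever write/serialize mechanism uprobes would use.
   */
  static void nop5_to_call(void *ip, const u8 call_insn[5])
  {
          const u8 int3 = 0xcc;

          /* 1) trap the first byte; any CPU hitting the site while it is
           *    inconsistent ends up in the int3 handler, which emulates
           *    the original nop5 */
          write_probe_bytes(ip, &int3, 1);
          sync_cores();

          /* 2) with int3 in place the tail bytes can no longer be decoded
           *    as part of the old nop5, so fill in the call displacement */
          write_probe_bytes(ip + 1, call_insn + 1, 4);
          sync_cores();

          /* 3) finally replace the int3 with the call opcode (0xe8) */
          write_probe_bytes(ip, call_insn, 1);
          sync_cores();
  }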

thanks,
jirka

> >
> > I can't remember the outcome of discussions about live-patching kernel
> > code - and I'm sure that was aligned 32bit writes.
> >
> > >
> > > > Stopping the process using ptrace would do it.
> > >
> > > Not an option :/
> >
> > Thought you'd say that.
> >
> > 	David
> >
> > >
> > > Oleg.
> >
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16 12:50                 ` Jiri Olsa
@ 2024-12-16 15:08                   ` David Laight
  2024-12-16 16:06                     ` Jiri Olsa
  0 siblings, 1 reply; 63+ messages in thread
From: David Laight @ 2024-12-16 15:08 UTC (permalink / raw)
  To: 'Jiri Olsa', Oleg Nesterov
  Cc: linux-mm@kvack.org, Peter Zijlstra, Andrii Nakryiko,
	bpf@vger.kernel.org, Song Liu, Yonghong Song, John Fastabend,
	Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org

From: Jiri Olsa
> Sent: 16 December 2024 12:50
> 
> On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote:
> > OK, thanks, I am starting to share your concerns...
> >
> > Oleg.
> >
> > On 12/16, David Laight wrote:
> > >
> > > From: Oleg Nesterov
> > > > Sent: 16 December 2024 10:13
> > > >
> > > > David,
> > > >
> > > > let me say first that my understanding of this magic is very limited,
> > > > please correct me.
> > >
> > > I only (half) understand what the 'magic' has to accomplish and
> > > some of the pitfalls.
> > >
> > > I've copied linux-mm - someone there might know more.
> > >
> > > > On 12/16, David Laight wrote:
> > > > >
> > > > > It all depends on how hard __replace_page() tries to be atomic.
> > > > > The page has to change from one backed by the executable to a private
> > > > > one backed by swap - otherwise you can't write to it.
> > > >
> > > > This is what uprobe_write_opcode() does,
> > >
> > > And will be enough for single byte changes - they'll be picked up
> > > at some point after the change.
> > >
> > > > > But the problems arise when the instruction prefetch unit has read
> > > > > part of the 5-byte instruction (it might even only read half a cache
> > > > > line at a time).
> > > > > I'm not sure how long the pipeline can sit in that state - but I
> > > > > can do a memory read of a PCIe address that takes ~3000 clocks.
> > > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
> > > > >
> > > > > So I think you need to force an interrupt while the PTE is invalid.
> > > > > And that need to be simultaneous on all cpu running that process.
> > > >
> > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().
> > > >
> > > > That's not enough?
> > >
> > > I doubt it. As I understand it.
> > > The hardware page tables will be shared by all the threads of a process.
> > > So unless you hard synchronise all the cpu (and flush the TLB) while the
> > > PTE is being changed there is always the possibility of a cpu picking up
> > > the new PTE before the IPI that (I presume) flush_tlb_page() generates
> > > is processed.
> > > If that happens when the instruction you are patching is part-read into
> > > the instruction decode buffer then you'll execute a mismatch of the two
> > > instructions.
> 
> if 5 byte update would be a problem, I guess we could workaround that through
> partial updates using int3 like we do in text_poke_bp_batch?
> 
>   - changing nop5 instruction to 'call xxx'
>   - write int3 to first byte of nop5 instruction
>   - have poke_int3_handler to emulate nop5 if int3 is triggered
>   - write rest of the call instruction to nop5 last 4 bytes
>   - overwrite first byte of nop5 with call opcode

That might work provided there is an IPI (to flush the decode pipeline)
after the write of the 'int3' and another one before the write of the 'call'.
You'll need to ensure the I-cache gets invalidated as well.

And if the sequence crosses a page boundary....

	David

> 
> similar update from 'call xxx' -> 'nop5'
> 
> thanks,
> jirka
> 
> > >
> > > I can't remember the outcome of discussions about live-patching kernel
> > > code - and I'm sure that was aligned 32bit writes.
> > >
> > > >
> > > > > Stopping the process using ptrace would do it.
> > > >
> > > > Not an option :/
> > >
> > > Thought you'd say that.
> > >
> > > 	David
> > >
> > > >
> > > > Oleg.
> > >
> >



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes
  2024-12-16 15:08                   ` David Laight
@ 2024-12-16 16:06                     ` Jiri Olsa
  0 siblings, 0 replies; 63+ messages in thread
From: Jiri Olsa @ 2024-12-16 16:06 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jiri Olsa', Oleg Nesterov, linux-mm@kvack.org,
	Peter Zijlstra, Andrii Nakryiko, bpf@vger.kernel.org, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org

On Mon, Dec 16, 2024 at 03:08:14PM +0000, David Laight wrote:
> From: Jiri Olsa
> > Sent: 16 December 2024 12:50
> > 
> > On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote:
> > > OK, thanks, I am starting to share your concerns...
> > >
> > > Oleg.
> > >
> > > On 12/16, David Laight wrote:
> > > >
> > > > From: Oleg Nesterov
> > > > > Sent: 16 December 2024 10:13
> > > > >
> > > > > David,
> > > > >
> > > > > let me say first that my understanding of this magic is very limited,
> > > > > please correct me.
> > > >
> > > > I only (half) understand what the 'magic' has to accomplish and
> > > > some of the pitfalls.
> > > >
> > > > I've copied linux-mm - someone there might know more.
> > > >
> > > > > On 12/16, David Laight wrote:
> > > > > >
> > > > > > It all depends on how hard __replace_page() tries to be atomic.
> > > > > > The page has to change from one backed by the executable to a private
> > > > > > one backed by swap - otherwise you can't write to it.
> > > > >
> > > > > This is what uprobe_write_opcode() does,
> > > >
> > > > And will be enough for single byte changes - they'll be picked up
> > > > at some point after the change.
> > > >
> > > > > > But the problems arise when the instruction prefetch unit has read
> > > > > > part of the 5-byte instruction (it might even only read half a cache
> > > > > > line at a time).
> > > > > > I'm not sure how long the pipeline can sit in that state - but I
> > > > > > can do a memory read of a PCIe address that takes ~3000 clocks.
> > > > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.)
> > > > > >
> > > > > > So I think you need to force an interrupt while the PTE is invalid.
> > > > > > And that need to be simultaneous on all cpu running that process.
> > > > >
> > > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page().
> > > > >
> > > > > That's not enough?
> > > >
> > > > I doubt it. As I understand it.
> > > > The hardware page tables will be shared by all the threads of a process.
> > > > So unless you hard synchronise all the cpu (and flush the TLB) while the
> > > > PTE is being changed there is always the possibility of a cpu picking up
> > > > the new PTE before the IPI that (I presume) flush_tlb_page() generates
> > > > is processed.
> > > > If that happens when the instruction you are patching is part-read into
> > > > the instruction decode buffer then you'll execute a mismatch of the two
> > > > instructions.
> > 
> > if 5 byte update would be a problem, I guess we could workaround that through
> > partial updates using int3 like we do in text_poke_bp_batch?
> > 
> >   - changing nop5 instruction to 'call xxx'
> >   - write int3 to first byte of nop5 instruction
> >   - have poke_int3_handler to emulate nop5 if int3 is triggered
> >   - write rest of the call instruction to nop5 last 4 bytes
> >   - overwrite first byte of nop5 with call opcode
> 
> That might work provided there are IPI (to flush the decode pipeline)
> after the write of the 'int3' and one before the write of the 'call'.
> You'll need to ensure the I-cache gets invalidated as well.

ok, seems to be done by text_poke_sync
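
fwiw text_poke_sync() in arch/x86/kernel/alternative.c is roughly just an
IPI to all CPUs, each executing a serializing sync_core():

  static void do_sync_core(void *info)
  {
          sync_core();
  }

  void text_poke_sync(void)
  {
          on_each_cpu(do_sync_core, NULL, 1);
  }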

> 
> And if the sequence crosses a page boundary....

that was already a limitation of the current change
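
i.e. the optimization would only be attempted when all 5 bytes of the probed
instruction sit within one page, something like this (illustrative check
only, not the patch's actual helper):

  static bool nop5_within_page(unsigned long vaddr)
  {
          /* bytes vaddr .. vaddr+4 must not cross into the next page */
          return (vaddr & ~PAGE_MASK) <= PAGE_SIZE - 5;
  }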

jirka

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes
  2024-12-16  8:32     ` Jiri Olsa
@ 2024-12-16 23:06       ` Andrii Nakryiko
  0 siblings, 0 replies; 63+ messages in thread
From: Andrii Nakryiko @ 2024-12-16 23:06 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, Song Liu,
	Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
	Masami Hiramatsu, Alan Maguire, linux-kernel, linux-trace-kernel

On Mon, Dec 16, 2024 at 12:32 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 01:58:33PM -0800, Andrii Nakryiko wrote:
> > On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Use a 5-byte nop for x86 usdt probes so we can switch them
> > > to optimized uprobes.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> >
> > This change will make it impossible to run the latest selftests on older
> > kernels. Let's do what we did with -DENABLE_ATOMICS_TESTS and allow
> > disabling this through the Makefile, ok?
>
> ok, I wanted to start addressing this after this version
>
> so the problem is using this macro with nop5 in an application running
> on older kernels that do not have the nop5 emulation and the uprobe
> syscall optimization, because uprobe/usdt on top of nop5 (single-stepped)
> will be slower than on top of the current nop1 (emulated)
>
> AFAICS selftests should still work, just a bit slower due to the missing nop5 emulation

ah, fair enough, it won't break anything, true

>
> one part of the solution would be to backport [1] to stable kernels
> which is an easy fix (even though it needs changes now)
>
> if that's not enough we'd need to come up with that nop1/nop5 macro
> solution, where tooling (libbpf with extra data in the usdt note) would
> install the uprobe on top of nop1 on older kernels and on top of nop5 on
> new ones... but that'd need more work of course

yes, I think that's the way we should go, but as you said, that's
outside of the kernel and we should work on that as a follow-up to this
series
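
and if we ever do want a build-time escape hatch for old kernels in the
meantime, it could be as small as the following on the sdt.h side (the
SDT_USE_NOP5 knob name is just illustrative, nothing that exists today):

  /* keep the 1-byte nop by default, opt in to the 5-byte nop from patch 9 */
  #if defined(__x86_64__) && defined(SDT_USE_NOP5)
  # define _SDT_DEF_NOP _SDT_ASM_5(990:  .byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
  #else
  # define _SDT_DEF_NOP _SDT_ASM_1(990:  _SDT_NOP)
  #endif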

>
> jirka
>
>
> [1] patch#7 - uprobes/x86: Add support to emulate nop5 instruction
>
>
> >
> >
> > > diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
> > > index ca0162b4dc57..7ac9291f45f1 100644
> > > --- a/tools/testing/selftests/bpf/sdt.h
> > > +++ b/tools/testing/selftests/bpf/sdt.h
> > > @@ -234,6 +234,13 @@ __extension__ extern unsigned long long __sdt_unsp;
> > >  #define _SDT_NOP       nop
> > >  #endif
> > >
> > > +/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
> > > +#if defined(__x86_64__)
> > > +# define _SDT_DEF_NOP _SDT_ASM_5(990:  .byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
> > > +#else
> > > +# define _SDT_DEF_NOP _SDT_ASM_1(990:  _SDT_NOP)
> > > +#endif
> > > +
> > >  #define _SDT_NOTE_NAME "stapsdt"
> > >  #define _SDT_NOTE_TYPE 3
> > >
> > > @@ -286,7 +293,7 @@ __extension__ extern unsigned long long __sdt_unsp;
> > >
> > >  #define _SDT_ASM_BODY(provider, name, pack_args, args, ...)                  \
> > >    _SDT_DEF_MACROS                                                            \
> > > -  _SDT_ASM_1(990:      _SDT_NOP)                                             \
> > > +  _SDT_DEF_NOP                                                               \
> > >    _SDT_ASM_3(          .pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
> > >    _SDT_ASM_1(          .balign 4)                                            \
> > >    _SDT_ASM_3(          .4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE)          \
> > > --
> > > 2.47.0
> > >

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2024-12-16 23:06 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-11 13:33 [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2024-12-11 13:33 ` [PATCH bpf-next 01/13] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2024-12-13  0:42   ` Andrii Nakryiko
2024-12-11 13:33 ` [PATCH bpf-next 02/13] uprobes: Make copy_from_page global Jiri Olsa
2024-12-13  0:43   ` Andrii Nakryiko
2024-12-11 13:33 ` [PATCH bpf-next 03/13] uprobes: Add nbytes argument to uprobe_write_opcode Jiri Olsa
2024-12-13  0:45   ` Andrii Nakryiko
2024-12-11 13:33 ` [PATCH bpf-next 04/13] uprobes: Add arch_uprobe_verify_opcode function Jiri Olsa
2024-12-13  0:48   ` Andrii Nakryiko
2024-12-13 13:21     ` Jiri Olsa
2024-12-13 21:11       ` Andrii Nakryiko
2024-12-13 21:52         ` Jiri Olsa
2024-12-11 13:33 ` [PATCH bpf-next 05/13] uprobes: Add mapping for optimized uprobe trampolines Jiri Olsa
2024-12-13  1:01   ` Andrii Nakryiko
2024-12-13 13:42     ` Jiri Olsa
2024-12-13 21:58       ` Andrii Nakryiko
2024-12-11 13:33 ` [PATCH bpf-next 06/13] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2024-12-13 13:48   ` Thomas Weißschuh
2024-12-13 14:51     ` Jiri Olsa
2024-12-13 15:12       ` Thomas Weißschuh
2024-12-13 21:52         ` Jiri Olsa
2024-12-14 13:21           ` Thomas Weißschuh
2024-12-16  8:03             ` Jiri Olsa
2024-12-11 13:33 ` [PATCH bpf-next 07/13] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
2024-12-13 10:45   ` Peter Zijlstra
2024-12-13 13:02     ` Jiri Olsa
2024-12-11 13:33 ` [PATCH bpf-next 08/13] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2024-12-13 10:49   ` Peter Zijlstra
2024-12-13 13:06     ` Jiri Olsa
2024-12-13 21:58   ` Andrii Nakryiko
2024-12-15 12:06   ` David Laight
2024-12-15 14:14     ` Oleg Nesterov
2024-12-16  8:08       ` Jiri Olsa
2024-12-16  9:18         ` David Laight
2024-12-16 10:12           ` Oleg Nesterov
2024-12-16 11:10             ` David Laight
2024-12-16 12:22               ` Oleg Nesterov
2024-12-16 12:50                 ` Jiri Olsa
2024-12-16 15:08                   ` David Laight
2024-12-16 16:06                     ` Jiri Olsa
2024-12-11 13:33 ` [PATCH bpf-next 09/13] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
2024-12-13 21:58   ` Andrii Nakryiko
2024-12-16  8:32     ` Jiri Olsa
2024-12-16 23:06       ` Andrii Nakryiko
2024-12-11 13:33 ` [PATCH bpf-next 10/13] selftests/bpf: Add uprobe/usdt optimized test Jiri Olsa
2024-12-13 21:58   ` Andrii Nakryiko
2024-12-16  7:58     ` Jiri Olsa
2024-12-11 13:34 ` [PATCH bpf-next 11/13] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2024-12-13 21:58   ` Andrii Nakryiko
2024-12-16  7:59     ` Jiri Olsa
2024-12-11 13:34 ` [PATCH bpf-next 12/13] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
2024-12-11 13:34 ` [PATCH bpf-next 13/13] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
2024-12-13 21:57   ` Andrii Nakryiko
2024-12-16  7:56     ` Jiri Olsa
2024-12-13  0:43 ` [PATCH bpf-next 00/13] uprobes: Add support to optimize usdt probes on x86_64 Andrii Nakryiko
2024-12-13  9:46   ` Jiri Olsa
2024-12-13 10:51 ` Peter Zijlstra
2024-12-13 13:07   ` Jiri Olsa
2024-12-13 13:54     ` Peter Zijlstra
2024-12-13 14:05       ` Jiri Olsa
2024-12-13 18:39         ` Peter Zijlstra
2024-12-13 21:52           ` Jiri Olsa
2024-12-13 21:59             ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).