* [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64
@ 2025-07-20 11:21 Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
` (21 more replies)
0 siblings, 22 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Alejandro Colomar, Eyal Birger, kees, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
hi,
this patchset adds support to optimize usdt probes on top of a 5-byte
nop instruction.
The generic approach (optimizing all uprobes) is hard, because it means
emulating potentially many different original instructions with all the
related issues. The usdt case, which stores a 5-byte nop, seems much
easier, so starting with that.
The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.
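To make this concrete, here's roughly what a probed USDT site looks like
before and after the optimization (a sketch only; the bytes are
illustrative and the call displacement depends on where the trampoline
page ends up):
  before:  0f 1f 44 00 00   nopl 0x0(%rax,%rax,1)     # 5-byte nop at the probe
  after:   e8 xx xx xx xx   call <uprobes-trampoline>
The trampoline saves rax/rcx/r11, invokes the new uprobe syscall and
returns to the instruction following the original nop5 (details in
patches 9 and 10).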
The run_bench_uprobes.sh benchmark triggers a uprobe (on top of different
original instructions) in a loop and counts how many triggers happen per
second (the unit below is millions per second).
There's a big speed up when you compare the current usdt implementation
(uprobe-nop) with the proposed usdt (uprobe-nop5):
current:
usermode-count : 152.501 ± 0.012M/s
syscall-count : 14.463 ± 0.062M/s
--> uprobe-nop : 3.160 ± 0.005M/s
uprobe-push : 3.003 ± 0.003M/s
uprobe-ret : 1.100 ± 0.003M/s
uprobe-nop5 : 3.132 ± 0.012M/s
uretprobe-nop : 2.103 ± 0.002M/s
uretprobe-push : 2.027 ± 0.004M/s
uretprobe-ret : 0.914 ± 0.002M/s
uretprobe-nop5 : 2.115 ± 0.002M/s
after the change:
usermode-count : 152.343 ± 0.400M/s
syscall-count : 14.851 ± 0.033M/s
uprobe-nop : 3.204 ± 0.005M/s
uprobe-push : 3.040 ± 0.005M/s
uprobe-ret : 1.098 ± 0.003M/s
--> uprobe-nop5 : 7.286 ± 0.017M/s
uretprobe-nop : 2.144 ± 0.001M/s
uretprobe-push : 2.069 ± 0.002M/s
uretprobe-ret : 0.922 ± 0.000M/s
uretprobe-nop5 : 3.487 ± 0.001M/s
I see a bit more speed up on Intel (above) compared to AMD. The big nop5
speed up is partly due to emulating nop5 and partly due to the optimization.
The key speed up we do this for is the USDT switch from nop to nop5:
uprobe-nop : 3.160 ± 0.005M/s
uprobe-nop5 : 7.286 ± 0.017M/s
Changes from v5:
- reworked and unified int3 update code [Peter]
- used struct to read syscall argument [Peter]
- skip trampoline call instruction when returning via iret
- small changes plus comments [Masami]
- added acks [Oleg, Andrii, Masami]
Changes from v4:
- reworked search for trampoline page, dropped Oleg's ack from that patch
because of the change [Masami]
Changes from v3:
- rebased on top of tip/master + bpf-next/master + mm/mm-nonmm-unstable
- reworked patch#8 to look up the trampoline every 4GB so we don't
waste page frames in some cases [Masami]
- several minor fixes [Masami]
- added acks [Oleg, Alejandro, Masami]
Changes from v2:
- rebased on top of tip/master + mm/mm-stable + 1 extra change [1]
- added acks [Oleg,Andrii]
- more detailed changelog for patch 1 [Masami]
- several tests changes [Andrii]
- add explicit PAGE_SIZE low limit to vm_unmapped_area call [Andrii]
This patchset adds a new syscall, so here are notes on the checklist items
in Documentation/process/adding-syscalls.rst:
- System Call Alternatives
New syscall seems like the best way here, because we just need to
enter the kernel quickly with no extra argument processing, which
we would need to do if we decided to reuse another syscall.
- Designing the API: Planning for Extension
The uprobe syscall is very specific and most likely won't be
extended in the future.
- Designing the API: Other Considerations
N/A because uprobe syscall does not return reference to kernel
object.
- Proposing the API
Wiring up of the uprobe system call is in separate change,
selftests and man page changes are part of the patchset.
- Generic System Call Implementation
There's no CONFIG option for the new functionality because it
keeps the same behaviour from the user POV.
- x86 System Call Implementation
It's 64-bit syscall only.
- Compatibility System Calls (Generic)
N/A uprobe syscall has no arguments and is not supported
for compat processes.
- Compatibility System Calls (x86)
N/A uprobe syscall is not supported for compat processes.
- System Calls Returning Elsewhere
N/A.
- Other Details
N/A.
- Testing
Adding new bpf selftests.
- Man Page
Attached.
- Do not call System Calls in the Kernel
N/A
pending todo (or follow ups):
- use PROCMAP_QUERY in tests
- alloc 'struct uprobes_state' for mm_struct only when needed [Andrii]
- use mm_cpumask(vma->vm_mm) in text_poke_sync
thanks,
jirka
Cc: Alejandro Colomar <alx@kernel.org>
Cc: Eyal Birger <eyal.birger@gmail.com>
Cc: kees@kernel.org
[1] https://lore.kernel.org/linux-trace-kernel/20250514101809.2010193-1-jolsa@kernel.org/T/#u
---
Jiri Olsa (21):
uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock
uprobes: Rename arch_uretprobe_trampoline function
uprobes: Make copy_from_page global
uprobes: Add uprobe_write function
uprobes: Add nbytes argument to uprobe_write
uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode
uprobes: Add do_ref_ctr argument to uprobe_write function
uprobes/x86: Add mapping for optimized uprobe trampolines
uprobes/x86: Add uprobe syscall to speed up uprobe
uprobes/x86: Add support to optimize uprobes
selftests/bpf: Import usdt.h from libbpf/usdt project
selftests/bpf: Reorg the uprobe_syscall test function
selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi
selftests/bpf: Add uprobe/usdt syscall tests
selftests/bpf: Add hit/attach/detach race optimized uprobe test
selftests/bpf: Add uprobe syscall sigill signal test
selftests/bpf: Add optimized usdt variant for basic usdt test
selftests/bpf: Add uprobe_regs_equal test
selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
seccomp: passthrough uprobe systemcall without filtering
selftests/seccomp: validate uprobe syscall passes through seccomp
arch/arm/probes/uprobes/core.c | 2 +-
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/uprobes.h | 7 ++
arch/x86/kernel/uprobes.c | 566 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/syscalls.h | 2 +
include/linux/uprobes.h | 20 +++-
kernel/events/uprobes.c | 100 +++++++++++-----
kernel/fork.c | 1 +
kernel/seccomp.c | 32 ++++--
kernel/sys_ni.c | 1 +
tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 518 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
tools/testing/selftests/bpf/prog_tests/usdt.c | 38 ++++---
tools/testing/selftests/bpf/progs/uprobe_syscall.c | 4 +-
tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c | 60 +++++++++-
tools/testing/selftests/bpf/test_kmods/bpf_testmod.c | 11 +-
tools/testing/selftests/bpf/usdt.h | 545 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/seccomp/seccomp_bpf.c | 107 ++++++++++++++----
17 files changed, 1903 insertions(+), 112 deletions(-)
create mode 100644 tools/testing/selftests/bpf/usdt.h
Jiri Olsa (1):
man2: Add uprobe syscall page
man/man2/uprobe.2 | 1 +
man/man2/uretprobe.2 | 36 ++++++++++++++++++++++++------------
2 files changed, 25 insertions(+), 12 deletions(-)
create mode 100644 man/man2/uprobe.2
* [PATCHv6 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 02/22] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
` (20 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
Currently unapply_uprobe takes mmap_read_lock, but it might call
remove_breakpoint which eventually changes user pages.
The current code writes either the breakpoint or the original instruction,
so it can get away with the read lock, as explained in [1]. But with the
upcoming change that writes multiple instructions to the probed address
we need to ensure that any update to the mm's pages is exclusive.
[1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
kernel/events/uprobes.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84ee7b590861..257581432cd8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -483,7 +483,7 @@ static int __uprobe_write_opcode(struct vm_area_struct *vma,
* @opcode_vaddr: the virtual address to store the opcode.
* @opcode: opcode to be written at @opcode_vaddr.
*
- * Called with mm->mmap_lock held for read or write.
+ * Called with mm->mmap_lock held for write.
* Return 0 (success) or a negative errno.
*/
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
@@ -1464,7 +1464,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
struct vm_area_struct *vma;
int err = 0;
- mmap_read_lock(mm);
+ mmap_write_lock(mm);
for_each_vma(vmi, vma) {
unsigned long vaddr;
loff_t offset;
@@ -1481,7 +1481,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
vaddr = offset_to_vaddr(vma, uprobe->offset);
err |= remove_breakpoint(uprobe, vma, vaddr);
}
- mmap_read_unlock(mm);
+ mmap_write_unlock(mm);
return err;
}
--
2.50.1
* [PATCHv6 perf/core 02/22] uprobes: Rename arch_uretprobe_trampoline function
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 03/22] uprobes: Make copy_from_page global Jiri Olsa
` (19 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
We are about to add an uprobe trampoline, so cleaning up the namespace.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 2 +-
include/linux/uprobes.h | 2 +-
kernel/events/uprobes.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6d383839e839..77050e5a4680 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -338,7 +338,7 @@ extern u8 uretprobe_trampoline_entry[];
extern u8 uretprobe_trampoline_end[];
extern u8 uretprobe_syscall_check[];
-void *arch_uprobe_trampoline(unsigned long *psize)
+void *arch_uretprobe_trampoline(unsigned long *psize)
{
static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
struct pt_regs *regs = task_pt_regs(current);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 516217c39094..01112f27cd21 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -224,7 +224,7 @@ extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
extern void uprobe_handle_trampoline(struct pt_regs *regs);
-extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 257581432cd8..4e8e607abda8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1727,7 +1727,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
return ret;
}
-void * __weak arch_uprobe_trampoline(unsigned long *psize)
+void * __weak arch_uretprobe_trampoline(unsigned long *psize)
{
static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
@@ -1759,7 +1759,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
init_waitqueue_head(&area->wq);
/* Reserve the 1st slot for get_trampoline_vaddr() */
set_bit(0, area->bitmap);
- insns = arch_uprobe_trampoline(&insns_size);
+ insns = arch_uretprobe_trampoline(&insns_size);
arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
if (!xol_add_vma(mm, area))
--
2.50.1
* [PATCHv6 perf/core 03/22] uprobes: Make copy_from_page global
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 02/22] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 04/22] uprobes: Add uprobe_write function Jiri Olsa
` (18 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Making copy_from_page global and adding the uprobe prefix
(uprobe_copy_from_page), so it can be used from arch code in the
following changes.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 10 +++++-----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 01112f27cd21..7447e15559b8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -226,6 +226,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
extern void uprobe_handle_trampoline(struct pt_regs *regs);
extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
+extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4e8e607abda8..37d3a3f6e48a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -177,7 +177,7 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
return is_swbp_insn(insn);
}
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
{
void *kaddr = kmap_atomic(page);
memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
@@ -205,7 +205,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
* is a trap variant; uprobes always wins over any other (gdb)
* breakpoint.
*/
- copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
is_swbp = is_swbp_insn(&old_opcode);
if (is_swbp_insn(new_opcode)) {
@@ -1052,7 +1052,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
if (IS_ERR(page))
return PTR_ERR(page);
- copy_from_page(page, offset, insn, nbytes);
+ uprobe_copy_from_page(page, offset, insn, nbytes);
put_page(page);
return 0;
@@ -1398,7 +1398,7 @@ struct uprobe *uprobe_register(struct inode *inode,
return ERR_PTR(-EINVAL);
/*
- * This ensures that copy_from_page(), copy_to_page() and
+ * This ensures that uprobe_copy_from_page(), copy_to_page() and
* __update_ref_ctr() can't cross page boundary.
*/
if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
@@ -2394,7 +2394,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
if (result < 0)
return result;
- copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
put_page(page);
out:
/* This needs to return true for any variant of the trap insn */
--
2.50.1
* [PATCHv6 perf/core 04/22] uprobes: Add uprobe_write function
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (2 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 03/22] uprobes: Make copy_from_page global Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 05/22] uprobes: Add nbytes argument to uprobe_write Jiri Olsa
` (17 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
Adding a uprobe_write function that does what uprobe_write_opcode did
so far, but allows passing a verify callback function that checks the
memory location before the opcode is written.
It will be used in following changes to implement specific checking
logic for instruction updates.
uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.
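For illustration, a caller-specific verify callback could look roughly
like this (a sketch only; expected_insn() is a made up helper, the real
checking logic for the optimized case comes in the later x86 patches):
  static int verify_expected(struct page *page, unsigned long vaddr,
                             uprobe_opcode_t *opcode)
  {
          uprobe_opcode_t old;

          uprobe_copy_from_page(page, vaddr, &old, UPROBE_SWBP_INSN_SIZE);
          /* return > 0 to go ahead with the write, <= 0 to skip or fail it */
          return old == expected_insn(vaddr) ? 1 : -EINVAL;
  }
  ...
          err = uprobe_write(auprobe, vma, vaddr, UPROBE_SWBP_INSN, verify_expected);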
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 5 +++++
kernel/events/uprobes.c | 14 ++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 7447e15559b8..e13382054435 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -187,6 +187,9 @@ struct uprobes_state {
struct xol_area *xol_area;
};
+typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
+ uprobe_opcode_t *opcode);
+
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
@@ -195,6 +198,8 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
+ uprobe_opcode_t opcode, uprobe_write_verify_t verify);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 37d3a3f6e48a..777de9b95dd7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -399,7 +399,7 @@ static bool orig_page_is_identical(struct vm_area_struct *vma,
return identical;
}
-static int __uprobe_write_opcode(struct vm_area_struct *vma,
+static int __uprobe_write(struct vm_area_struct *vma,
struct folio_walk *fw, struct folio *folio,
unsigned long opcode_vaddr, uprobe_opcode_t opcode)
{
@@ -488,6 +488,12 @@ static int __uprobe_write_opcode(struct vm_area_struct *vma,
*/
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
const unsigned long opcode_vaddr, uprobe_opcode_t opcode)
+{
+ return uprobe_write(auprobe, vma, opcode_vaddr, opcode, verify_opcode);
+}
+
+int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ const unsigned long opcode_vaddr, uprobe_opcode_t opcode, uprobe_write_verify_t verify)
{
const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
struct mm_struct *mm = vma->vm_mm;
@@ -510,7 +516,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
* page that we can safely modify. Use FOLL_WRITE to trigger a write
* fault if required. When unregistering, we might be lucky and the
* anon page is already gone. So defer write faults until really
- * required. Use FOLL_SPLIT_PMD, because __uprobe_write_opcode()
+ * required. Use FOLL_SPLIT_PMD, because __uprobe_write()
* cannot deal with PMDs yet.
*/
if (is_register)
@@ -522,7 +528,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
goto out;
folio = page_folio(page);
- ret = verify_opcode(page, opcode_vaddr, &opcode);
+ ret = verify(page, opcode_vaddr, &opcode);
if (ret <= 0) {
folio_put(folio);
goto out;
@@ -561,7 +567,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Walk the page tables again, to perform the actual update. */
if (folio_walk_start(&fw, vma, vaddr, 0)) {
if (fw.page == page)
- ret = __uprobe_write_opcode(vma, &fw, folio, opcode_vaddr, opcode);
+ ret = __uprobe_write(vma, &fw, folio, opcode_vaddr, opcode);
folio_walk_end(&fw, vma);
}
--
2.50.1
* [PATCHv6 perf/core 05/22] uprobes: Add nbytes argument to uprobe_write
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (3 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 04/22] uprobes: Add uprobe_write function Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 06/22] uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode Jiri Olsa
` (16 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
Adding an nbytes argument to uprobe_write and related functions as
preparation for writing whole instructions in following changes.
Also renaming the opcode arguments to insn, which seems to fit better.
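A multi-byte write then looks roughly like this (a sketch only; the
verify_expected callback is the illustrative one from the note on the
previous patch, whose signature also gains the nbytes parameter here):
  uprobe_opcode_t nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

  err = uprobe_write(auprobe, vma, vaddr, nop5, sizeof(nop5), verify_expected);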
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 4 ++--
kernel/events/uprobes.c | 26 ++++++++++++++------------
2 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e13382054435..147c4a0a1af9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -188,7 +188,7 @@ struct uprobes_state {
};
typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
- uprobe_opcode_t *opcode);
+ uprobe_opcode_t *insn, int nbytes);
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
@@ -199,7 +199,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t);
extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
- uprobe_opcode_t opcode, uprobe_write_verify_t verify);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 777de9b95dd7..f7feb7417a2c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -191,7 +191,8 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
kunmap_atomic(kaddr);
}
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *insn,
+ int nbytes)
{
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -208,7 +209,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
is_swbp = is_swbp_insn(&old_opcode);
- if (is_swbp_insn(new_opcode)) {
+ if (is_swbp_insn(insn)) {
if (is_swbp) /* register: already installed? */
return 0;
} else {
@@ -401,10 +402,10 @@ static bool orig_page_is_identical(struct vm_area_struct *vma,
static int __uprobe_write(struct vm_area_struct *vma,
struct folio_walk *fw, struct folio *folio,
- unsigned long opcode_vaddr, uprobe_opcode_t opcode)
+ unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes)
{
- const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
- const bool is_register = !!is_swbp_insn(&opcode);
+ const unsigned long vaddr = insn_vaddr & PAGE_MASK;
+ const bool is_register = !!is_swbp_insn(insn);
bool pmd_mappable;
/* For now, we'll only handle PTE-mapped folios. */
@@ -429,7 +430,7 @@ static int __uprobe_write(struct vm_area_struct *vma,
*/
flush_cache_page(vma, vaddr, pte_pfn(fw->pte));
fw->pte = ptep_clear_flush(vma, vaddr, fw->ptep);
- copy_to_page(fw->page, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ copy_to_page(fw->page, insn_vaddr, insn, nbytes);
/*
* When unregistering, we may only zap a PTE if uffd is disabled and
@@ -489,13 +490,14 @@ static int __uprobe_write(struct vm_area_struct *vma,
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
const unsigned long opcode_vaddr, uprobe_opcode_t opcode)
{
- return uprobe_write(auprobe, vma, opcode_vaddr, opcode, verify_opcode);
+ return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
}
int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- const unsigned long opcode_vaddr, uprobe_opcode_t opcode, uprobe_write_verify_t verify)
+ const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
+ uprobe_write_verify_t verify)
{
- const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
+ const unsigned long vaddr = insn_vaddr & PAGE_MASK;
struct mm_struct *mm = vma->vm_mm;
struct uprobe *uprobe;
int ret, is_register, ref_ctr_updated = 0;
@@ -505,7 +507,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
struct folio *folio;
struct page *page;
- is_register = is_swbp_insn(&opcode);
+ is_register = is_swbp_insn(insn);
uprobe = container_of(auprobe, struct uprobe, arch);
if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
@@ -528,7 +530,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
goto out;
folio = page_folio(page);
- ret = verify(page, opcode_vaddr, &opcode);
+ ret = verify(page, insn_vaddr, insn, nbytes);
if (ret <= 0) {
folio_put(folio);
goto out;
@@ -567,7 +569,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Walk the page tables again, to perform the actual update. */
if (folio_walk_start(&fw, vma, vaddr, 0)) {
if (fw.page == page)
- ret = __uprobe_write(vma, &fw, folio, opcode_vaddr, opcode);
+ ret = __uprobe_write(vma, &fw, folio, insn_vaddr, insn, nbytes);
folio_walk_end(&fw, vma);
}
--
2.50.1
* [PATCHv6 perf/core 06/22] uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (4 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 05/22] uprobes: Add nbytes argument to uprobe_write Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 07/22] uprobes: Add do_ref_ctr argument to uprobe_write function Jiri Olsa
` (15 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
uprobe_write has a special path to restore the original page when we
write the original instruction back. This happens when uprobe_write
detects that we want to write anything other than the breakpoint
instruction.
Moving that detection out of uprobe_write and passing it in as the
is_register argument, so it's possible to write instructions other
than just the breakpoint and the original one.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/arm/probes/uprobes/core.c | 2 +-
include/linux/uprobes.h | 5 +++--
kernel/events/uprobes.c | 21 +++++++++++----------
3 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index 885e0c5e8c20..3d96fb41d624 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -30,7 +30,7 @@ int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
return uprobe_write_opcode(auprobe, vma, vaddr,
- __opcode_to_mem_arm(auprobe->bpinsn));
+ __opcode_to_mem_arm(auprobe->bpinsn), true);
}
bool arch_uprobe_ignore(struct arch_uprobe *auprobe, struct pt_regs *regs)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 147c4a0a1af9..518b26756469 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -197,9 +197,10 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
extern bool is_trap_insn(uprobe_opcode_t *insn);
extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t,
+ bool is_register);
extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
- uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f7feb7417a2c..1e5dc3b30707 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -402,10 +402,10 @@ static bool orig_page_is_identical(struct vm_area_struct *vma,
static int __uprobe_write(struct vm_area_struct *vma,
struct folio_walk *fw, struct folio *folio,
- unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes)
+ unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
+ bool is_register)
{
const unsigned long vaddr = insn_vaddr & PAGE_MASK;
- const bool is_register = !!is_swbp_insn(insn);
bool pmd_mappable;
/* For now, we'll only handle PTE-mapped folios. */
@@ -488,26 +488,27 @@ static int __uprobe_write(struct vm_area_struct *vma,
* Return 0 (success) or a negative errno.
*/
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- const unsigned long opcode_vaddr, uprobe_opcode_t opcode)
+ const unsigned long opcode_vaddr, uprobe_opcode_t opcode,
+ bool is_register)
{
- return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
+ return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
+ verify_opcode, is_register);
}
int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
- uprobe_write_verify_t verify)
+ uprobe_write_verify_t verify, bool is_register)
{
const unsigned long vaddr = insn_vaddr & PAGE_MASK;
struct mm_struct *mm = vma->vm_mm;
struct uprobe *uprobe;
- int ret, is_register, ref_ctr_updated = 0;
+ int ret, ref_ctr_updated = 0;
unsigned int gup_flags = FOLL_FORCE;
struct mmu_notifier_range range;
struct folio_walk fw;
struct folio *folio;
struct page *page;
- is_register = is_swbp_insn(insn);
uprobe = container_of(auprobe, struct uprobe, arch);
if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
@@ -569,7 +570,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Walk the page tables again, to perform the actual update. */
if (folio_walk_start(&fw, vma, vaddr, 0)) {
if (fw.page == page)
- ret = __uprobe_write(vma, &fw, folio, insn_vaddr, insn, nbytes);
+ ret = __uprobe_write(vma, &fw, folio, insn_vaddr, insn, nbytes, is_register);
folio_walk_end(&fw, vma);
}
@@ -611,7 +612,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
int __weak set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
- return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN);
+ return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN, true);
}
/**
@@ -627,7 +628,7 @@ int __weak set_orig_insn(struct arch_uprobe *auprobe,
struct vm_area_struct *vma, unsigned long vaddr)
{
return uprobe_write_opcode(auprobe, vma, vaddr,
- *(uprobe_opcode_t *)&auprobe->insn);
+ *(uprobe_opcode_t *)&auprobe->insn, false);
}
/* uprobe should have guaranteed positive refcount */
--
2.50.1
* [PATCHv6 perf/core 07/22] uprobes: Add do_ref_ctr argument to uprobe_write function
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (5 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 06/22] uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
` (14 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
Making the update_ref_ctr call in uprobe_write conditional on the new
do_update_ref_ctr argument. This way we can use uprobe_write for
instruction updates without updating ref_ctr_offset.
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 2 +-
kernel/events/uprobes.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 518b26756469..5080619560d4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -200,7 +200,7 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t,
bool is_register);
extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
- uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1e5dc3b30707..6795b8d82b9c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -492,12 +492,12 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
bool is_register)
{
return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
- verify_opcode, is_register);
+ verify_opcode, is_register, true /* do_update_ref_ctr */);
}
int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
- uprobe_write_verify_t verify, bool is_register)
+ uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr)
{
const unsigned long vaddr = insn_vaddr & PAGE_MASK;
struct mm_struct *mm = vma->vm_mm;
@@ -538,7 +538,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
}
/* We are going to replace instruction, update ref_ctr. */
- if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
+ if (do_update_ref_ctr && !ref_ctr_updated && uprobe->ref_ctr_offset) {
ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
if (ret) {
folio_put(folio);
@@ -590,7 +590,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
out:
/* Revert back reference counter if instruction update failed. */
- if (ret < 0 && ref_ctr_updated)
+ if (do_update_ref_ctr && ret < 0 && ref_ctr_updated)
update_ref_ctr(uprobe, mm, is_register ? -1 : 1);
/* try collapse pmd for compound page */
--
2.50.1
* [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (6 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 07/22] uprobes: Add do_ref_ctr argument to uprobe_write function Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-08-19 14:53 ` Peter Zijlstra
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
` (13 subsequent siblings)
21 siblings, 1 reply; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Masami Hiramatsu (Google), bpf, linux-kernel, linux-trace-kernel,
x86, Song Liu, Yonghong Song, John Fastabend, Hao Luo,
Steven Rostedt, Alan Maguire, David Laight, Thomas Weißschuh,
Ingo Molnar
Adding support to install a special mapping for the user space trampoline
with the following functions:
get_uprobe_trampoline - find or add a uprobe_trampoline
destroy_uprobe_trampoline - remove or destroy a uprobe_trampoline
The user space trampoline is exported as an arch specific user space
special mapping through tramp_mapping, which is initialized in following
changes together with the new uprobe syscall.
The uprobe trampoline needs to be callable/reachable from the probed
address, so while searching for an available address we use the
is_reachable_by_call function to decide whether the trampoline is
callable from the probed address.
All uprobe_trampoline objects are stored in the uprobes_state object and
are cleaned up when the process mm_struct goes down. Adding new arch
hooks for that, because this change is x86_64 specific.
Locking is provided by the callers in following changes.
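The intended usage in the later patches is roughly the following (a
sketch only, assuming the caller holds mmap_write_lock(mm) as per the
locking note above):
  struct uprobe_trampoline *tramp;
  bool new = false;

  tramp = get_uprobe_trampoline(vaddr, &new);
  if (!tramp)
          return;  /* no page reachable within +-2GB of the call at vaddr */
  /* rewrite the nop5 at vaddr into "call tramp->vaddr" (next patches) */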
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 144 ++++++++++++++++++++++++++++++++++++++
include/linux/uprobes.h | 6 ++
kernel/events/uprobes.c | 10 +++
kernel/fork.c | 1 +
4 files changed, 161 insertions(+)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 77050e5a4680..6c4dcbdd0c3c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -608,6 +608,150 @@ static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
*sr = utask->autask.saved_scratch_register;
}
}
+
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+ return -EPERM;
+}
+
+static struct page *tramp_mapping_pages[2] __ro_after_init;
+
+static struct vm_special_mapping tramp_mapping = {
+ .name = "[uprobes-trampoline]",
+ .mremap = tramp_mremap,
+ .pages = tramp_mapping_pages,
+};
+
+struct uprobe_trampoline {
+ struct hlist_node node;
+ unsigned long vaddr;
+};
+
+static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
+{
+ long delta = (long)(vaddr + 5 - vtramp);
+
+ return delta >= INT_MIN && delta <= INT_MAX;
+}
+
+static unsigned long find_nearest_trampoline(unsigned long vaddr)
+{
+ struct vm_unmapped_area_info info = {
+ .length = PAGE_SIZE,
+ .align_mask = ~PAGE_MASK,
+ };
+ unsigned long low_limit, high_limit;
+ unsigned long low_tramp, high_tramp;
+ unsigned long call_end = vaddr + 5;
+
+ if (check_add_overflow(call_end, INT_MIN, &low_limit))
+ low_limit = PAGE_SIZE;
+
+ high_limit = call_end + INT_MAX;
+
+ /* Search up from the caller address. */
+ info.low_limit = call_end;
+ info.high_limit = min(high_limit, TASK_SIZE);
+ high_tramp = vm_unmapped_area(&info);
+
+ /* Search down from the caller address. */
+ info.low_limit = max(low_limit, PAGE_SIZE);
+ info.high_limit = call_end;
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ low_tramp = vm_unmapped_area(&info);
+
+ if (IS_ERR_VALUE(high_tramp) && IS_ERR_VALUE(low_tramp))
+ return -ENOMEM;
+ if (IS_ERR_VALUE(high_tramp))
+ return low_tramp;
+ if (IS_ERR_VALUE(low_tramp))
+ return high_tramp;
+
+ /* Return address that's closest to the caller address. */
+ if (call_end - low_tramp < high_tramp - call_end)
+ return low_tramp;
+ return high_tramp;
+}
+
+static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ struct mm_struct *mm = current->mm;
+ struct uprobe_trampoline *tramp;
+ struct vm_area_struct *vma;
+
+ if (!user_64bit_mode(regs))
+ return NULL;
+
+ vaddr = find_nearest_trampoline(vaddr);
+ if (IS_ERR_VALUE(vaddr))
+ return NULL;
+
+ tramp = kzalloc(sizeof(*tramp), GFP_KERNEL);
+ if (unlikely(!tramp))
+ return NULL;
+
+ tramp->vaddr = vaddr;
+ vma = _install_special_mapping(mm, tramp->vaddr, PAGE_SIZE,
+ VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+ &tramp_mapping);
+ if (IS_ERR(vma)) {
+ kfree(tramp);
+ return NULL;
+ }
+ return tramp;
+}
+
+__maybe_unused
+static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
+{
+ struct uprobes_state *state = ¤t->mm->uprobes_state;
+ struct uprobe_trampoline *tramp = NULL;
+
+ if (vaddr > TASK_SIZE || vaddr < PAGE_SIZE)
+ return NULL;
+
+ hlist_for_each_entry(tramp, &state->head_tramps, node) {
+ if (is_reachable_by_call(tramp->vaddr, vaddr)) {
+ *new = false;
+ return tramp;
+ }
+ }
+
+ tramp = create_uprobe_trampoline(vaddr);
+ if (!tramp)
+ return NULL;
+
+ *new = true;
+ hlist_add_head(&tramp->node, &state->head_tramps);
+ return tramp;
+}
+
+static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
+{
+ /*
+ * We do not unmap and release uprobe trampoline page itself,
+ * because there's no easy way to make sure none of the threads
+ * is still inside the trampoline.
+ */
+ hlist_del(&tramp->node);
+ kfree(tramp);
+}
+
+void arch_uprobe_init_state(struct mm_struct *mm)
+{
+ INIT_HLIST_HEAD(&mm->uprobes_state.head_tramps);
+}
+
+void arch_uprobe_clear_state(struct mm_struct *mm)
+{
+ struct uprobes_state *state = &mm->uprobes_state;
+ struct uprobe_trampoline *tramp;
+ struct hlist_node *n;
+
+ hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
+ destroy_uprobe_trampoline(tramp);
+}
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 5080619560d4..b40d33aae016 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -17,6 +17,7 @@
#include <linux/wait.h>
#include <linux/timer.h>
#include <linux/seqlock.h>
+#include <linux/mutex.h>
struct uprobe;
struct vm_area_struct;
@@ -185,6 +186,9 @@ struct xol_area;
struct uprobes_state {
struct xol_area *xol_area;
+#ifdef CONFIG_X86_64
+ struct hlist_head head_tramps;
+#endif
};
typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
@@ -233,6 +237,8 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
+extern void arch_uprobe_clear_state(struct mm_struct *mm);
+extern void arch_uprobe_init_state(struct mm_struct *mm);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6795b8d82b9c..acec91a676b7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1802,6 +1802,14 @@ static struct xol_area *get_xol_area(void)
return area;
}
+void __weak arch_uprobe_clear_state(struct mm_struct *mm)
+{
+}
+
+void __weak arch_uprobe_init_state(struct mm_struct *mm)
+{
+}
+
/*
* uprobe_clear_state - Free the area allocated for slots.
*/
@@ -1813,6 +1821,8 @@ void uprobe_clear_state(struct mm_struct *mm)
delayed_uprobe_remove(NULL, mm);
mutex_unlock(&delayed_uprobe_lock);
+ arch_uprobe_clear_state(mm);
+
if (!area)
return;
diff --git a/kernel/fork.c b/kernel/fork.c
index 1882d06b9138..78ab2009576c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1009,6 +1009,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
{
#ifdef CONFIG_UPROBES
mm->uprobes_state.xol_area = NULL;
+ arch_uprobe_init_state(mm);
#endif
}
--
2.50.1
* [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (7 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:38 ` Oleg Nesterov
` (2 more replies)
2025-07-20 11:21 ` [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
` (12 subsequent siblings)
21 siblings, 3 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Adding a new uprobe syscall that calls the uprobe handlers for a given
'breakpoint' address.
The idea is that the 'breakpoint' address calls the user space trampoline,
which executes the uprobe syscall.
The syscall handler reads the return address of the initial call to
retrieve the original 'breakpoint' address. With this address we find
the related uprobe object and call its consumers.
The uprobe trampoline mapping is backed with one global page that is
initialized at __init time (arch_uprobes_init) and shared by all the
mapping instances.
We do not allow the uprobe syscall to be executed if the caller is not
from the uprobe trampoline mapping.
The uprobe syscall ensures the consumer (bpf program) sees the register
values in the state they had before the trampoline was called.
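For reference, this is the user stack layout the syscall reads through
struct uprobe_syscall_args (as I read the trampoline code below; offsets
are relative to the user rsp at syscall entry):
  rsp + 0x00   saved rax   (pushed last by the trampoline)
  rsp + 0x08   saved r11
  rsp + 0x10   saved rcx
  rsp + 0x18   return address pushed by the call at the probed site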
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kernel/uprobes.c | 139 +++++++++++++++++++++++++
include/linux/syscalls.h | 2 +
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 17 +++
kernel/sys_ni.c | 1 +
6 files changed, 161 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index cfb5ca41e30d..9fd1291e7bdf 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
333 common io_pgetevents sys_io_pgetevents
334 common rseq sys_rseq
335 common uretprobe sys_uretprobe
+336 common uprobe sys_uprobe
# don't use numbers 387 through 423, add new calls after the last
# 'common' entry
424 common pidfd_send_signal sys_pidfd_send_signal
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c4dcbdd0c3c..d18e1ae59901 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -752,6 +752,145 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
destroy_uprobe_trampoline(tramp);
}
+
+static bool __in_uprobe_trampoline(unsigned long ip)
+{
+ struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+
+ return vma && vma_is_special_mapping(vma, &tramp_mapping);
+}
+
+static bool in_uprobe_trampoline(unsigned long ip)
+{
+ struct mm_struct *mm = current->mm;
+ bool found, retry = true;
+ unsigned int seq;
+
+ rcu_read_lock();
+ if (mmap_lock_speculate_try_begin(mm, &seq)) {
+ found = __in_uprobe_trampoline(ip);
+ retry = mmap_lock_speculate_retry(mm, seq);
+ }
+ rcu_read_unlock();
+
+ if (retry) {
+ mmap_read_lock(mm);
+ found = __in_uprobe_trampoline(ip);
+ mmap_read_unlock(mm);
+ }
+ return found;
+}
+
+/*
+ * See uprobe syscall trampoline; the call to the trampoline will push
+ * the return address on the stack, the trampoline itself then pushes
+ * cx, r11 and ax.
+ */
+struct uprobe_syscall_args {
+ unsigned long ax;
+ unsigned long r11;
+ unsigned long cx;
+ unsigned long retaddr;
+};
+
+SYSCALL_DEFINE0(uprobe)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ struct uprobe_syscall_args args;
+ unsigned long ip, sp;
+ int err;
+
+ /* Allow execution only from uprobe trampolines. */
+ if (!in_uprobe_trampoline(regs->ip))
+ goto sigill;
+
+ err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
+ if (err)
+ goto sigill;
+
+ ip = regs->ip;
+
+ /*
+ * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
+ * - adjust ip to the probe address, call saved next instruction address
+ * - adjust sp to the probe's stack frame (check trampoline code)
+ */
+ regs->ax = args.ax;
+ regs->r11 = args.r11;
+ regs->cx = args.cx;
+ regs->ip = args.retaddr - 5;
+ regs->sp += sizeof(args);
+ regs->orig_ax = -1;
+
+ sp = regs->sp;
+
+ handle_syscall_uprobe(regs, regs->ip);
+
+ /*
+ * Some of the uprobe consumers has changed sp, we can do nothing,
+ * just return via iret.
+ */
+ if (regs->sp != sp) {
+ /* skip the trampoline call */
+ if (args.retaddr - 5 == regs->ip)
+ regs->ip += 5;
+ return regs->ax;
+ }
+
+ regs->sp -= sizeof(args);
+
+ /* for the case uprobe_consumer has changed ax/r11/cx */
+ args.ax = regs->ax;
+ args.r11 = regs->r11;
+ args.cx = regs->cx;
+
+ /* keep return address unless we are instructed otherwise */
+ if (args.retaddr - 5 != regs->ip)
+ args.retaddr = regs->ip;
+
+ regs->ip = ip;
+
+ err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
+ if (err)
+ goto sigill;
+
+ /* ensure sysret, see do_syscall_64() */
+ regs->r11 = regs->flags;
+ regs->cx = regs->ip;
+ return 0;
+
+sigill:
+ force_sig(SIGILL);
+ return -1;
+}
+
+asm (
+ ".pushsection .rodata\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ "uprobe_trampoline_entry:\n"
+ "push %rcx\n"
+ "push %r11\n"
+ "push %rax\n"
+ "movq $" __stringify(__NR_uprobe) ", %rax\n"
+ "syscall\n"
+ "pop %rax\n"
+ "pop %r11\n"
+ "pop %rcx\n"
+ "ret\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ ".popsection\n"
+);
+
+extern u8 uprobe_trampoline_entry[];
+
+static int __init arch_uprobes_init(void)
+{
+ tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
+ return 0;
+}
+
+late_initcall(arch_uprobes_init);
+
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e5603cc91963..b0cc60f1c458 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -998,6 +998,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
asmlinkage long sys_uretprobe(void);
+asmlinkage long sys_uprobe(void);
+
/* pciconfig: alpha, arm, arm64, ia64, sparc */
asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b40d33aae016..b6b077cc7d0f 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -239,6 +239,7 @@ extern unsigned long uprobe_get_trampoline_vaddr(void);
extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
extern void arch_uprobe_clear_state(struct mm_struct *mm);
extern void arch_uprobe_init_state(struct mm_struct *mm);
+extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index acec91a676b7..cbba31c0495f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2772,6 +2772,23 @@ static void handle_swbp(struct pt_regs *regs)
rcu_read_unlock_trace();
}
+void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
+{
+ struct uprobe *uprobe;
+ int is_swbp;
+
+ guard(rcu_tasks_trace)();
+
+ uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
+ if (!uprobe)
+ return;
+ if (!get_utask())
+ return;
+ if (arch_uprobe_ignore(&uprobe->arch, regs))
+ return;
+ handler_chain(uprobe, regs);
+}
+
/*
* Perform required fix-ups and disable singlestep.
* Allow pending signals to take effect.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..bf5d05c635ff 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
COND_SYSCALL(rseq);
COND_SYSCALL(uretprobe);
+COND_SYSCALL(uprobe);
--
2.50.1
* [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (8 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-25 10:13 ` Masami Hiramatsu
2025-08-19 19:15 ` Peter Zijlstra
2025-07-20 11:21 ` [PATCHv6 perf/core 11/22] selftests/bpf: Import usdt.h from libbpf/usdt project Jiri Olsa
` (11 subsequent siblings)
21 siblings, 2 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Putting together all the previously added pieces to support optimized
uprobes on top of a 5-byte nop instruction.
The current uprobe execution goes through the following:
- installs breakpoint instruction over original instruction
- the exception handler is hit and calls related uprobe consumers
- and either simulates original instruction or does out of line single step
execution of it
- returns to user space
The optimized uprobe path does the following:
- checks the original instruction is 5-byte nop (plus other checks)
- adds (or uses existing) user space trampoline with uprobe syscall
- overwrites original instruction (5-byte nop) with call to user space
trampoline
- the user space trampoline executes uprobe syscall that calls related uprobe
consumers
- trampoline returns back to next instruction
This approach won't speed up all uprobes, as it's limited to nop5 as the
original instruction, but we plan to use nop5 as the USDT probe instruction
(which currently uses a single-byte nop) and thus speed up the USDT probes.
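For illustration, a minimal sketch of how a USDT site can emit the 5-byte
nop instead of the default single-byte one, modeled on what the selftests
later in this series do; the group/probe names below are made up:

  /* override the nop usdt.h emits before including it (x86_64 nop5) */
  #define USDT_NOP .byte 0x0f, 0x1f, 0x44, 0x00, 0x00
  #include "usdt.h"

  void traced_function(void)
  {
          /* expands to a single 5-byte nop plus an ELF note, so the
           * uprobe attached here is a candidate for optimization */
          USDT(example_group, example_probe);
  }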
The arch_uprobe_optimize function triggers the uprobe optimization and is
called after the first uprobe hit. I originally had it called on uprobe
installation, but that clashed with the ELF loader, because the user space
trampoline was added in a place where the loader might need to put ELF
segments, so I decided to do it after the first uprobe hit, when loading
is done.
The uprobe is un-optimized in the arch-specific set_orig_insn call.
The instruction overwrite is x86 arch-specific and needs to go through
3 updates (on top of the nop5 instruction):
- write int3 into the 1st byte
- write the last 4 bytes of the call instruction
- update the call instruction opcode
And the cleanup goes through similar stages in reverse:
- overwrite the call opcode with the breakpoint (int3)
- write the last 4 bytes of the nop5 instruction
- write the first byte of the nop5 instruction
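For reference, a minimal sketch (not part of the patch) of how the final
5-byte call that replaces the nop5 is encoded; the kernel uses
__text_gen_insn for this, the helper below is purely illustrative:

  #include <stdint.h>
  #include <string.h>

  /* Encode "call rel32" targeting the trampoline. The displacement is
   * relative to the address right after the 5-byte instruction, and it
   * assumes the trampoline is within +-2GB, which the per-4GB trampoline
   * placement guarantees. */
  static void encode_call_to_trampoline(uint8_t insn[5], uint64_t vaddr,
                                        uint64_t tramp)
  {
          int32_t rel32 = (int32_t)(tramp - (vaddr + 5));

          insn[0] = 0xe8;               /* CALL rel32 opcode */
          memcpy(&insn[1], &rel32, 4);  /* little-endian displacement */
  }

The selftests added later in the series reverse this computation to verify
the patched site: tramp = call_address + 5 + rel32.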
We do not unmap and release the uprobe trampoline when it's no longer
needed, because there's no easy way to make sure none of the threads is
still inside the trampoline. But we do not waste memory, because there's
just a single page for all the uprobe trampoline mappings.
We do waste a page frame on the mapping for every 4GB by keeping the
uprobe trampoline page mapped, but that seems ok.
We benefit from the fact that set_swbp and set_orig_insn are called under
mmap_write_lock(mm), so we can use the current instruction as the state
the uprobe is in - nop5/breakpoint/call trampoline - and decide the needed
action (optimize/un-optimize) based on that.
Attaching the speedup numbers from the benchs/run_bench_uprobes.sh script:
current:
usermode-count : 152.604 ± 0.044M/s
syscall-count : 13.359 ± 0.042M/s
--> uprobe-nop : 3.229 ± 0.002M/s
uprobe-push : 3.086 ± 0.004M/s
uprobe-ret : 1.114 ± 0.004M/s
uprobe-nop5 : 1.121 ± 0.005M/s
uretprobe-nop : 2.145 ± 0.002M/s
uretprobe-push : 2.070 ± 0.001M/s
uretprobe-ret : 0.931 ± 0.001M/s
uretprobe-nop5 : 0.957 ± 0.001M/s
after the change:
usermode-count : 152.448 ± 0.244M/s
syscall-count : 14.321 ± 0.059M/s
uprobe-nop : 3.148 ± 0.007M/s
uprobe-push : 2.976 ± 0.004M/s
uprobe-ret : 1.068 ± 0.003M/s
--> uprobe-nop5 : 7.038 ± 0.007M/s
uretprobe-nop : 2.109 ± 0.004M/s
uretprobe-push : 2.035 ± 0.001M/s
uretprobe-ret : 0.908 ± 0.001M/s
uretprobe-nop5 : 3.377 ± 0.009M/s
I see a bit more speedup on Intel (above) compared to AMD. The big nop5
speedup is partly due to emulating nop5 and partly due to the optimization.
The key speedup we do this for is the USDT switch from nop to nop5:
uprobe-nop : 3.148 ± 0.007M/s
uprobe-nop5 : 7.038 ± 0.007M/s
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/include/asm/uprobes.h | 7 +
arch/x86/kernel/uprobes.c | 283 ++++++++++++++++++++++++++++++++-
include/linux/uprobes.h | 6 +-
kernel/events/uprobes.c | 16 +-
4 files changed, 305 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 678fb546f0a7..1ee2e5115955 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
#define UPROBE_SWBP_INSN 0xcc
#define UPROBE_SWBP_INSN_SIZE 1
+enum {
+ ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
+ ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
+};
+
struct uprobe_xol_ops;
struct arch_uprobe {
@@ -45,6 +50,8 @@ struct arch_uprobe {
u8 ilen;
} push;
};
+
+ unsigned long flags;
};
struct arch_uprobe_task {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index d18e1ae59901..209ce74ab93f 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -18,6 +18,7 @@
#include <asm/processor.h>
#include <asm/insn.h>
#include <asm/mmu_context.h>
+#include <asm/nops.h>
/* Post-execution fixups. */
@@ -702,7 +703,6 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
return tramp;
}
-__maybe_unused
static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
{
struct uprobes_state *state = &current->mm->uprobes_state;
@@ -891,6 +891,280 @@ static int __init arch_uprobes_init(void)
late_initcall(arch_uprobes_init);
+enum {
+ EXPECT_SWBP,
+ EXPECT_CALL,
+};
+
+struct write_opcode_ctx {
+ unsigned long base;
+ int expect;
+};
+
+static int is_call_insn(uprobe_opcode_t *insn)
+{
+ return *insn == CALL_INSN_OPCODE;
+}
+
+/*
+ * Verification callback used by int3_update uprobe_write calls to make sure
+ * the underlying instruction is as expected - either int3 or call.
+ */
+static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
+ int nbytes, void *data)
+{
+ struct write_opcode_ctx *ctx = data;
+ uprobe_opcode_t old_opcode[5];
+
+ uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+
+ switch (ctx->expect) {
+ case EXPECT_SWBP:
+ if (is_swbp_insn(&old_opcode[0]))
+ return 1;
+ break;
+ case EXPECT_CALL:
+ if (is_call_insn(&old_opcode[0]))
+ return 1;
+ break;
+ }
+
+ return -1;
+}
+
+/*
+ * Modify multi-byte instructions by using INT3 breakpoints on SMP.
+ * We completely avoid using stop_machine() here, and achieve the
+ * synchronization using INT3 breakpoints and SMP cross-calls.
+ * (borrowed comment from smp_text_poke_batch_finish)
+ *
+ * The way it is done:
+ * - Add an INT3 trap to the address that will be patched
+ * - SMP sync all CPUs
+ * - Update all but the first byte of the patched range
+ * - SMP sync all CPUs
+ * - Replace the first byte (INT3) by the first byte of the replacing opcode
+ * - SMP sync all CPUs
+ */
+static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr, char *insn, bool optimize)
+{
+ uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
+ struct write_opcode_ctx ctx = {
+ .base = vaddr,
+ };
+ int err;
+
+ /*
+ * Write int3 trap.
+ *
+ * The swbp_optimize path comes with breakpoint already installed,
+ * so we can skip this step for optimize == true.
+ */
+ if (!optimize) {
+ ctx.expect = EXPECT_CALL;
+ err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
+ true /* is_register */, false /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+ }
+
+ smp_text_poke_sync_each_cpu();
+
+ /* Write all but the first byte of the patched range. */
+ ctx.expect = EXPECT_SWBP;
+ err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+ true /* is_register */, false /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+
+ smp_text_poke_sync_each_cpu();
+
+ /*
+ * Write first byte.
+ *
+ * The swbp_unoptimize needs to finish uprobe removal together
+ * with ref_ctr update, using uprobe_write with proper flags.
+ */
+ err = uprobe_write(auprobe, vma, vaddr, insn, 1, verify_insn,
+ optimize /* is_register */, !optimize /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+
+ smp_text_poke_sync_each_cpu();
+ return 0;
+}
+
+static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr, unsigned long tramp)
+{
+ u8 call[5];
+
+ __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
+ (const void *) tramp, CALL_INSN_SIZE);
+ return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+}
+
+static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+}
+
+static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
+{
+ unsigned int gup_flags = FOLL_FORCE|FOLL_SPLIT_PMD;
+ struct vm_area_struct *vma;
+ struct page *page;
+
+ page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ uprobe_copy_from_page(page, vaddr, dst, len);
+ put_page(page);
+ return 0;
+}
+
+static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
+{
+ struct __packed __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } *call = (struct __arch_relative_insn *) insn;
+
+ if (!is_call_insn(insn))
+ return false;
+ return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+}
+
+static int is_optimized(struct mm_struct *mm, unsigned long vaddr, bool *optimized)
+{
+ uprobe_opcode_t insn[5];
+ int err;
+
+ err = copy_from_vaddr(mm, vaddr, &insn, 5);
+ if (err)
+ return err;
+ *optimized = __is_optimized((uprobe_opcode_t *)&insn, vaddr);
+ return 0;
+}
+
+static bool should_optimize(struct arch_uprobe *auprobe)
+{
+ return !test_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags) &&
+ test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+}
+
+int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ if (should_optimize(auprobe)) {
+ bool optimized = false;
+ int err;
+
+ /*
+ * We could race with another thread that already optimized the probe,
+ * so let's not overwrite it with int3 again in this case.
+ */
+ err = is_optimized(vma->vm_mm, vaddr, &optimized);
+ if (err)
+ return err;
+ if (optimized)
+ return 0;
+ }
+ return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN,
+ true /* is_register */);
+}
+
+int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
+ struct mm_struct *mm = vma->vm_mm;
+ bool optimized = false;
+ int err;
+
+ err = is_optimized(mm, vaddr, &optimized);
+ if (err)
+ return err;
+ if (optimized) {
+ err = swbp_unoptimize(auprobe, vma, vaddr);
+ WARN_ON_ONCE(err);
+ return err;
+ }
+ }
+ return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
+ false /* is_register */);
+}
+
+static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
+ unsigned long vaddr)
+{
+ struct uprobe_trampoline *tramp;
+ struct vm_area_struct *vma;
+ bool new = false;
+ int err = 0;
+
+ vma = find_vma(mm, vaddr);
+ if (!vma)
+ return -EINVAL;
+ tramp = get_uprobe_trampoline(vaddr, &new);
+ if (!tramp)
+ return -EINVAL;
+ err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
+ if (WARN_ON_ONCE(err) && new)
+ destroy_uprobe_trampoline(tramp);
+ return err;
+}
+
+void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ struct mm_struct *mm = current->mm;
+ uprobe_opcode_t insn[5];
+
+ /*
+ * Do not optimize if shadow stack is enabled, the return address hijack
+ * code in arch_uretprobe_hijack_return_addr updates wrong frame when
+ * the entry uprobe is optimized and the shadow stack crashes the app.
+ */
+ if (shstk_is_enabled())
+ return;
+
+ if (!should_optimize(auprobe))
+ return;
+
+ mmap_write_lock(mm);
+
+ /*
+ * Check if some other thread already optimized the uprobe for us,
+ * if it's the case just go away silently.
+ */
+ if (copy_from_vaddr(mm, vaddr, &insn, 5))
+ goto unlock;
+ if (!is_swbp_insn((uprobe_opcode_t*) &insn))
+ goto unlock;
+
+ /*
+ * If we fail to optimize the uprobe we set the fail bit so the
+ * above should_optimize will fail from now on.
+ */
+ if (__arch_uprobe_optimize(auprobe, mm, vaddr))
+ set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
+
+unlock:
+ mmap_write_unlock(mm);
+}
+
+static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ if (memcmp(&auprobe->insn, x86_nops[5], 5))
+ return false;
+ /* We can't do cross page atomic writes yet. */
+ return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+}
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
@@ -904,6 +1178,10 @@ static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
{
}
+static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ return false;
+}
#endif /* CONFIG_X86_64 */
struct uprobe_xol_ops {
@@ -1270,6 +1548,9 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (ret)
return ret;
+ if (can_optimize(auprobe, addr))
+ set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+
ret = branch_setup_xol_ops(auprobe, &insn);
if (ret != -ENOSYS)
return ret;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b6b077cc7d0f..08ef78439d0d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -192,7 +192,7 @@ struct uprobes_state {
};
typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
- uprobe_opcode_t *insn, int nbytes);
+ uprobe_opcode_t *insn, int nbytes, void *data);
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
@@ -204,7 +204,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t,
bool is_register);
extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
- uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
+ void *data);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
@@ -240,6 +241,7 @@ extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *
extern void arch_uprobe_clear_state(struct mm_struct *mm);
extern void arch_uprobe_init_state(struct mm_struct *mm);
extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
+extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cbba31c0495f..e54081beeab9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -192,7 +192,7 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
}
static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *insn,
- int nbytes)
+ int nbytes, void *data)
{
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -492,12 +492,13 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
bool is_register)
{
return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
- verify_opcode, is_register, true /* do_update_ref_ctr */);
+ verify_opcode, is_register, true /* do_update_ref_ctr */, NULL);
}
int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
- uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr)
+ uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
+ void *data)
{
const unsigned long vaddr = insn_vaddr & PAGE_MASK;
struct mm_struct *mm = vma->vm_mm;
@@ -531,7 +532,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
goto out;
folio = page_folio(page);
- ret = verify(page, insn_vaddr, insn, nbytes);
+ ret = verify(page, insn_vaddr, insn, nbytes, data);
if (ret <= 0) {
folio_put(folio);
goto out;
@@ -2697,6 +2698,10 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
return true;
}
+void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+}
+
/*
* Run handler and ask thread to singlestep.
* Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
@@ -2761,6 +2766,9 @@ static void handle_swbp(struct pt_regs *regs)
handler_chain(uprobe, regs);
+ /* Try to optimize after first hit. */
+ arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
+
if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
goto out;
--
2.50.1
* [PATCHv6 perf/core 11/22] selftests/bpf: Import usdt.h from libbpf/usdt project
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (9 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 12/22] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
` (10 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Importing usdt.h from the libbpf/usdt project.
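For context, a minimal example of how the imported header is meant to be
used; the group/probe names and the helper below are made up for
illustration (see the header's own documentation for the full API):

  #include "usdt.h"

  /* stand-in for whatever expensive data collection an app would do */
  static long collect_extra_data(void)
  {
          return 42;
  }

  void handle_request(int fd)
  {
          long extra = 0;

          /* semaphore-less probe: just a nop (plus an ELF note) when
           * nothing is tracing it */
          USDT(myapp, request_done, fd);

          /* probe with an implicit semaphore: only gather the extra
           * data when a tracer is actually attached */
          if (USDT_IS_ACTIVE(myapp, request_slow))
                  extra = collect_extra_data();
          USDT_WITH_SEMA(myapp, request_slow, fd, extra);
  }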
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/usdt.h | 545 +++++++++++++++++++++++++++++
1 file changed, 545 insertions(+)
create mode 100644 tools/testing/selftests/bpf/usdt.h
diff --git a/tools/testing/selftests/bpf/usdt.h b/tools/testing/selftests/bpf/usdt.h
new file mode 100644
index 000000000000..549d1f774810
--- /dev/null
+++ b/tools/testing/selftests/bpf/usdt.h
@@ -0,0 +1,545 @@
+// SPDX-License-Identifier: BSD-2-Clause
+/*
+ * This single-header library defines a collection of variadic macros for
+ * defining and triggering USDTs (User Statically-Defined Tracepoints):
+ *
+ * - For USDTs without associated semaphore:
+ * USDT(group, name, args...)
+ *
+ * - For USDTs with implicit (transparent to the user) semaphore:
+ * USDT_WITH_SEMA(group, name, args...)
+ * USDT_IS_ACTIVE(group, name)
+ *
+ * - For USDTs with explicit (user-defined and provided) semaphore:
+ * USDT_WITH_EXPLICIT_SEMA(sema, group, name, args...)
+ * USDT_SEMA_IS_ACTIVE(sema)
+ *
+ * all of which emit a NOP instruction into the instruction stream, and so
+ * have *zero* overhead for the surrounding code. USDTs are identified by
+ * a combination of `group` and `name` identifiers, which is used by external
+ * tracing tooling (tracers) for identifying exact USDTs of interest.
+ *
+ * USDTs can have an associated (2-byte) activity counter (USDT semaphore),
+ * automatically maintained by Linux kernel whenever any correctly written
+ * BPF-based tracer is attached to the USDT. This USDT semaphore can be used
+ * to check whether there is a need to do any extra data collection and
+ * processing for a given USDT (if necessary), and otherwise avoid extra work
+ * for a common case of USDT not being traced ("active").
+ *
+ * See documentation for USDT_WITH_SEMA()/USDT_IS_ACTIVE() or
+ * USDT_WITH_EXPLICIT_SEMA()/USDT_SEMA_IS_ACTIVE() APIs below for details on
+ * working with USDTs with implicitly or explicitly associated
+ * USDT semaphores, respectively.
+ *
+ * There is also some additional data recorded into an auxiliary note
+ * section. The data in the note section describes the operands, in terms of
+ * size and location, used by tracing tooling to know where to find USDT
+ * arguments. Each location is encoded as an assembler operand string.
+ * Tracing tools (bpftrace and BPF-based tracers, systemtap, etc) insert
+ * breakpoints on top of the nop, and decode the location operand-strings,
+ * like an assembler, to find the values being passed.
+ *
+ * The operand strings are selected by the compiler for each operand.
+ * They are constrained by inline-assembler codes. The default is:
+ *
+ * #define USDT_ARG_CONSTRAINT nor
+ *
+ * This is a good default if the operands tend to be integral and
+ * moderate in number (smaller than number of registers). In other
+ * cases, the compiler may report "'asm' requires impossible reload" or
+ * similar. In this case, consider simplifying the macro call (fewer
+ * and simpler operands), reduce optimization, or override the default
+ * constraints string via:
+ *
+ * #define USDT_ARG_CONSTRAINT g
+ * #include <usdt.h>
+ *
+ * For some historical description of USDT v3 format (the one used by this
+ * library and generally recognized and assumed by BPF-based tracing tools)
+ * see [0]. The more formal specification can be found at [1]. Additional
+ * argument constraints information can be found at [2].
+ *
+ * Original SystemTap's sys/sdt.h implementation ([3]) was used as a base for
+ * this USDT library implementation. Current implementation differs *a lot* in
+ * terms of exposed user API and general usability, which was the main goal
+ * and focus of the reimplementation work. Nevertheless, underlying recorded
+ * USDT definitions are fully binary compatible and any USDT-based tooling
+ * should work equally well with USDTs defined by either SystemTap's or this
+ * library's USDT implementation.
+ *
+ * [0] https://ecos.sourceware.org/ml/systemtap/2010-q3/msg00145.html
+ * [1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
+ * [2] https://gcc.gnu.org/onlinedocs/gcc/Constraints.html
+ * [3] https://sourceware.org/git/?p=systemtap.git;a=blob;f=includes/sys/sdt.h
+ */
+#ifndef __USDT_H
+#define __USDT_H
+
+/*
+ * Changelog:
+ *
+ * 0.1.0
+ * -----
+ * - Initial release
+ */
+#define USDT_MAJOR_VERSION 0
+#define USDT_MINOR_VERSION 1
+#define USDT_PATCH_VERSION 0
+
+/* C++20 and C23 added __VA_OPT__ as a standard replacement for non-standard `##__VA_ARGS__` extension */
+#if (defined(__STDC_VERSION__) && __STDC_VERSION__ > 201710L) || (defined(__cplusplus) && __cplusplus > 201703L)
+#define __usdt_va_opt 1
+#define __usdt_va_args(...) __VA_OPT__(,) __VA_ARGS__
+#else
+#define __usdt_va_args(...) , ##__VA_ARGS__
+#endif
+
+/*
+ * Trigger USDT with `group`:`name` identifier and pass through `args` as its
+ * arguments. Zero arguments are acceptable as well. No USDT semaphore is
+ * associated with this USDT.
+ *
+ * Such "semaphoreless" USDTs are commonly used when there is no extra data
+ * collection or processing needed to collect and prepare USDT arguments and
+ * they are just available in the surrounding code. USDT() macro will just
+ * record their locations in CPU registers or in memory for tracing tooling to
+ * be able to access them, if necessary.
+ */
+#ifdef __usdt_va_opt
+#define USDT(group, name, ...) \
+ __usdt_probe(group, name, __usdt_sema_none, 0 __VA_OPT__(,) __VA_ARGS__)
+#else
+#define USDT(group, name, ...) \
+ __usdt_probe(group, name, __usdt_sema_none, 0, ##__VA_ARGS__)
+#endif
+
+/*
+ * Trigger USDT with `group`:`name` identifier and pass through `args` as its
+ * arguments. Zero arguments are acceptable as well. USDT also get an
+ * implicitly-defined associated USDT semaphore, which will be "activated" by
+ * tracing tooling and can be used to check whether USDT is being actively
+ * observed.
+ *
+ * USDTs with semaphore are commonly used when there is a need to perform
+ * additional data collection and processing to prepare USDT arguments, which
+ * otherwise might not be necessary for the rest of application logic. In such
+ * case, USDT semaphore can be used to avoid unnecessary extra work. If USDT
+ * is not traced (which is presumed to be a common situation), the associated
+ * USDT semaphore is "inactive", and so there is no need to waste resources to
+ * prepare USDT arguments. Use USDT_IS_ACTIVE(group, name) to check whether
+ * USDT is "active".
+ *
+ * N.B. There is an inherent (albeit short) gap between checking whether USDT
+ * is active and triggering corresponding USDT, in which external tracer can
+ * be attached to an USDT and activate USDT semaphore after the activity check.
+ * If such a race occurs, tracers might miss one USDT execution. Tracers are
+ * expected to accommodate such possibility and this is expected to not be
+ * a problem for applications and tracers.
+ *
+ * N.B. Implicit USDT semaphore defined by USDT_WITH_SEMA() is contained
+ * within a single executable or shared library and is not shared outside
+ * them. I.e., if you use USDT_WITH_SEMA() with the same USDT group and name
+ * identifier across executable and shared library, it will work and won't
+ * conflict, per se, but will define independent USDT semaphores, one for each
+ * shared library/executable in which USDT_WITH_SEMA(group, name) is used.
+ * That is, if you attach to this USDT in one shared library (or executable),
+ * then only USDT semaphore within that shared library (or executable) will be
+ * updated by the kernel, while other libraries (or executable) will not see
+ * activated USDT semaphore. In short, it's best to use unique USDT group:name
+ * identifiers across different shared libraries (and, equivalently, between
+ * executable and shared library). This is advanced consideration and is
+ * rarely (if ever) seen in practice, but just to avoid surprises this is
+ * called out here. (Static libraries become a part of final executable, once
+ * linked by linker, so the above considerations don't apply to them.)
+ */
+#ifdef __usdt_va_opt
+#define USDT_WITH_SEMA(group, name, ...) \
+ __usdt_probe(group, name, \
+ __usdt_sema_implicit, __usdt_sema_name(group, name) \
+ __VA_OPT__(,) __VA_ARGS__)
+#else
+#define USDT_WITH_SEMA(group, name, ...) \
+ __usdt_probe(group, name, \
+ __usdt_sema_implicit, __usdt_sema_name(group, name), \
+ ##__VA_ARGS__)
+#endif
+
+struct usdt_sema { volatile unsigned short active; };
+
+/*
+ * Check if USDT with `group`:`name` identifier is "active" (i.e., whether it
+ * is attached to by external tracing tooling and is actively observed).
+ *
+ * This macro can be used to decide whether any additional and potentially
+ * expensive data collection or processing should be done to pass extra
+ * information into the given USDT. It is assumed that USDT is triggered with
+ * USDT_WITH_SEMA() macro which will implicitly define associated USDT
+ * semaphore. (If one needs more control over USDT semaphore, see
+ * USDT_DEFINE_SEMA() and USDT_WITH_EXPLICIT_SEMA() macros below.)
+ *
+ * N.B. Such checks are necessarily racy and speculative. Between checking
+ * whether USDT is active and triggering the USDT itself, tracer can be
+ * detached with no notification. This race should be extremely rare and worst
+ * case should result in one-time wasted extra data collection and processing.
+ */
+#define USDT_IS_ACTIVE(group, name) ({ \
+ extern struct usdt_sema __usdt_sema_name(group, name) \
+ __usdt_asm_name(__usdt_sema_name(group, name)); \
+ __usdt_sema_implicit(__usdt_sema_name(group, name)); \
+ __usdt_sema_name(group, name).active > 0; \
+})
+
+/*
+ * APIs for working with user-defined explicit USDT semaphores.
+ *
+ * This is a less commonly used advanced API for use cases in which user needs
+ * an explicit control over (potentially shared across multiple USDTs) USDT
+ * semaphore instance. This can be used when there is a group of logically
+ * related USDTs that all need extra data collection and processing whenever
+ * any of a family of related USDTs are "activated" (i.e., traced). In such
+ * a case, all such related USDTs will be associated with the same shared USDT
+ * semaphore defined with USDT_DEFINE_SEMA() and the USDTs themselves will be
+ * triggered with USDT_WITH_EXPLICIT_SEMA() macros, taking an explicit extra
+ * USDT semaphore identifier as an extra parameter.
+ */
+
+/**
+ * Underlying C global variable name for user-defined USDT semaphore with
+ * `sema` identifier. Could be useful for debugging, but normally shouldn't be
+ * used explicitly.
+ */
+#define USDT_SEMA(sema) __usdt_sema_##sema
+
+/*
+ * Define storage for user-defined USDT semaphore `sema`.
+ *
+ * Should be used only once in non-header source file to let compiler allocate
+ * space for the semaphore variable. Just like with any other global variable.
+ *
+ * This macro can be used anywhere where global variable declaration is
+ * allowed. Just like with global variable definitions, there should be only
+ * one definition of user-defined USDT semaphore with given `sema` identifier,
+ * otherwise compiler or linker will complain about duplicate variable
+ * definition.
+ *
+ * For C++, it is allowed to use USDT_DEFINE_SEMA() both in global namespace
+ * and inside namespaces (including nested namespaces). Just make sure that
+ * USDT_DECLARE_SEMA() is placed within the namespace where this semaphore is
+ * referenced, or any of its parent namespaces, so the C++ language-level
+ * identifier is visible to the code that needs to reference the semaphore.
+ * At the lowest layer, USDT semaphores have global naming and visibility
+ * (they have a corresponding `__usdt_sema_<name>` symbol, which can be linked
+ * against from C or C++ code, if necessary). To keep it simple, putting
+ * USDT_DECLARE_SEMA() declarations into global namespaces is the simplest
+ * no-brainer solution. All these aspects are irrelevant for plain C, because
+ * C doesn't have namespaces and everything is always in the global namespace.
+ *
+ * N.B. Due to USDT metadata being recorded in non-allocatable ELF note
+ * section, it has limitations when it comes to relocations, which, in
+ * practice, means that it's not possible to correctly share USDT semaphores
+ * between main executable and shared libraries, or even between multiple
+ * shared libraries. USDT semaphore has to be contained to individual shared
+ * library or executable to avoid unpleasant surprises with half-working USDT
+ * semaphores. We enforce this by marking semaphore ELF symbols as having
+ * a hidden visibility. This is quite an advanced use case and consideration
+ * and for most users this should have no consequences whatsoever.
+ */
+#define USDT_DEFINE_SEMA(sema) \
+ struct usdt_sema __usdt_sema_sec USDT_SEMA(sema) \
+ __usdt_asm_name(USDT_SEMA(sema)) \
+ __attribute__((visibility("hidden"))) = { 0 }
+
+/*
+ * Declare extern reference to user-defined USDT semaphore `sema`.
+ *
+ * Refers to a variable defined in another compilation unit by
+ * USDT_DEFINE_SEMA() and allows to use the same USDT semaphore across
+ * multiple compilation units (i.e., .c and .cpp files).
+ *
+ * See USDT_DEFINE_SEMA() notes above for C++ language usage peculiarities.
+ */
+#define USDT_DECLARE_SEMA(sema) \
+ extern struct usdt_sema USDT_SEMA(sema) __usdt_asm_name(USDT_SEMA(sema))
+
+/*
+ * Check if user-defined USDT semaphore `sema` is "active" (i.e., whether it
+ * is attached to by external tracing tooling and is actively observed).
+ *
+ * This macro can be used to decide whether any additional and potentially
+ * expensive data collection or processing should be done to pass extra
+ * information into USDT(s) associated with USDT semaphore `sema`.
+ *
+ * N.B. Such checks are necessarily racy. Between checking the state of USDT
+ * semaphore and triggering associated USDT(s), the active tracer might attach
+ * or detach. This race should be extremely rare and worst case should result
+ * in one-time missed USDT event or wasted extra data collection and
+ * processing. USDT-using tracers should be written with this in mind and is
+ * not a concern of the application defining USDTs with associated semaphore.
+ */
+#define USDT_SEMA_IS_ACTIVE(sema) (USDT_SEMA(sema).active > 0)
+
+/*
+ * Invoke USDT specified by `group` and `name` identifiers and associate
+ * explicitly user-defined semaphore `sema` with it. Pass through `args` as
+ * USDT arguments. `args` are optional and zero arguments are acceptable.
+ *
+ * Semaphore is defined with the help of USDT_DEFINE_SEMA() macro and can be
+ * checked whether active with USDT_SEMA_IS_ACTIVE().
+ */
+#ifdef __usdt_va_opt
+#define USDT_WITH_EXPLICIT_SEMA(sema, group, name, ...) \
+	__usdt_probe(group, name, __usdt_sema_explicit, USDT_SEMA(sema) __VA_OPT__(,) __VA_ARGS__)
+#else
+#define USDT_WITH_EXPLICIT_SEMA(sema, group, name, ...) \
+	__usdt_probe(group, name, __usdt_sema_explicit, USDT_SEMA(sema), ##__VA_ARGS__)
+#endif
+
+/*
+ * Adjustable implementation aspects
+ */
+#ifndef USDT_ARG_CONSTRAINT
+#if defined __powerpc__
+#define USDT_ARG_CONSTRAINT nZr
+#elif defined __arm__
+#define USDT_ARG_CONSTRAINT g
+#elif defined __loongarch__
+#define USDT_ARG_CONSTRAINT nmr
+#else
+#define USDT_ARG_CONSTRAINT nor
+#endif
+#endif /* USDT_ARG_CONSTRAINT */
+
+#ifndef USDT_NOP
+#if defined(__ia64__) || defined(__s390__) || defined(__s390x__)
+#define USDT_NOP nop 0
+#else
+#define USDT_NOP nop
+#endif
+#endif /* USDT_NOP */
+
+/*
+ * Implementation details
+ */
+/* USDT name for implicitly-defined USDT semaphore, derived from group:name */
+#define __usdt_sema_name(group, name) __usdt_sema_##group##__##name
+/* ELF section into which USDT semaphores are put */
+#define __usdt_sema_sec __attribute__((section(".probes")))
+
+#define __usdt_concat(a, b) a ## b
+#define __usdt_apply(fn, n) __usdt_concat(fn, n)
+
+#ifndef __usdt_nth
+#define __usdt_nth(_, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, N, ...) N
+#endif
+
+#ifndef __usdt_narg
+#ifdef __usdt_va_opt
+#define __usdt_narg(...) __usdt_nth(_ __VA_OPT__(,) __VA_ARGS__, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+#else
+#define __usdt_narg(...) __usdt_nth(_, ##__VA_ARGS__, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+#endif
+#endif /* __usdt_narg */
+
+#define __usdt_hash #
+#define __usdt_str_(x) #x
+#define __usdt_str(x) __usdt_str_(x)
+
+#ifndef __usdt_asm_name
+#define __usdt_asm_name(name) __asm__(__usdt_str(name))
+#endif
+
+#define __usdt_asm0() "\n"
+#define __usdt_asm1(x) __usdt_str(x) "\n"
+#define __usdt_asm2(x, ...) __usdt_str(x) "," __usdt_asm1(__VA_ARGS__)
+#define __usdt_asm3(x, ...) __usdt_str(x) "," __usdt_asm2(__VA_ARGS__)
+#define __usdt_asm4(x, ...) __usdt_str(x) "," __usdt_asm3(__VA_ARGS__)
+#define __usdt_asm5(x, ...) __usdt_str(x) "," __usdt_asm4(__VA_ARGS__)
+#define __usdt_asm6(x, ...) __usdt_str(x) "," __usdt_asm5(__VA_ARGS__)
+#define __usdt_asm7(x, ...) __usdt_str(x) "," __usdt_asm6(__VA_ARGS__)
+#define __usdt_asm8(x, ...) __usdt_str(x) "," __usdt_asm7(__VA_ARGS__)
+#define __usdt_asm9(x, ...) __usdt_str(x) "," __usdt_asm8(__VA_ARGS__)
+#define __usdt_asm10(x, ...) __usdt_str(x) "," __usdt_asm9(__VA_ARGS__)
+#define __usdt_asm11(x, ...) __usdt_str(x) "," __usdt_asm10(__VA_ARGS__)
+#define __usdt_asm12(x, ...) __usdt_str(x) "," __usdt_asm11(__VA_ARGS__)
+#define __usdt_asm(...) __usdt_apply(__usdt_asm, __usdt_narg(__VA_ARGS__))(__VA_ARGS__)
+
+#ifdef __LP64__
+#define __usdt_asm_addr .8byte
+#else
+#define __usdt_asm_addr .4byte
+#endif
+
+#define __usdt_asm_strz_(x) __usdt_asm1(.asciz #x)
+#define __usdt_asm_strz(x) __usdt_asm_strz_(x)
+#define __usdt_asm_str_(x) __usdt_asm1(.ascii #x)
+#define __usdt_asm_str(x) __usdt_asm_str_(x)
+
+/* "semaphoreless" USDT case */
+#ifndef __usdt_sema_none
+#define __usdt_sema_none(sema)
+#endif
+
+/* implicitly defined __usdt_sema__group__name semaphore (using weak symbols) */
+#ifndef __usdt_sema_implicit
+#define __usdt_sema_implicit(sema) \
+ __asm__ __volatile__ ( \
+ __usdt_asm1(.ifndef sema) \
+ __usdt_asm3( .pushsection .probes, "aw", "progbits") \
+ __usdt_asm1( .weak sema) \
+ __usdt_asm1( .hidden sema) \
+ __usdt_asm1( .align 2) \
+ __usdt_asm1(sema:) \
+ __usdt_asm1( .zero 2) \
+ __usdt_asm2( .type sema, @object) \
+ __usdt_asm2( .size sema, 2) \
+ __usdt_asm1( .popsection) \
+ __usdt_asm1(.endif) \
+ );
+#endif
+
+/* externally defined semaphore using USDT_DEFINE_SEMA() and passed explicitly by user */
+#ifndef __usdt_sema_explicit
+#define __usdt_sema_explicit(sema) \
+ __asm__ __volatile__ ("" :: "m" (sema));
+#endif
+
+/* main USDT definition (nop and .note.stapsdt metadata) */
+#define __usdt_probe(group, name, sema_def, sema, ...) do { \
+ sema_def(sema) \
+ __asm__ __volatile__ ( \
+ __usdt_asm( 990: USDT_NOP) \
+ __usdt_asm3( .pushsection .note.stapsdt, "", "note") \
+ __usdt_asm1( .balign 4) \
+ __usdt_asm3( .4byte 992f-991f,994f-993f,3) \
+ __usdt_asm1(991: .asciz "stapsdt") \
+ __usdt_asm1(992: .balign 4) \
+ __usdt_asm1(993: __usdt_asm_addr 990b) \
+ __usdt_asm1( __usdt_asm_addr _.stapsdt.base) \
+ __usdt_asm1( __usdt_asm_addr sema) \
+ __usdt_asm_strz(group) \
+ __usdt_asm_strz(name) \
+ __usdt_asm_args(__VA_ARGS__) \
+ __usdt_asm1( .ascii "\0") \
+ __usdt_asm1(994: .balign 4) \
+ __usdt_asm1( .popsection) \
+ __usdt_asm1(.ifndef _.stapsdt.base) \
+ __usdt_asm5( .pushsection .stapsdt.base,"aG","progbits",.stapsdt.base,comdat)\
+ __usdt_asm1( .weak _.stapsdt.base) \
+ __usdt_asm1( .hidden _.stapsdt.base) \
+ __usdt_asm1(_.stapsdt.base:) \
+ __usdt_asm1( .space 1) \
+ __usdt_asm2( .size _.stapsdt.base, 1) \
+ __usdt_asm1( .popsection) \
+ __usdt_asm1(.endif) \
+ :: __usdt_asm_ops(__VA_ARGS__) \
+ ); \
+} while (0)
+
+/*
+ * NB: gdb PR24541 highlighted an unspecified corner of the sdt.h
+ * operand note format.
+ *
+ * The named register may be a longer or shorter (!) alias for the
+ * storage where the value in question is found. For example, on
+ * i386, 64-bit value may be put in register pairs, and a register
+ * name stored would identify just one of them. Previously, gcc was
+ * asked to emit the %w[id] (16-bit alias of some registers holding
+ * operands), even when a wider 32-bit value was used.
+ *
+ * Bottom line: the byte-width given before the @ sign governs. If
+ * there is a mismatch between that width and that of the named
+ * register, then a sys/sdt.h note consumer may need to employ
+ * architecture-specific heuristics to figure out where the compiler
+ * has actually put the complete value.
+ */
+#if defined(__powerpc__) || defined(__powerpc64__)
+#define __usdt_argref(id) %I[id]%[id]
+#elif defined(__i386__)
+#define __usdt_argref(id) %k[id] /* gcc.gnu.org/PR80115 sourceware.org/PR24541 */
+#else
+#define __usdt_argref(id) %[id]
+#endif
+
+#define __usdt_asm_arg(n) __usdt_asm_str(%c[__usdt_asz##n]) \
+ __usdt_asm1(.ascii "@") \
+ __usdt_asm_str(__usdt_argref(__usdt_aval##n))
+
+#define __usdt_asm_args0 /* no arguments */
+#define __usdt_asm_args1 __usdt_asm_arg(1)
+#define __usdt_asm_args2 __usdt_asm_args1 __usdt_asm1(.ascii " ") __usdt_asm_arg(2)
+#define __usdt_asm_args3 __usdt_asm_args2 __usdt_asm1(.ascii " ") __usdt_asm_arg(3)
+#define __usdt_asm_args4 __usdt_asm_args3 __usdt_asm1(.ascii " ") __usdt_asm_arg(4)
+#define __usdt_asm_args5 __usdt_asm_args4 __usdt_asm1(.ascii " ") __usdt_asm_arg(5)
+#define __usdt_asm_args6 __usdt_asm_args5 __usdt_asm1(.ascii " ") __usdt_asm_arg(6)
+#define __usdt_asm_args7 __usdt_asm_args6 __usdt_asm1(.ascii " ") __usdt_asm_arg(7)
+#define __usdt_asm_args8 __usdt_asm_args7 __usdt_asm1(.ascii " ") __usdt_asm_arg(8)
+#define __usdt_asm_args9 __usdt_asm_args8 __usdt_asm1(.ascii " ") __usdt_asm_arg(9)
+#define __usdt_asm_args10 __usdt_asm_args9 __usdt_asm1(.ascii " ") __usdt_asm_arg(10)
+#define __usdt_asm_args11 __usdt_asm_args10 __usdt_asm1(.ascii " ") __usdt_asm_arg(11)
+#define __usdt_asm_args12 __usdt_asm_args11 __usdt_asm1(.ascii " ") __usdt_asm_arg(12)
+#define __usdt_asm_args(...) __usdt_apply(__usdt_asm_args, __usdt_narg(__VA_ARGS__))
+
+#define __usdt_is_arr(x) (__builtin_classify_type(x) == 14 || __builtin_classify_type(x) == 5)
+#define __usdt_arg_size(x) (__usdt_is_arr(x) ? sizeof(void *) : sizeof(x))
+
+/*
+ * We can't use __builtin_choose_expr() in C++, so fall back to table-based
+ * signedness determination for known types, utilizing templates magic.
+ */
+#ifdef __cplusplus
+
+#define __usdt_is_signed(x) (!__usdt_is_arr(x) && __usdt_t<__typeof(x)>::is_signed)
+
+#include <cstddef>
+
+template<typename T> struct __usdt_t { static const bool is_signed = false; };
+template<typename A> struct __usdt_t<A[]> : public __usdt_t<A *> {};
+template<typename A, size_t N> struct __usdt_t<A[N]> : public __usdt_t<A *> {};
+
+#define __usdt_def_signed(T) \
+template<> struct __usdt_t<T> { static const bool is_signed = true; }; \
+template<> struct __usdt_t<const T> { static const bool is_signed = true; }; \
+template<> struct __usdt_t<volatile T> { static const bool is_signed = true; }; \
+template<> struct __usdt_t<const volatile T> { static const bool is_signed = true; }
+#define __usdt_maybe_signed(T) \
+template<> struct __usdt_t<T> { static const bool is_signed = (T)-1 < (T)1; }; \
+template<> struct __usdt_t<const T> { static const bool is_signed = (T)-1 < (T)1; }; \
+template<> struct __usdt_t<volatile T> { static const bool is_signed = (T)-1 < (T)1; }; \
+template<> struct __usdt_t<const volatile T> { static const bool is_signed = (T)-1 < (T)1; }
+
+__usdt_def_signed(signed char);
+__usdt_def_signed(short);
+__usdt_def_signed(int);
+__usdt_def_signed(long);
+__usdt_def_signed(long long);
+__usdt_maybe_signed(char);
+__usdt_maybe_signed(wchar_t);
+
+#else /* !__cplusplus */
+
+#define __usdt_is_inttype(x) (__builtin_classify_type(x) >= 1 && __builtin_classify_type(x) <= 4)
+#define __usdt_inttype(x) __typeof(__builtin_choose_expr(__usdt_is_inttype(x), (x), 0U))
+#define __usdt_is_signed(x) ((__usdt_inttype(x))-1 < (__usdt_inttype(x))1)
+
+#endif /* __cplusplus */
+
+#define __usdt_asm_op(n, x) \
+ [__usdt_asz##n] "n" ((__usdt_is_signed(x) ? (int)-1 : 1) * (int)__usdt_arg_size(x)), \
+ [__usdt_aval##n] __usdt_str(USDT_ARG_CONSTRAINT)(x)
+
+#define __usdt_asm_ops0() [__usdt_dummy] "g" (0)
+#define __usdt_asm_ops1(x) __usdt_asm_op(1, x)
+#define __usdt_asm_ops2(a,x) __usdt_asm_ops1(a), __usdt_asm_op(2, x)
+#define __usdt_asm_ops3(a,b,x) __usdt_asm_ops2(a,b), __usdt_asm_op(3, x)
+#define __usdt_asm_ops4(a,b,c,x) __usdt_asm_ops3(a,b,c), __usdt_asm_op(4, x)
+#define __usdt_asm_ops5(a,b,c,d,x) __usdt_asm_ops4(a,b,c,d), __usdt_asm_op(5, x)
+#define __usdt_asm_ops6(a,b,c,d,e,x) __usdt_asm_ops5(a,b,c,d,e), __usdt_asm_op(6, x)
+#define __usdt_asm_ops7(a,b,c,d,e,f,x) __usdt_asm_ops6(a,b,c,d,e,f), __usdt_asm_op(7, x)
+#define __usdt_asm_ops8(a,b,c,d,e,f,g,x) __usdt_asm_ops7(a,b,c,d,e,f,g), __usdt_asm_op(8, x)
+#define __usdt_asm_ops9(a,b,c,d,e,f,g,h,x) __usdt_asm_ops8(a,b,c,d,e,f,g,h), __usdt_asm_op(9, x)
+#define __usdt_asm_ops10(a,b,c,d,e,f,g,h,i,x) __usdt_asm_ops9(a,b,c,d,e,f,g,h,i), __usdt_asm_op(10, x)
+#define __usdt_asm_ops11(a,b,c,d,e,f,g,h,i,j,x) __usdt_asm_ops10(a,b,c,d,e,f,g,h,i,j), __usdt_asm_op(11, x)
+#define __usdt_asm_ops12(a,b,c,d,e,f,g,h,i,j,k,x) __usdt_asm_ops11(a,b,c,d,e,f,g,h,i,j,k), __usdt_asm_op(12, x)
+#define __usdt_asm_ops(...) __usdt_apply(__usdt_asm_ops, __usdt_narg(__VA_ARGS__))(__VA_ARGS__)
+
+#endif /* __USDT_H */
--
2.50.1
* [PATCHv6 perf/core 12/22] selftests/bpf: Reorg the uprobe_syscall test function
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (10 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 11/22] selftests/bpf: Import usdt.h from libbpf/usdt project Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 13/22] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi Jiri Olsa
` (9 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Adding __test_uprobe_syscall with a non-x86_64 stub to execute all the
tests, so we don't need to keep adding non-x86_64 stub functions for new
tests.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 34 +++++++------------
1 file changed, 12 insertions(+), 22 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index b17dc39a23db..a8f00aee7799 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -350,29 +350,8 @@ static void test_uretprobe_shadow_stack(void)
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
}
-#else
-static void test_uretprobe_regs_equal(void)
-{
- test__skip();
-}
-
-static void test_uretprobe_regs_change(void)
-{
- test__skip();
-}
-
-static void test_uretprobe_syscall_call(void)
-{
- test__skip();
-}
-static void test_uretprobe_shadow_stack(void)
-{
- test__skip();
-}
-#endif
-
-void test_uprobe_syscall(void)
+static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
test_uretprobe_regs_equal();
@@ -383,3 +362,14 @@ void test_uprobe_syscall(void)
if (test__start_subtest("uretprobe_shadow_stack"))
test_uretprobe_shadow_stack();
}
+#else
+static void __test_uprobe_syscall(void)
+{
+ test__skip();
+}
+#endif
+
+void test_uprobe_syscall(void)
+{
+ __test_uprobe_syscall();
+}
--
2.50.1
* [PATCHv6 perf/core 13/22] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (11 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 12/22] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 14/22] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
` (8 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Renaming the uprobe_syscall_executed prog to test_uretprobe_multi so it
fits properly with the following changes that add more programs.
Plus adding a pid filter and incrementing the executed variable instead of
just setting it.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 12 ++++++++----
.../selftests/bpf/progs/uprobe_syscall_executed.c | 8 ++++++--
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index a8f00aee7799..6d58a44da2b2 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -252,6 +252,7 @@ static void test_uretprobe_syscall_call(void)
);
struct uprobe_syscall_executed *skel;
int pid, status, err, go[2], c = 0;
+ struct bpf_link *link;
if (!ASSERT_OK(pipe(go), "pipe"))
return;
@@ -277,11 +278,14 @@ static void test_uretprobe_syscall_call(void)
_exit(0);
}
- skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, pid,
- "/proc/self/exe",
- "uretprobe_syscall_call", &opts);
- if (!ASSERT_OK_PTR(skel->links.test, "bpf_program__attach_uprobe_multi"))
+ skel->bss->pid = pid;
+
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uretprobe_multi,
+ pid, "/proc/self/exe",
+ "uretprobe_syscall_call", &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
goto cleanup;
+ skel->links.test_uretprobe_multi = link;
/* kick the child */
write(go[1], &c, 1);
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
index 0d7f1a7db2e2..8f48976a33aa 100644
--- a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
@@ -8,10 +8,14 @@ struct pt_regs regs;
char _license[] SEC("license") = "GPL";
int executed = 0;
+int pid;
SEC("uretprobe.multi")
-int test(struct pt_regs *regs)
+int test_uretprobe_multi(struct pt_regs *ctx)
{
- executed = 1;
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
return 0;
}
--
2.50.1
* [PATCHv6 perf/core 14/22] selftests/bpf: Add uprobe/usdt syscall tests
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (12 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 13/22] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 15/22] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
` (7 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Adding tests for optimized uprobe/usdt probes, checking that we get the
expected trampoline and that the attached bpf programs get executed
properly.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 284 +++++++++++++++++-
.../bpf/progs/uprobe_syscall_executed.c | 52 ++++
2 files changed, 335 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 6d58a44da2b2..b91135abcf8a 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -8,6 +8,7 @@
#include <asm/ptrace.h>
#include <linux/compiler.h>
#include <linux/stringify.h>
+#include <linux/kernel.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
@@ -15,6 +16,11 @@
#include "uprobe_syscall.skel.h"
#include "uprobe_syscall_executed.skel.h"
+#define USDT_NOP .byte 0x0f, 0x1f, 0x44, 0x00, 0x00
+#include "usdt.h"
+
+#pragma GCC diagnostic ignored "-Wattributes"
+
__naked unsigned long uretprobe_regs_trigger(void)
{
asm volatile (
@@ -305,6 +311,265 @@ static void test_uretprobe_syscall_call(void)
close(go[0]);
}
+#define TRAMP "[uprobes-trampoline]"
+
+__attribute__((aligned(16)))
+__nocf_check __weak __naked void uprobe_test(void)
+{
+ asm volatile (" \n"
+ ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00 \n"
+ "ret \n"
+ );
+}
+
+__attribute__((aligned(16)))
+__nocf_check __weak void usdt_test(void)
+{
+ USDT(optimized_uprobe, usdt);
+}
+
+static int find_uprobes_trampoline(void *tramp_addr)
+{
+ void *start, *end;
+ char line[128];
+ int ret = -1;
+ FILE *maps;
+
+ maps = fopen("/proc/self/maps", "r");
+ if (!maps) {
+ fprintf(stderr, "cannot open maps\n");
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), maps)) {
+ int m = -1;
+
+ /* We care only about private r-x mappings. */
+ if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", &start, &end, &m) != 2)
+ continue;
+ if (m < 0)
+ continue;
+ if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1) && (start == tramp_addr)) {
+ ret = 0;
+ break;
+ }
+ }
+
+ fclose(maps);
+ return ret;
+}
+
+static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+
+static void *find_nop5(void *fn)
+{
+ int i;
+
+ for (i = 0; i < 10; i++) {
+ if (!memcmp(nop5, fn + i, 5))
+ return fn + i;
+ }
+ return NULL;
+}
+
+typedef void (__attribute__((nocf_check)) *trigger_t)(void);
+
+static bool shstk_is_enabled;
+
+static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigger,
+ void *addr, int executed)
+{
+ struct __arch_relative_insn {
+ __u8 op;
+ __s32 raddr;
+ } __packed *call;
+ void *tramp = NULL;
+ __u8 *bp;
+
+ /* Uprobe gets optimized after first trigger, so let's press twice. */
+ trigger();
+ trigger();
+
+ /* Make sure bpf program got executed.. */
+ ASSERT_EQ(skel->bss->executed, executed, "executed");
+
+ if (shstk_is_enabled) {
+ /* .. and check optimization is disabled under shadow stack. */
+ bp = (__u8 *) addr;
+ ASSERT_EQ(*bp, 0xcc, "int3");
+ } else {
+ /* .. and check the trampoline is as expected. */
+ call = (struct __arch_relative_insn *) addr;
+ tramp = (void *) (call + 1) + call->raddr;
+ ASSERT_EQ(call->op, 0xe8, "call");
+ ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
+ }
+
+ return tramp;
+}
+
+static void check_detach(void *addr, void *tramp)
+{
+ /* [uprobes_trampoline] stays after detach */
+ ASSERT_OK(!shstk_is_enabled && find_uprobes_trampoline(tramp), "uprobes_trampoline");
+ ASSERT_OK(memcmp(addr, nop5, 5), "nop5");
+}
+
+static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
+ trigger_t trigger, void *addr, int executed)
+{
+ void *tramp;
+
+ tramp = check_attach(skel, trigger, addr, executed);
+ bpf_link__destroy(link);
+ check_detach(addr, tramp);
+}
+
+static void test_uprobe_legacy(void)
+{
+ struct uprobe_syscall_executed *skel = NULL;
+ LIBBPF_OPTS(bpf_uprobe_opts, opts,
+ .retprobe = true,
+ );
+ struct bpf_link *link;
+ unsigned long offset;
+
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ goto cleanup;
+
+ /* uprobe */
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ skel->bss->pid = getpid();
+
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 2);
+
+ /* uretprobe */
+ skel->bss->executed = 0;
+
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uretprobe,
+ 0, "/proc/self/exe", offset, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 2);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
+static void test_uprobe_multi(void)
+{
+ struct uprobe_syscall_executed *skel = NULL;
+ LIBBPF_OPTS(bpf_uprobe_multi_opts, opts);
+ struct bpf_link *link;
+ unsigned long offset;
+
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ goto cleanup;
+
+ opts.offsets = &offset;
+ opts.cnt = 1;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ skel->bss->pid = getpid();
+
+ /* uprobe.multi */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 2);
+
+ /* uretprobe.multi */
+ skel->bss->executed = 0;
+ opts.retprobe = true;
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uretprobe_multi,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 2);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
+static void test_uprobe_session(void)
+{
+ struct uprobe_syscall_executed *skel = NULL;
+ LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+ .session = true,
+ );
+ struct bpf_link *link;
+ unsigned long offset;
+
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ goto cleanup;
+
+ opts.offsets = &offset;
+ opts.cnt = 1;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ skel->bss->pid = getpid();
+
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_session,
+ 0, "/proc/self/exe", NULL, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test, 4);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
+static void test_uprobe_usdt(void)
+{
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+ void *addr;
+
+ errno = 0;
+ addr = find_nop5(usdt_test);
+ if (!ASSERT_OK_PTR(addr, "find_nop5"))
+ return;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ skel->bss->pid = getpid();
+
+ link = bpf_program__attach_usdt(skel->progs.test_usdt,
+ -1 /* all PIDs */, "/proc/self/exe",
+ "optimized_uprobe", "usdt", NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+ goto cleanup;
+
+ check(skel, link, usdt_test, addr, 2);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
/*
* Borrowed from tools/testing/selftests/x86/test_shadow_stack.c.
*
@@ -347,11 +612,20 @@ static void test_uretprobe_shadow_stack(void)
return;
}
- /* Run all of the uretprobe tests. */
+ /* Run all the tests with shadow stack in place. */
+ shstk_is_enabled = true;
+
test_uretprobe_regs_equal();
test_uretprobe_regs_change();
test_uretprobe_syscall_call();
+ test_uprobe_legacy();
+ test_uprobe_multi();
+ test_uprobe_session();
+ test_uprobe_usdt();
+
+ shstk_is_enabled = false;
+
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
}
@@ -365,6 +639,14 @@ static void __test_uprobe_syscall(void)
test_uretprobe_syscall_call();
if (test__start_subtest("uretprobe_shadow_stack"))
test_uretprobe_shadow_stack();
+ if (test__start_subtest("uprobe_legacy"))
+ test_uprobe_legacy();
+ if (test__start_subtest("uprobe_multi"))
+ test_uprobe_multi();
+ if (test__start_subtest("uprobe_session"))
+ test_uprobe_session();
+ if (test__start_subtest("uprobe_usdt"))
+ test_uprobe_usdt();
}
#else
static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
index 8f48976a33aa..915d38591bf6 100644
--- a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/usdt.bpf.h>
#include <string.h>
struct pt_regs regs;
@@ -10,6 +12,36 @@ char _license[] SEC("license") = "GPL";
int executed = 0;
int pid;
+SEC("uprobe")
+int BPF_UPROBE(test_uprobe)
+{
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
+ return 0;
+}
+
+SEC("uretprobe")
+int BPF_URETPROBE(test_uretprobe)
+{
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
+ return 0;
+}
+
+SEC("uprobe.multi")
+int test_uprobe_multi(struct pt_regs *ctx)
+{
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
+ return 0;
+}
+
SEC("uretprobe.multi")
int test_uretprobe_multi(struct pt_regs *ctx)
{
@@ -19,3 +51,23 @@ int test_uretprobe_multi(struct pt_regs *ctx)
executed++;
return 0;
}
+
+SEC("uprobe.session")
+int test_uprobe_session(struct pt_regs *ctx)
+{
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
+ return 0;
+}
+
+SEC("usdt")
+int test_usdt(struct pt_regs *ctx)
+{
+ if (bpf_get_current_pid_tgid() >> 32 != pid)
+ return 0;
+
+ executed++;
+ return 0;
+}
--
2.50.1
* [PATCHv6 perf/core 15/22] selftests/bpf: Add hit/attach/detach race optimized uprobe test
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (13 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 14/22] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 16/22] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
` (6 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Adding a test that makes sure parallel execution of the uprobe and
attach/detach of an optimized uprobe on it works properly.
By default the test runs for 500ms, which is adjustable through the
BPF_SELFTESTS_UPROBE_SYSCALL_RACE_MSEC env variable.
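For orientation, here is a minimal, self-contained sketch of the shape of
the race; the probed function and the attach/detach steps are hypothetical
no-op stand-ins, while the real test below spawns alternating trigger/attach
workers (one thread per possible CPU), drives the attach side through
bpf_program__attach_uprobe_opts()/bpf_link__destroy(), and checks the USDT
semaphore at the end:

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* hypothetical stand-ins for the probed function and the libbpf calls */
static void hit_probed_function(void) { }
static void attach_optimized_uprobe(void) { }
static void detach_optimized_uprobe(void) { }

static volatile bool stop;

static void *trigger_worker(void *arg)
{
	(void)arg;
	while (!stop)
		hit_probed_function();		/* keeps hitting the nop5/call site */
	return NULL;
}

static void *attach_worker(void *arg)
{
	(void)arg;
	while (!stop) {
		attach_optimized_uprobe();	/* nop5 -> int3 -> call */
		detach_optimized_uprobe();	/* call -> int3 -> nop5 */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, trigger_worker, NULL);
	pthread_create(&b, NULL, attach_worker, NULL);
	usleep(500 * 1000);			/* default 500ms, see race_msec() below */
	stop = true;
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}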
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 108 ++++++++++++++++++
1 file changed, 108 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index b91135abcf8a..3d27c8bc019e 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -15,6 +15,7 @@
#include <asm/prctl.h>
#include "uprobe_syscall.skel.h"
#include "uprobe_syscall_executed.skel.h"
+#include "bpf/libbpf_internal.h"
#define USDT_NOP .byte 0x0f, 0x1f, 0x44, 0x00, 0x00
#include "usdt.h"
@@ -629,6 +630,111 @@ static void test_uretprobe_shadow_stack(void)
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
}
+static volatile bool race_stop;
+
+static USDT_DEFINE_SEMA(race);
+
+static void *worker_trigger(void *arg)
+{
+ unsigned long rounds = 0;
+
+ while (!race_stop) {
+ uprobe_test();
+ rounds++;
+ }
+
+ printf("tid %d trigger rounds: %lu\n", gettid(), rounds);
+ return NULL;
+}
+
+static void *worker_attach(void *arg)
+{
+ LIBBPF_OPTS(bpf_uprobe_opts, opts);
+ struct uprobe_syscall_executed *skel;
+ unsigned long rounds = 0, offset;
+ const char *sema[2] = {
+ __stringify(USDT_SEMA(race)),
+ NULL,
+ };
+ unsigned long *ref;
+ int err;
+
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ return NULL;
+
+ err = elf_resolve_syms_offsets("/proc/self/exe", 1, (const char **) &sema, &ref, STT_OBJECT);
+ if (!ASSERT_OK(err, "elf_resolve_syms_offsets_sema"))
+ return NULL;
+
+ opts.ref_ctr_offset = *ref;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return NULL;
+
+ skel->bss->pid = getpid();
+
+ while (!race_stop) {
+ skel->links.test_uprobe = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, &opts);
+ if (!ASSERT_OK_PTR(skel->links.test_uprobe, "bpf_program__attach_uprobe_opts"))
+ break;
+
+ bpf_link__destroy(skel->links.test_uprobe);
+ skel->links.test_uprobe = NULL;
+ rounds++;
+ }
+
+ printf("tid %d attach rounds: %lu hits: %d\n", gettid(), rounds, skel->bss->executed);
+ uprobe_syscall_executed__destroy(skel);
+ free(ref);
+ return NULL;
+}
+
+static useconds_t race_msec(void)
+{
+ char *env;
+
+ env = getenv("BPF_SELFTESTS_UPROBE_SYSCALL_RACE_MSEC");
+ if (env)
+ return atoi(env);
+
+ /* default duration is 500ms */
+ return 500;
+}
+
+static void test_uprobe_race(void)
+{
+ int err, i, nr_threads;
+ pthread_t *threads;
+
+ nr_threads = libbpf_num_possible_cpus();
+ if (!ASSERT_GT(nr_threads, 0, "libbpf_num_possible_cpus"))
+ return;
+ nr_threads = max(2, nr_threads);
+
+ threads = alloca(sizeof(*threads) * nr_threads);
+ if (!ASSERT_OK_PTR(threads, "malloc"))
+ return;
+
+ for (i = 0; i < nr_threads; i++) {
+ err = pthread_create(&threads[i], NULL, i % 2 ? worker_trigger : worker_attach,
+ NULL);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto cleanup;
+ }
+
+ usleep(race_msec() * 1000);
+
+cleanup:
+ race_stop = true;
+ for (nr_threads = i, i = 0; i < nr_threads; i++)
+ pthread_join(threads[i], NULL);
+
+ ASSERT_FALSE(USDT_SEMA_IS_ACTIVE(race), "race_semaphore");
+}
+
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
@@ -647,6 +753,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_session();
if (test__start_subtest("uprobe_usdt"))
test_uprobe_usdt();
+ if (test__start_subtest("uprobe_race"))
+ test_uprobe_race();
}
#else
static void __test_uprobe_syscall(void)
--
2.50.1
* [PATCHv6 perf/core 16/22] selftests/bpf: Add uprobe syscall sigill signal test
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (14 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 15/22] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 17/22] selftests/bpf: Add optimized usdt variant for basic usdt test Jiri Olsa
` (5 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Make sure that calling the uprobe syscall from outside the uprobe
trampoline results in a SIGILL signal.
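As a rough illustration of the expected behaviour, the sketch below invokes
the syscall through the libc syscall(3) wrapper from a forked child and
checks that the child dies with SIGILL; the actual test uses inline asm
instead, so the stack matches the push/pop sequence of the uprobe
trampoline, and it falls back to defining __NR_uprobe as 336 just like the
hunk below:

#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_uprobe
#define __NR_uprobe 336		/* x86_64 number added earlier in the series */
#endif

int main(void)
{
	int status;
	pid_t pid = fork();

	if (pid == 0) {
		/* not called from the uprobe trampoline -> kernel sends SIGILL */
		syscall(__NR_uprobe);
		exit(0);		/* should not be reached */
	}

	waitpid(pid, &status, 0);
	return (WIFSIGNALED(status) && WTERMSIG(status) == SIGILL) ? 0 : 1;
}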
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 36 +++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 3d27c8bc019e..02e98cba5cc6 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -735,6 +735,40 @@ static void test_uprobe_race(void)
ASSERT_FALSE(USDT_SEMA_IS_ACTIVE(race), "race_semaphore");
}
+#ifndef __NR_uprobe
+#define __NR_uprobe 336
+#endif
+
+static void test_uprobe_sigill(void)
+{
+ int status, err, pid;
+
+ pid = fork();
+ if (!ASSERT_GE(pid, 0, "fork"))
+ return;
+ /* child */
+ if (pid == 0) {
+ asm volatile (
+ "pushq %rax\n"
+ "pushq %rcx\n"
+ "pushq %r11\n"
+ "movq $" __stringify(__NR_uprobe) ", %rax\n"
+ "syscall\n"
+ "popq %r11\n"
+ "popq %rcx\n"
+ "retq\n"
+ );
+ exit(0);
+ }
+
+ err = waitpid(pid, &status, 0);
+ ASSERT_EQ(err, pid, "waitpid");
+
+ /* verify the child got killed with SIGILL */
+ ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED");
+ ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG");
+}
+
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
@@ -755,6 +789,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_usdt();
if (test__start_subtest("uprobe_race"))
test_uprobe_race();
+ if (test__start_subtest("uprobe_sigill"))
+ test_uprobe_sigill();
}
#else
static void __test_uprobe_syscall(void)
--
2.50.1
* [PATCHv6 perf/core 17/22] selftests/bpf: Add optimized usdt variant for basic usdt test
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (15 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 16/22] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 18/22] selftests/bpf: Add uprobe_regs_equal test Jiri Olsa
` (4 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Adding an optimized usdt variant for the basic usdt test to check that
usdt arguments are properly passed in the optimized code path.
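The uprobe behind the usdt probe is optimized only after its first hit, so
the optimized variant triggers the probe twice and scales the expected
counters accordingly; a minimal sketch of that pattern (probe_site() is a
hypothetical stand-in for the trigger_func() used below):

#include <stdbool.h>

static void probe_site(int x) { (void)x; }	/* hypothetical probed function */

static int trigger(bool optimized, int x)
{
	probe_site(x);			/* first hit may still go through int3 */
	if (optimized)
		probe_site(x);		/* second hit runs via the call trampoline */
	return optimized ? 2 : 1;	/* expected number of handler invocations */
}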
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/prog_tests/usdt.c | 38 ++++++++++++-------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 9057e983cc54..833eb87483a1 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -40,12 +40,19 @@ static void __always_inline trigger_func(int x) {
}
}
-static void subtest_basic_usdt(void)
+static void subtest_basic_usdt(bool optimized)
{
LIBBPF_OPTS(bpf_usdt_opts, opts);
struct test_usdt *skel;
struct test_usdt__bss *bss;
- int err, i;
+ int err, i, called;
+
+#define TRIGGER(x) ({ \
+ trigger_func(x); \
+ if (optimized) \
+ trigger_func(x); \
+ optimized ? 2 : 1; \
+ })
skel = test_usdt__open_and_load();
if (!ASSERT_OK_PTR(skel, "skel_open"))
@@ -66,11 +73,11 @@ static void subtest_basic_usdt(void)
if (!ASSERT_OK_PTR(skel->links.usdt0, "usdt0_link"))
goto cleanup;
- trigger_func(1);
+ called = TRIGGER(1);
- ASSERT_EQ(bss->usdt0_called, 1, "usdt0_called");
- ASSERT_EQ(bss->usdt3_called, 1, "usdt3_called");
- ASSERT_EQ(bss->usdt12_called, 1, "usdt12_called");
+ ASSERT_EQ(bss->usdt0_called, called, "usdt0_called");
+ ASSERT_EQ(bss->usdt3_called, called, "usdt3_called");
+ ASSERT_EQ(bss->usdt12_called, called, "usdt12_called");
ASSERT_EQ(bss->usdt0_cookie, 0xcafedeadbeeffeed, "usdt0_cookie");
ASSERT_EQ(bss->usdt0_arg_cnt, 0, "usdt0_arg_cnt");
@@ -119,11 +126,11 @@ static void subtest_basic_usdt(void)
* bpf_program__attach_usdt() handles this properly and attaches to
* all possible places of USDT invocation.
*/
- trigger_func(2);
+ called += TRIGGER(2);
- ASSERT_EQ(bss->usdt0_called, 2, "usdt0_called");
- ASSERT_EQ(bss->usdt3_called, 2, "usdt3_called");
- ASSERT_EQ(bss->usdt12_called, 2, "usdt12_called");
+ ASSERT_EQ(bss->usdt0_called, called, "usdt0_called");
+ ASSERT_EQ(bss->usdt3_called, called, "usdt3_called");
+ ASSERT_EQ(bss->usdt12_called, called, "usdt12_called");
/* only check values that depend on trigger_func()'s input value */
ASSERT_EQ(bss->usdt3_args[0], 2, "usdt3_arg1");
@@ -142,9 +149,9 @@ static void subtest_basic_usdt(void)
if (!ASSERT_OK_PTR(skel->links.usdt3, "usdt3_reattach"))
goto cleanup;
- trigger_func(3);
+ called += TRIGGER(3);
- ASSERT_EQ(bss->usdt3_called, 3, "usdt3_called");
+ ASSERT_EQ(bss->usdt3_called, called, "usdt3_called");
/* this time usdt3 has custom cookie */
ASSERT_EQ(bss->usdt3_cookie, 0xBADC00C51E, "usdt3_cookie");
ASSERT_EQ(bss->usdt3_arg_cnt, 3, "usdt3_arg_cnt");
@@ -158,6 +165,7 @@ static void subtest_basic_usdt(void)
cleanup:
test_usdt__destroy(skel);
+#undef TRIGGER
}
unsigned short test_usdt_100_semaphore SEC(".probes");
@@ -425,7 +433,11 @@ static void subtest_urandom_usdt(bool auto_attach)
void test_usdt(void)
{
if (test__start_subtest("basic"))
- subtest_basic_usdt();
+ subtest_basic_usdt(false);
+#ifdef __x86_64__
+ if (test__start_subtest("basic_optimized"))
+ subtest_basic_usdt(true);
+#endif
if (test__start_subtest("multispec"))
subtest_multispec_usdt();
if (test__start_subtest("urand_auto_attach"))
--
2.50.1
* [PATCHv6 perf/core 18/22] selftests/bpf: Add uprobe_regs_equal test
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (16 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 17/22] selftests/bpf: Add optimized usdt variant for basic usdt test Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 19/22] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe Jiri Olsa
` (3 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Changing uretprobe_regs_trigger to allow testing both uprobe and
uretprobe and renaming it to uprobe_regs_equal.
We check that both uprobe and uretprobe probes (bpf programs) see the
expected registers, with a few exceptions.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 56 ++++++++++++++-----
.../selftests/bpf/progs/uprobe_syscall.c | 4 +-
2 files changed, 44 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 02e98cba5cc6..36ce9e261b5c 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -22,15 +22,17 @@
#pragma GCC diagnostic ignored "-Wattributes"
-__naked unsigned long uretprobe_regs_trigger(void)
+__attribute__((aligned(16)))
+__nocf_check __weak __naked unsigned long uprobe_regs_trigger(void)
{
asm volatile (
+ ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n" /* nop5 */
"movq $0xdeadbeef, %rax\n"
"ret\n"
);
}
-__naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
+__naked void uprobe_regs(struct pt_regs *before, struct pt_regs *after)
{
asm volatile (
"movq %r15, 0(%rdi)\n"
@@ -51,15 +53,17 @@ __naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
"movq $0, 120(%rdi)\n" /* orig_rax */
"movq $0, 128(%rdi)\n" /* rip */
"movq $0, 136(%rdi)\n" /* cs */
+ "pushq %rax\n"
"pushf\n"
"pop %rax\n"
"movq %rax, 144(%rdi)\n" /* eflags */
+ "pop %rax\n"
"movq %rsp, 152(%rdi)\n" /* rsp */
"movq $0, 160(%rdi)\n" /* ss */
/* save 2nd argument */
"pushq %rsi\n"
- "call uretprobe_regs_trigger\n"
+ "call uprobe_regs_trigger\n"
/* save return value and load 2nd argument pointer to rax */
"pushq %rax\n"
@@ -99,25 +103,37 @@ __naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
);
}
-static void test_uretprobe_regs_equal(void)
+static void test_uprobe_regs_equal(bool retprobe)
{
+ LIBBPF_OPTS(bpf_uprobe_opts, opts,
+ .retprobe = retprobe,
+ );
struct uprobe_syscall *skel = NULL;
struct pt_regs before = {}, after = {};
unsigned long *pb = (unsigned long *) &before;
unsigned long *pa = (unsigned long *) &after;
unsigned long *pp;
+ unsigned long offset;
unsigned int i, cnt;
- int err;
+
+ offset = get_uprobe_offset(&uprobe_regs_trigger);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ return;
skel = uprobe_syscall__open_and_load();
if (!ASSERT_OK_PTR(skel, "uprobe_syscall__open_and_load"))
goto cleanup;
- err = uprobe_syscall__attach(skel);
- if (!ASSERT_OK(err, "uprobe_syscall__attach"))
+ skel->links.probe = bpf_program__attach_uprobe_opts(skel->progs.probe,
+ 0, "/proc/self/exe", offset, &opts);
+ if (!ASSERT_OK_PTR(skel->links.probe, "bpf_program__attach_uprobe_opts"))
goto cleanup;
- uretprobe_regs(&before, &after);
+ /* make sure uprobe gets optimized */
+ if (!retprobe)
+ uprobe_regs_trigger();
+
+ uprobe_regs(&before, &after);
pp = (unsigned long *) &skel->bss->regs;
cnt = sizeof(before)/sizeof(*pb);
@@ -126,7 +142,7 @@ static void test_uretprobe_regs_equal(void)
unsigned int offset = i * sizeof(unsigned long);
/*
- * Check register before and after uretprobe_regs_trigger call
+ * Check register before and after uprobe_regs_trigger call
* that triggers the uretprobe.
*/
switch (offset) {
@@ -140,7 +156,7 @@ static void test_uretprobe_regs_equal(void)
/*
* Check register seen from bpf program and register after
- * uretprobe_regs_trigger call
+ * uprobe_regs_trigger call (with rax exception, check below).
*/
switch (offset) {
/*
@@ -153,6 +169,15 @@ static void test_uretprobe_regs_equal(void)
case offsetof(struct pt_regs, rsp):
case offsetof(struct pt_regs, ss):
break;
+ /*
+ * uprobe does not see the return value in rax, it needs to see the
+ * original (before) rax value
+ */
+ case offsetof(struct pt_regs, rax):
+ if (!retprobe) {
+ ASSERT_EQ(pp[i], pb[i], "uprobe rax prog-before value check");
+ break;
+ }
default:
if (!ASSERT_EQ(pp[i], pa[i], "register prog-after value check"))
fprintf(stdout, "failed register offset %u\n", offset);
@@ -190,13 +215,13 @@ static void test_uretprobe_regs_change(void)
unsigned long cnt = sizeof(before)/sizeof(*pb);
unsigned int i, err, offset;
- offset = get_uprobe_offset(uretprobe_regs_trigger);
+ offset = get_uprobe_offset(uprobe_regs_trigger);
err = write_bpf_testmod_uprobe(offset);
if (!ASSERT_OK(err, "register_uprobe"))
return;
- uretprobe_regs(&before, &after);
+ uprobe_regs(&before, &after);
err = write_bpf_testmod_uprobe(0);
if (!ASSERT_OK(err, "unregister_uprobe"))
@@ -616,7 +641,8 @@ static void test_uretprobe_shadow_stack(void)
/* Run all the tests with shadow stack in place. */
shstk_is_enabled = true;
- test_uretprobe_regs_equal();
+ test_uprobe_regs_equal(false);
+ test_uprobe_regs_equal(true);
test_uretprobe_regs_change();
test_uretprobe_syscall_call();
@@ -772,7 +798,7 @@ static void test_uprobe_sigill(void)
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
- test_uretprobe_regs_equal();
+ test_uprobe_regs_equal(true);
if (test__start_subtest("uretprobe_regs_change"))
test_uretprobe_regs_change();
if (test__start_subtest("uretprobe_syscall_call"))
@@ -791,6 +817,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_race();
if (test__start_subtest("uprobe_sigill"))
test_uprobe_sigill();
+ if (test__start_subtest("uprobe_regs_equal"))
+ test_uprobe_regs_equal(false);
}
#else
static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall.c b/tools/testing/selftests/bpf/progs/uprobe_syscall.c
index 8a4fa6c7ef59..e08c31669e5a 100644
--- a/tools/testing/selftests/bpf/progs/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall.c
@@ -7,8 +7,8 @@ struct pt_regs regs;
char _license[] SEC("license") = "GPL";
-SEC("uretprobe//proc/self/exe:uretprobe_regs_trigger")
-int uretprobe(struct pt_regs *ctx)
+SEC("uprobe")
+int probe(struct pt_regs *ctx)
{
__builtin_memcpy(&regs, ctx, sizeof(regs));
return 0;
--
2.50.1
* [PATCHv6 perf/core 19/22] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (17 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 18/22] selftests/bpf: Add uprobe_regs_equal test Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 20/22] seccomp: passthrough uprobe systemcall without filtering Jiri Olsa
` (2 subsequent siblings)
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Changing the test_uretprobe_regs_change test to cover both uprobe and
uretprobe by adding an entry consumer handler to the testmod and
making it change one of the registers.
Making sure that the values changed by both the uprobe and uretprobe
handlers propagate to user space.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 12 ++++++++----
tools/testing/selftests/bpf/test_kmods/bpf_testmod.c | 11 +++++++++--
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 36ce9e261b5c..c1f945cacebc 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -207,7 +207,7 @@ static int write_bpf_testmod_uprobe(unsigned long offset)
return ret != n ? (int) ret : 0;
}
-static void test_uretprobe_regs_change(void)
+static void test_regs_change(void)
{
struct pt_regs before = {}, after = {};
unsigned long *pb = (unsigned long *) &before;
@@ -221,6 +221,9 @@ static void test_uretprobe_regs_change(void)
if (!ASSERT_OK(err, "register_uprobe"))
return;
+ /* make sure uprobe gets optimized */
+ uprobe_regs_trigger();
+
uprobe_regs(&before, &after);
err = write_bpf_testmod_uprobe(0);
@@ -643,7 +646,6 @@ static void test_uretprobe_shadow_stack(void)
test_uprobe_regs_equal(false);
test_uprobe_regs_equal(true);
- test_uretprobe_regs_change();
test_uretprobe_syscall_call();
test_uprobe_legacy();
@@ -651,6 +653,8 @@ static void test_uretprobe_shadow_stack(void)
test_uprobe_session();
test_uprobe_usdt();
+ test_regs_change();
+
shstk_is_enabled = false;
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
@@ -799,8 +803,6 @@ static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
test_uprobe_regs_equal(true);
- if (test__start_subtest("uretprobe_regs_change"))
- test_uretprobe_regs_change();
if (test__start_subtest("uretprobe_syscall_call"))
test_uretprobe_syscall_call();
if (test__start_subtest("uretprobe_shadow_stack"))
@@ -819,6 +821,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_sigill();
if (test__start_subtest("uprobe_regs_equal"))
test_uprobe_regs_equal(false);
+ if (test__start_subtest("regs_change"))
+ test_regs_change();
}
#else
static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
index e9e918cdf31f..511911053bdc 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
@@ -500,15 +500,21 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
*/
#ifdef __x86_64__
+static int
+uprobe_handler(struct uprobe_consumer *self, struct pt_regs *regs, __u64 *data)
+{
+ regs->cx = 0x87654321feebdaed;
+ return 0;
+}
+
static int
uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
struct pt_regs *regs, __u64 *data)
{
regs->ax = 0x12345678deadbeef;
- regs->cx = 0x87654321feebdaed;
regs->r11 = (u64) -1;
- return true;
+ return 0;
}
struct testmod_uprobe {
@@ -520,6 +526,7 @@ struct testmod_uprobe {
static DEFINE_MUTEX(testmod_uprobe_mutex);
static struct testmod_uprobe uprobe = {
+ .consumer.handler = uprobe_handler,
.consumer.ret_handler = uprobe_ret_handler,
};
--
2.50.1
* [PATCHv6 perf/core 20/22] seccomp: passthrough uprobe systemcall without filtering
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (18 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 19/22] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 21/22] selftests/seccomp: validate uprobe syscall passes through seccomp Jiri Olsa
2025-07-20 11:21 ` [PATCHv5 22/22] man2: Add uprobe syscall page Jiri Olsa
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Kees Cook, Eyal Birger, Kees Cook, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
Adding uprobe as another exception to the seccomp filter alongside
the uretprobe syscall.
Same as uretprobe, the uprobe syscall is installed by the kernel as a
replacement for the breakpoint exception, is limited to the x86_64
arch and isn't expected to ever be supported on i386.
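Without this exception, a restrictive filter would have to whitelist the new
syscall explicitly for optimized uprobes to keep working in the filtered
task; a minimal classic-BPF sketch of such an entry (mirroring the filters
used by the seccomp selftest later in the series) is below. With the
kernel-side passthrough the uprobe syscall is allowed without consulting the
filter, and the companion change to mode1_syscalls covers
SECCOMP_SET_MODE_STRICT the same way.

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#ifndef __NR_uprobe
#define __NR_uprobe 336
#endif

/* allow the uprobe syscall, fall through to the rest of the policy otherwise */
static struct sock_filter filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_uprobe, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* ... the actual (restrictive) policy would continue here ... */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};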
Cc: Kees Cook <keescook@chromium.org>
Cc: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
kernel/seccomp.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 41aa761c7738..7daf2da09e8e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -741,6 +741,26 @@ seccomp_prepare_user_filter(const char __user *user_filter)
}
#ifdef SECCOMP_ARCH_NATIVE
+static bool seccomp_uprobe_exception(struct seccomp_data *sd)
+{
+#if defined __NR_uretprobe || defined __NR_uprobe
+#ifdef SECCOMP_ARCH_COMPAT
+ if (sd->arch == SECCOMP_ARCH_NATIVE)
+#endif
+ {
+#ifdef __NR_uretprobe
+ if (sd->nr == __NR_uretprobe)
+ return true;
+#endif
+#ifdef __NR_uprobe
+ if (sd->nr == __NR_uprobe)
+ return true;
+#endif
+ }
+#endif
+ return false;
+}
+
/**
* seccomp_is_const_allow - check if filter is constant allow with given data
* @fprog: The BPF programs
@@ -758,13 +778,8 @@ static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
return false;
/* Our single exception to filtering. */
-#ifdef __NR_uretprobe
-#ifdef SECCOMP_ARCH_COMPAT
- if (sd->arch == SECCOMP_ARCH_NATIVE)
-#endif
- if (sd->nr == __NR_uretprobe)
- return true;
-#endif
+ if (seccomp_uprobe_exception(sd))
+ return true;
for (pc = 0; pc < fprog->len; pc++) {
struct sock_filter *insn = &fprog->filter[pc];
@@ -1042,6 +1057,9 @@ static const int mode1_syscalls[] = {
__NR_seccomp_read, __NR_seccomp_write, __NR_seccomp_exit, __NR_seccomp_sigreturn,
#ifdef __NR_uretprobe
__NR_uretprobe,
+#endif
+#ifdef __NR_uprobe
+ __NR_uprobe,
#endif
-1, /* negative terminated */
};
--
2.50.1
* [PATCHv6 perf/core 21/22] selftests/seccomp: validate uprobe syscall passes through seccomp
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (19 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 20/22] seccomp: passthrough uprobe systemcall without filtering Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv5 22/22] man2: Add uprobe syscall page Jiri Olsa
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Kees Cook, Eyal Birger, Kees Cook, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
Adding uprobe checks into the current uretprobe tests.
All the related tests are now executed with a uprobe or uretprobe
attached, or without any probe.
Renaming the test fixture to UPROBE, because that name fits better now.
Cc: Kees Cook <keescook@chromium.org>
Cc: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 107 ++++++++++++++----
1 file changed, 86 insertions(+), 21 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..2cf6fc825d86 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -73,6 +73,14 @@
#define noinline __attribute__((noinline))
#endif
+#ifndef __nocf_check
+#define __nocf_check __attribute__((nocf_check))
+#endif
+
+#ifndef __naked
+#define __naked __attribute__((__naked__))
+#endif
+
#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38
#define PR_GET_NO_NEW_PRIVS 39
@@ -4896,7 +4904,36 @@ TEST(tsync_vs_dead_thread_leader)
EXPECT_EQ(0, status);
}
-noinline int probed(void)
+#ifdef __x86_64__
+
+/*
+ * We need a naked probed_uprobe function. Using __nocf_check
+ * to skip the possible endbr64 instruction and ignoring
+ * -Wattributes, otherwise the compilation might fail.
+ */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wattributes"
+
+__naked __nocf_check noinline int probed_uprobe(void)
+{
+ /*
+ * Optimized uprobe is possible only on top of nop5 instruction.
+ */
+ asm volatile (" \n"
+ ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00 \n"
+ "ret \n"
+ );
+}
+#pragma GCC diagnostic pop
+
+#else
+noinline int probed_uprobe(void)
+{
+ return 1;
+}
+#endif
+
+noinline int probed_uretprobe(void)
{
return 1;
}
@@ -4949,35 +4986,46 @@ static ssize_t get_uprobe_offset(const void *addr)
return found ? (uintptr_t)addr - start + base : -1;
}
-FIXTURE(URETPROBE) {
+FIXTURE(UPROBE) {
int fd;
};
-FIXTURE_VARIANT(URETPROBE) {
+FIXTURE_VARIANT(UPROBE) {
/*
- * All of the URETPROBE behaviors can be tested with either
- * uretprobe attached or not
+ * All of the U(RET)PROBE behaviors can be tested with either
+ * u(ret)probe attached or not
*/
bool attach;
+ /*
+ * Test both uprobe and uretprobe.
+ */
+ bool uretprobe;
};
-FIXTURE_VARIANT_ADD(URETPROBE, attached) {
+FIXTURE_VARIANT_ADD(UPROBE, not_attached) {
+ .attach = false,
+ .uretprobe = false,
+};
+
+FIXTURE_VARIANT_ADD(UPROBE, uprobe_attached) {
.attach = true,
+ .uretprobe = false,
};
-FIXTURE_VARIANT_ADD(URETPROBE, not_attached) {
- .attach = false,
+FIXTURE_VARIANT_ADD(UPROBE, uretprobe_attached) {
+ .attach = true,
+ .uretprobe = true,
};
-FIXTURE_SETUP(URETPROBE)
+FIXTURE_SETUP(UPROBE)
{
const size_t attr_sz = sizeof(struct perf_event_attr);
struct perf_event_attr attr;
ssize_t offset;
int type, bit;
-#ifndef __NR_uretprobe
- SKIP(return, "__NR_uretprobe syscall not defined");
+#if !defined(__NR_uprobe) || !defined(__NR_uretprobe)
+ SKIP(return, "__NR_uprobe or __NR_uretprobe syscalls not defined");
#endif
if (!variant->attach)
@@ -4987,12 +5035,17 @@ FIXTURE_SETUP(URETPROBE)
type = determine_uprobe_perf_type();
ASSERT_GE(type, 0);
- bit = determine_uprobe_retprobe_bit();
- ASSERT_GE(bit, 0);
- offset = get_uprobe_offset(probed);
+
+ if (variant->uretprobe) {
+ bit = determine_uprobe_retprobe_bit();
+ ASSERT_GE(bit, 0);
+ }
+
+ offset = get_uprobe_offset(variant->uretprobe ? probed_uretprobe : probed_uprobe);
ASSERT_GE(offset, 0);
- attr.config |= 1 << bit;
+ if (variant->uretprobe)
+ attr.config |= 1 << bit;
attr.size = attr_sz;
attr.type = type;
attr.config1 = ptr_to_u64("/proc/self/exe");
@@ -5003,7 +5056,7 @@ FIXTURE_SETUP(URETPROBE)
PERF_FLAG_FD_CLOEXEC);
}
-FIXTURE_TEARDOWN(URETPROBE)
+FIXTURE_TEARDOWN(UPROBE)
{
/* we could call close(self->fd), but we'd need extra filter for
* that and since we are calling _exit right away..
@@ -5017,11 +5070,17 @@ static int run_probed_with_filter(struct sock_fprog *prog)
return -1;
}
- probed();
+ /*
+ * Uprobe is optimized after the first hit, so let's hit twice.
+ */
+ probed_uprobe();
+ probed_uprobe();
+
+ probed_uretprobe();
return 0;
}
-TEST_F(URETPROBE, uretprobe_default_allow)
+TEST_F(UPROBE, uprobe_default_allow)
{
struct sock_filter filter[] = {
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
@@ -5034,7 +5093,7 @@ TEST_F(URETPROBE, uretprobe_default_allow)
ASSERT_EQ(0, run_probed_with_filter(&prog));
}
-TEST_F(URETPROBE, uretprobe_default_block)
+TEST_F(UPROBE, uprobe_default_block)
{
struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
@@ -5051,11 +5110,14 @@ TEST_F(URETPROBE, uretprobe_default_block)
ASSERT_EQ(0, run_probed_with_filter(&prog));
}
-TEST_F(URETPROBE, uretprobe_block_uretprobe_syscall)
+TEST_F(UPROBE, uprobe_block_syscall)
{
struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
offsetof(struct seccomp_data, nr)),
+#ifdef __NR_uprobe
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_uprobe, 1, 2),
+#endif
#ifdef __NR_uretprobe
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_uretprobe, 0, 1),
#endif
@@ -5070,11 +5132,14 @@ TEST_F(URETPROBE, uretprobe_block_uretprobe_syscall)
ASSERT_EQ(0, run_probed_with_filter(&prog));
}
-TEST_F(URETPROBE, uretprobe_default_block_with_uretprobe_syscall)
+TEST_F(UPROBE, uprobe_default_block_with_syscall)
{
struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
offsetof(struct seccomp_data, nr)),
+#ifdef __NR_uprobe
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_uprobe, 3, 0),
+#endif
#ifdef __NR_uretprobe
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_uretprobe, 2, 0),
#endif
--
2.50.1
* [PATCHv5 22/22] man2: Add uprobe syscall page
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (20 preceding siblings ...)
2025-07-20 11:21 ` [PATCHv6 perf/core 21/22] selftests/seccomp: validate uprobe syscall passes through seccomp Jiri Olsa
@ 2025-07-20 11:21 ` Jiri Olsa
21 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-20 11:21 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: Alejandro Colomar, Masami Hiramatsu (Google), bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
Changing the uretprobe syscall man page to be shared with the new
uprobe syscall man page.
Cc: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
man/man2/uprobe.2 | 1 +
man/man2/uretprobe.2 | 36 ++++++++++++++++++++++++------------
2 files changed, 25 insertions(+), 12 deletions(-)
create mode 100644 man/man2/uprobe.2
diff --git a/man/man2/uprobe.2 b/man/man2/uprobe.2
new file mode 100644
index 000000000000..ea5ccf901591
--- /dev/null
+++ b/man/man2/uprobe.2
@@ -0,0 +1 @@
+.so man2/uretprobe.2
diff --git a/man/man2/uretprobe.2 b/man/man2/uretprobe.2
index bbbfb0c59335..df0e5d92e5ed 100644
--- a/man/man2/uretprobe.2
+++ b/man/man2/uretprobe.2
@@ -2,22 +2,28 @@
.\"
.\" SPDX-License-Identifier: Linux-man-pages-copyleft
.\"
-.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
+.TH uprobe 2 (date) "Linux man-pages (unreleased)"
.SH NAME
+uprobe,
uretprobe
\-
-execute pending return uprobes
+execute pending entry or return uprobes
.SH SYNOPSIS
.nf
+.B int uprobe(void);
.B int uretprobe(void);
.fi
.SH DESCRIPTION
+.BR uprobe ()
+is an alternative to breakpoint instructions
+for triggering entry uprobe consumers.
+.P
.BR uretprobe ()
is an alternative to breakpoint instructions
for triggering return uprobe consumers.
.P
Calls to
-.BR uretprobe ()
+these system calls
are only made from the user-space trampoline provided by the kernel.
Calls from any other place result in a
.BR SIGILL .
@@ -26,22 +32,28 @@ The return value is architecture-specific.
.SH ERRORS
.TP
.B SIGILL
-.BR uretprobe ()
-was called by a user-space program.
+These system calls
+were called by a user-space program.
.SH VERSIONS
The behavior varies across systems.
.SH STANDARDS
None.
.SH HISTORY
+.TP
+.BR uprobe ()
+TBD
+.TP
+.BR uretprobe ()
Linux 6.11.
.P
-.BR uretprobe ()
-was initially introduced for the x86_64 architecture
-where it was shown to be faster than breakpoint traps.
-It might be extended to other architectures.
+These system calls
+were initially introduced for the x86_64 architecture
+where they were shown to be faster than breakpoint traps.
+They might be extended to other architectures.
.SH CAVEATS
-.BR uretprobe ()
-exists only to allow the invocation of return uprobe consumers.
-It should
+These system calls
+exist only to allow the invocation of
+entry or return uprobe consumers.
+They should
.B never
be called directly.
--
2.49.0
* Re: [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2025-07-20 11:38 ` Oleg Nesterov
2025-07-25 10:11 ` Masami Hiramatsu
2025-09-03 18:24 ` Andrii Nakryiko
2 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2025-07-20 11:38 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On 07/20, Jiri Olsa wrote:
>
> Adding new uprobe syscall that calls uprobe handlers for given
> 'breakpoint' address.
>
> The idea is that the 'breakpoint' address calls the user space
> trampoline which executes the uprobe syscall.
>
> The syscall handler reads the return address of the initial call
> to retrieve the original 'breakpoint' address. With this address
> we find the related uprobe object and call its consumers.
>
> Adding the arch_uprobe_trampoline_mapping function that provides
> uprobe trampoline mapping. This mapping is backed with one global
> page initialized at __init time and shared by the all the mapping
> instances.
>
> We do not allow to execute uprobe syscall if the caller is not
> from uprobe trampoline mapping.
>
> The uprobe syscall ensures the consumer (bpf program) sees registers
> values in the state before the trampoline was called.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
My ack still stands,
Acked-by: Oleg Nesterov <oleg@redhat.com>
* Re: [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2025-07-20 11:38 ` Oleg Nesterov
@ 2025-07-25 10:11 ` Masami Hiramatsu
2025-09-03 18:24 ` Andrii Nakryiko
2 siblings, 0 replies; 44+ messages in thread
From: Masami Hiramatsu @ 2025-07-25 10:11 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Sun, 20 Jul 2025 13:21:19 +0200
Jiri Olsa <jolsa@kernel.org> wrote:
> Adding new uprobe syscall that calls uprobe handlers for given
> 'breakpoint' address.
>
> The idea is that the 'breakpoint' address calls the user space
> trampoline which executes the uprobe syscall.
>
> The syscall handler reads the return address of the initial call
> to retrieve the original 'breakpoint' address. With this address
> we find the related uprobe object and call its consumers.
>
> Adding the arch_uprobe_trampoline_mapping function that provides
> uprobe trampoline mapping. This mapping is backed with one global
> page initialized at __init time and shared by all the mapping
> instances.
>
> We do not allow to execute uprobe syscall if the caller is not
> from uprobe trampoline mapping.
>
> The uprobe syscall ensures the consumer (bpf program) sees registers
> values in the state before the trampoline was called.
>
Looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks!
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/kernel/uprobes.c | 139 +++++++++++++++++++++++++
> include/linux/syscalls.h | 2 +
> include/linux/uprobes.h | 1 +
> kernel/events/uprobes.c | 17 +++
> kernel/sys_ni.c | 1 +
> 6 files changed, 161 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index cfb5ca41e30d..9fd1291e7bdf 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -345,6 +345,7 @@
> 333 common io_pgetevents sys_io_pgetevents
> 334 common rseq sys_rseq
> 335 common uretprobe sys_uretprobe
> +336 common uprobe sys_uprobe
> # don't use numbers 387 through 423, add new calls after the last
> # 'common' entry
> 424 common pidfd_send_signal sys_pidfd_send_signal
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 6c4dcbdd0c3c..d18e1ae59901 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -752,6 +752,145 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
> destroy_uprobe_trampoline(tramp);
> }
> +
> +static bool __in_uprobe_trampoline(unsigned long ip)
> +{
> + struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +
> + return vma && vma_is_special_mapping(vma, &tramp_mapping);
> +}
> +
> +static bool in_uprobe_trampoline(unsigned long ip)
> +{
> + struct mm_struct *mm = current->mm;
> + bool found, retry = true;
> + unsigned int seq;
> +
> + rcu_read_lock();
> + if (mmap_lock_speculate_try_begin(mm, &seq)) {
> + found = __in_uprobe_trampoline(ip);
> + retry = mmap_lock_speculate_retry(mm, seq);
> + }
> + rcu_read_unlock();
> +
> + if (retry) {
> + mmap_read_lock(mm);
> + found = __in_uprobe_trampoline(ip);
> + mmap_read_unlock(mm);
> + }
> + return found;
> +}
> +
> +/*
> + * See uprobe syscall trampoline; the call to the trampoline will push
> + * the return address on the stack, the trampoline itself then pushes
> + * cx, r11 and ax.
> + */
> +struct uprobe_syscall_args {
> + unsigned long ax;
> + unsigned long r11;
> + unsigned long cx;
> + unsigned long retaddr;
> +};
> +
> +SYSCALL_DEFINE0(uprobe)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> + struct uprobe_syscall_args args;
> + unsigned long ip, sp;
> + int err;
> +
> + /* Allow execution only from uprobe trampolines. */
> + if (!in_uprobe_trampoline(regs->ip))
> + goto sigill;
> +
> + err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> + if (err)
> + goto sigill;
> +
> + ip = regs->ip;
> +
> + /*
> + * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> + * - adjust ip to the probe address, call saved next instruction address
> + * - adjust sp to the probe's stack frame (check trampoline code)
> + */
> + regs->ax = args.ax;
> + regs->r11 = args.r11;
> + regs->cx = args.cx;
> + regs->ip = args.retaddr - 5;
> + regs->sp += sizeof(args);
> + regs->orig_ax = -1;
> +
> + sp = regs->sp;
> +
> + handle_syscall_uprobe(regs, regs->ip);
> +
> + /*
> + * Some of the uprobe consumers has changed sp, we can do nothing,
> + * just return via iret.
> + */
> + if (regs->sp != sp) {
> + /* skip the trampoline call */
> + if (args.retaddr - 5 == regs->ip)
> + regs->ip += 5;
> + return regs->ax;
> + }
> +
> + regs->sp -= sizeof(args);
> +
> + /* for the case uprobe_consumer has changed ax/r11/cx */
> + args.ax = regs->ax;
> + args.r11 = regs->r11;
> + args.cx = regs->cx;
> +
> + /* keep return address unless we are instructed otherwise */
> + if (args.retaddr - 5 != regs->ip)
> + args.retaddr = regs->ip;
> +
> + regs->ip = ip;
> +
> + err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> + if (err)
> + goto sigill;
> +
> + /* ensure sysret, see do_syscall_64() */
> + regs->r11 = regs->flags;
> + regs->cx = regs->ip;
> + return 0;
> +
> +sigill:
> + force_sig(SIGILL);
> + return -1;
> +}
> +
> +asm (
> + ".pushsection .rodata\n"
> + ".balign " __stringify(PAGE_SIZE) "\n"
> + "uprobe_trampoline_entry:\n"
> + "push %rcx\n"
> + "push %r11\n"
> + "push %rax\n"
> + "movq $" __stringify(__NR_uprobe) ", %rax\n"
> + "syscall\n"
> + "pop %rax\n"
> + "pop %r11\n"
> + "pop %rcx\n"
> + "ret\n"
> + ".balign " __stringify(PAGE_SIZE) "\n"
> + ".popsection\n"
> +);
> +
> +extern u8 uprobe_trampoline_entry[];
> +
> +static int __init arch_uprobes_init(void)
> +{
> + tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> + return 0;
> +}
> +
> +late_initcall(arch_uprobes_init);
> +
> #else /* 32-bit: */
> /*
> * No RIP-relative addressing on 32-bit
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e5603cc91963..b0cc60f1c458 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -998,6 +998,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
>
> asmlinkage long sys_uretprobe(void);
>
> +asmlinkage long sys_uprobe(void);
> +
> /* pciconfig: alpha, arm, arm64, ia64, sparc */
> asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
> unsigned long off, unsigned long len,
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b40d33aae016..b6b077cc7d0f 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -239,6 +239,7 @@ extern unsigned long uprobe_get_trampoline_vaddr(void);
> extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> extern void arch_uprobe_clear_state(struct mm_struct *mm);
> extern void arch_uprobe_init_state(struct mm_struct *mm);
> +extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index acec91a676b7..cbba31c0495f 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2772,6 +2772,23 @@ static void handle_swbp(struct pt_regs *regs)
> rcu_read_unlock_trace();
> }
>
> +void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
> +{
> + struct uprobe *uprobe;
> + int is_swbp;
> +
> + guard(rcu_tasks_trace)();
> +
> + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
> + if (!uprobe)
> + return;
> + if (!get_utask())
> + return;
> + if (arch_uprobe_ignore(&uprobe->arch, regs))
> + return;
> + handler_chain(uprobe, regs);
> +}
> +
> /*
> * Perform required fix-ups and disable singlestep.
> * Allow pending signals to take effect.
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index c00a86931f8c..bf5d05c635ff 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
> COND_SYSCALL(rseq);
>
> COND_SYSCALL(uretprobe);
> +COND_SYSCALL(uprobe);
> --
> 2.50.1
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-20 11:21 ` [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2025-07-25 10:13 ` Masami Hiramatsu
2025-07-28 21:34 ` Jiri Olsa
2025-08-19 19:15 ` Peter Zijlstra
1 sibling, 1 reply; 44+ messages in thread
From: Masami Hiramatsu @ 2025-07-25 10:13 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Sun, 20 Jul 2025 13:21:20 +0200
Jiri Olsa <jolsa@kernel.org> wrote:
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
>
> The current uprobe execution goes through following:
>
> - installs breakpoint instruction over original instruction
> - exception handler hit and calls related uprobe consumers
> - and either simulates original instruction or does out of line single step
> execution of it
> - returns to user space
>
> The optimized uprobe path does following:
>
> - checks the original instruction is 5-byte nop (plus other checks)
> - adds (or uses existing) user space trampoline with uprobe syscall
> - overwrites original instruction (5-byte nop) with call to user space
> trampoline
> - the user space trampoline executes uprobe syscall that calls related uprobe
> consumers
> - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> original instruction, but we plan to use nop5 as USDT probe instruction
> (which currently uses single byte nop) and speed up the USDT probes.
>
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> first uprobe hit. I originally had it called on uprobe installation but then
> it clashed with elf loader, because the user space trampoline was added in a
> place where loader might need to put elf segments, so I decided to do it after
> first uprobe hit when loading is done.
>
> The uprobe is un-optimized in arch specific set_orig_insn call.
>
> The instruction overwrite is x86 arch specific and needs to go through 3 updates:
> (on top of nop5 instruction)
>
> - write int3 into 1st byte
> - write last 4 bytes of the call instruction
> - update the call instruction opcode
>
> And cleanup goes though similar reverse stages:
>
> - overwrite call opcode with breakpoint (int3)
> - write last 4 bytes of the nop5 instruction
> - write the nop5 first instruction byte
>
> We do not unmap and release uprobe trampoline when it's no longer needed,
> because there's no easy way to make sure none of the threads is still
> inside the trampoline. But we do not waste memory, because there's just
> single page for all the uprobe trampoline mappings.
>
> We do waste frame on page mapping for every 4GB by keeping the uprobe
> trampoline page mapped, but that seems ok.
>
> We take the benefit from the fact that set_swbp and set_orig_insn are
> called under mmap_write_lock(mm), so we can use the current instruction
> as the state the uprobe is in - nop5/breakpoint/call trampoline -
> and decide the needed action (optimize/un-optimize) based on that.
>
> Attaching the speed up from benchs/run_bench_uprobes.sh script:
>
> current:
> usermode-count : 152.604 ± 0.044M/s
> syscall-count : 13.359 ± 0.042M/s
> --> uprobe-nop : 3.229 ± 0.002M/s
> uprobe-push : 3.086 ± 0.004M/s
> uprobe-ret : 1.114 ± 0.004M/s
> uprobe-nop5 : 1.121 ± 0.005M/s
> uretprobe-nop : 2.145 ± 0.002M/s
> uretprobe-push : 2.070 ± 0.001M/s
> uretprobe-ret : 0.931 ± 0.001M/s
> uretprobe-nop5 : 0.957 ± 0.001M/s
>
> after the change:
> usermode-count : 152.448 ± 0.244M/s
> syscall-count : 14.321 ± 0.059M/s
> uprobe-nop : 3.148 ± 0.007M/s
> uprobe-push : 2.976 ± 0.004M/s
> uprobe-ret : 1.068 ± 0.003M/s
> --> uprobe-nop5 : 7.038 ± 0.007M/s
> uretprobe-nop : 2.109 ± 0.004M/s
> uretprobe-push : 2.035 ± 0.001M/s
> uretprobe-ret : 0.908 ± 0.001M/s
> uretprobe-nop5 : 3.377 ± 0.009M/s
>
> I see bit more speed up on Intel (above) compared to AMD. The big nop5
> speed up is partly due to emulating nop5 and partly due to optimization.
>
> The key speed up we do this for is the USDT switch from nop to nop5:
> uprobe-nop : 3.148 ± 0.007M/s
> uprobe-nop5 : 7.038 ± 0.007M/s
>
This also looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks!
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> arch/x86/include/asm/uprobes.h | 7 +
> arch/x86/kernel/uprobes.c | 283 ++++++++++++++++++++++++++++++++-
> include/linux/uprobes.h | 6 +-
> kernel/events/uprobes.c | 16 +-
> 4 files changed, 305 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..1ee2e5115955 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
> #define UPROBE_SWBP_INSN 0xcc
> #define UPROBE_SWBP_INSN_SIZE 1
>
> +enum {
> + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> + ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
> +};
> +
> struct uprobe_xol_ops;
>
> struct arch_uprobe {
> @@ -45,6 +50,8 @@ struct arch_uprobe {
> u8 ilen;
> } push;
> };
> +
> + unsigned long flags;
> };
>
> struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index d18e1ae59901..209ce74ab93f 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
> #include <asm/processor.h>
> #include <asm/insn.h>
> #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>
> /* Post-execution fixups. */
>
> @@ -702,7 +703,6 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> return tramp;
> }
>
> -__maybe_unused
> static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> {
> struct uprobes_state *state = &current->mm->uprobes_state;
> @@ -891,6 +891,280 @@ static int __init arch_uprobes_init(void)
>
> late_initcall(arch_uprobes_init);
>
> +enum {
> + EXPECT_SWBP,
> + EXPECT_CALL,
> +};
> +
> +struct write_opcode_ctx {
> + unsigned long base;
> + int expect;
> +};
> +
> +static int is_call_insn(uprobe_opcode_t *insn)
> +{
> + return *insn == CALL_INSN_OPCODE;
> +}
> +
> +/*
> + * Verification callback used by int3_update uprobe_write calls to make sure
> + * the underlying instruction is as expected - either int3 or call.
> + */
> +static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> + int nbytes, void *data)
> +{
> + struct write_opcode_ctx *ctx = data;
> + uprobe_opcode_t old_opcode[5];
> +
> + uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
> +
> + switch (ctx->expect) {
> + case EXPECT_SWBP:
> + if (is_swbp_insn(&old_opcode[0]))
> + return 1;
> + break;
> + case EXPECT_CALL:
> + if (is_call_insn(&old_opcode[0]))
> + return 1;
> + break;
> + }
> +
> + return -1;
> +}
> +
> +/*
> + * Modify multi-byte instructions by using INT3 breakpoints on SMP.
> + * We completely avoid using stop_machine() here, and achieve the
> + * synchronization using INT3 breakpoints and SMP cross-calls.
> + * (borrowed comment from smp_text_poke_batch_finish)
> + *
> + * The way it is done:
> + * - Add an INT3 trap to the address that will be patched
> + * - SMP sync all CPUs
> + * - Update all but the first byte of the patched range
> + * - SMP sync all CPUs
> + * - Replace the first byte (INT3) by the first byte of the replacing opcode
> + * - SMP sync all CPUs
> + */
> +static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr, char *insn, bool optimize)
> +{
> + uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
> + struct write_opcode_ctx ctx = {
> + .base = vaddr,
> + };
> + int err;
> +
> + /*
> + * Write int3 trap.
> + *
> + * The swbp_optimize path comes with breakpoint already installed,
> + * so we can skip this step for optimize == true.
> + */
> + if (!optimize) {
> + ctx.expect = EXPECT_CALL;
> + err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
> + true /* is_register */, false /* do_update_ref_ctr */,
> + &ctx);
> + if (err)
> + return err;
> + }
> +
> + smp_text_poke_sync_each_cpu();
> +
> + /* Write all but the first byte of the patched range. */
> + ctx.expect = EXPECT_SWBP;
> + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
> + true /* is_register */, false /* do_update_ref_ctr */,
> + &ctx);
> + if (err)
> + return err;
> +
> + smp_text_poke_sync_each_cpu();
> +
> + /*
> + * Write first byte.
> + *
> + * The swbp_unoptimize needs to finish uprobe removal together
> + * with ref_ctr update, using uprobe_write with proper flags.
> + */
> + err = uprobe_write(auprobe, vma, vaddr, insn, 1, verify_insn,
> + optimize /* is_register */, !optimize /* do_update_ref_ctr */,
> + &ctx);
> + if (err)
> + return err;
> +
> + smp_text_poke_sync_each_cpu();
> + return 0;
> +}
> +
> +static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr, unsigned long tramp)
> +{
> + u8 call[5];
> +
> + __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> + (const void *) tramp, CALL_INSN_SIZE);
> + return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> +}
> +
> +static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
> +}
> +
> +static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
> +{
> + unsigned int gup_flags = FOLL_FORCE|FOLL_SPLIT_PMD;
> + struct vm_area_struct *vma;
> + struct page *page;
> +
> + page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
> + if (IS_ERR(page))
> + return PTR_ERR(page);
> + uprobe_copy_from_page(page, vaddr, dst, len);
> + put_page(page);
> + return 0;
> +}
> +
> +static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> +{
> + struct __packed __arch_relative_insn {
> + u8 op;
> + s32 raddr;
> + } *call = (struct __arch_relative_insn *) insn;
> +
> + if (!is_call_insn(insn))
> + return false;
> + return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> +}
> +
> +static int is_optimized(struct mm_struct *mm, unsigned long vaddr, bool *optimized)
> +{
> + uprobe_opcode_t insn[5];
> + int err;
> +
> + err = copy_from_vaddr(mm, vaddr, &insn, 5);
> + if (err)
> + return err;
> + *optimized = __is_optimized((uprobe_opcode_t *)&insn, vaddr);
> + return 0;
> +}
> +
> +static bool should_optimize(struct arch_uprobe *auprobe)
> +{
> + return !test_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags) &&
> + test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +}
> +
> +int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + if (should_optimize(auprobe)) {
> + bool optimized = false;
> + int err;
> +
> + /*
> + * We could race with another thread that already optimized the probe,
> + * so let's not overwrite it with int3 again in this case.
> + */
> + err = is_optimized(vma->vm_mm, vaddr, &optimized);
> + if (err)
> + return err;
> + if (optimized)
> + return 0;
> + }
> + return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN,
> + true /* is_register */);
> +}
> +
> +int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> + struct mm_struct *mm = vma->vm_mm;
> + bool optimized = false;
> + int err;
> +
> + err = is_optimized(mm, vaddr, &optimized);
> + if (err)
> + return err;
> + if (optimized) {
> + err = swbp_unoptimize(auprobe, vma, vaddr);
> + WARN_ON_ONCE(err);
> + return err;
> + }
> + }
> + return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
> + false /* is_register */);
> +}
> +
> +static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
> + unsigned long vaddr)
> +{
> + struct uprobe_trampoline *tramp;
> + struct vm_area_struct *vma;
> + bool new = false;
> + int err = 0;
> +
> + vma = find_vma(mm, vaddr);
> + if (!vma)
> + return -EINVAL;
> + tramp = get_uprobe_trampoline(vaddr, &new);
> + if (!tramp)
> + return -EINVAL;
> + err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> + if (WARN_ON_ONCE(err) && new)
> + destroy_uprobe_trampoline(tramp);
> + return err;
> +}
> +
> +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + struct mm_struct *mm = current->mm;
> + uprobe_opcode_t insn[5];
> +
> + /*
> + * Do not optimize if shadow stack is enabled, the return address hijack
> + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> + * the entry uprobe is optimized and the shadow stack crashes the app.
> + */
> + if (shstk_is_enabled())
> + return;
> +
> + if (!should_optimize(auprobe))
> + return;
> +
> + mmap_write_lock(mm);
> +
> + /*
> + * Check if some other thread already optimized the uprobe for us,
> + * if it's the case just go away silently.
> + */
> + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> + goto unlock;
> + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> + goto unlock;
> +
> + /*
> + * If we fail to optimize the uprobe we set the fail bit so the
> + * above should_optimize will fail from now on.
> + */
> + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> +
> +unlock:
> + mmap_write_unlock(mm);
> +}
> +
> +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + if (memcmp(&auprobe->insn, x86_nops[5], 5))
> + return false;
> + /* We can't do cross page atomic writes yet. */
> + return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
> +}
> #else /* 32-bit: */
> /*
> * No RIP-relative addressing on 32-bit
> @@ -904,6 +1178,10 @@ static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {
> }
> +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + return false;
> +}
> #endif /* CONFIG_X86_64 */
>
> struct uprobe_xol_ops {
> @@ -1270,6 +1548,9 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
> if (ret)
> return ret;
>
> + if (can_optimize(auprobe, addr))
> + set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +
> ret = branch_setup_xol_ops(auprobe, &insn);
> if (ret != -ENOSYS)
> return ret;
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b6b077cc7d0f..08ef78439d0d 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -192,7 +192,7 @@ struct uprobes_state {
> };
>
> typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
> - uprobe_opcode_t *insn, int nbytes);
> + uprobe_opcode_t *insn, int nbytes, void *data);
>
> extern void __init uprobes_init(void);
> extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
> @@ -204,7 +204,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t,
> bool is_register);
> extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
> - uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr);
> + uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
> + void *data);
> extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> @@ -240,6 +241,7 @@ extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *
> extern void arch_uprobe_clear_state(struct mm_struct *mm);
> extern void arch_uprobe_init_state(struct mm_struct *mm);
> extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
> +extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cbba31c0495f..e54081beeab9 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -192,7 +192,7 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
> }
>
> static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *insn,
> - int nbytes)
> + int nbytes, void *data)
> {
> uprobe_opcode_t old_opcode;
> bool is_swbp;
> @@ -492,12 +492,13 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> bool is_register)
> {
> return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
> - verify_opcode, is_register, true /* do_update_ref_ctr */);
> + verify_opcode, is_register, true /* do_update_ref_ctr */, NULL);
> }
>
> int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
> - uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr)
> + uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
> + void *data)
> {
> const unsigned long vaddr = insn_vaddr & PAGE_MASK;
> struct mm_struct *mm = vma->vm_mm;
> @@ -531,7 +532,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> goto out;
> folio = page_folio(page);
>
> - ret = verify(page, insn_vaddr, insn, nbytes);
> + ret = verify(page, insn_vaddr, insn, nbytes, data);
> if (ret <= 0) {
> folio_put(folio);
> goto out;
> @@ -2697,6 +2698,10 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> return true;
> }
>
> +void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> +}
> +
> /*
> * Run handler and ask thread to singlestep.
> * Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
> @@ -2761,6 +2766,9 @@ static void handle_swbp(struct pt_regs *regs)
>
> handler_chain(uprobe, regs);
>
> + /* Try to optimize after first hit. */
> + arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
> +
> if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> goto out;
>
> --
> 2.50.1
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-25 10:13 ` Masami Hiramatsu
@ 2025-07-28 21:34 ` Jiri Olsa
2025-08-08 17:44 ` Jiri Olsa
2025-08-19 19:17 ` Peter Zijlstra
0 siblings, 2 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-07-28 21:34 UTC (permalink / raw)
To: Masami Hiramatsu, Peter Zijlstra
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Alan Maguire, David Laight,
Thomas Weißschuh, Ingo Molnar
On Fri, Jul 25, 2025 at 07:13:18PM +0900, Masami Hiramatsu wrote:
> On Sun, 20 Jul 2025 13:21:20 +0200
> Jiri Olsa <jolsa@kernel.org> wrote:
>
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> >
> > The current uprobe execution goes through following:
> >
> > - installs breakpoint instruction over original instruction
> > - exception handler hit and calls related uprobe consumers
> > - and either simulates original instruction or does out of line single step
> > execution of it
> > - returns to user space
> >
> > The optimized uprobe path does following:
> >
> > - checks the original instruction is 5-byte nop (plus other checks)
> > - adds (or uses existing) user space trampoline with uprobe syscall
> > - overwrites original instruction (5-byte nop) with call to user space
> > trampoline
> > - the user space trampoline executes uprobe syscall that calls related uprobe
> > consumers
> > - trampoline returns back to next instruction
> >
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we plan to use nop5 as USDT probe instruction
> > (which currently uses single byte nop) and speed up the USDT probes.
> >
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> >
> > The uprobe is un-optimized in arch specific set_orig_insn call.
> >
> > The instruction overwrite is x86 arch specific and needs to go through 3 updates:
> > (on top of nop5 instruction)
> >
> > - write int3 into 1st byte
> > - write last 4 bytes of the call instruction
> > - update the call instruction opcode
> >
> > And cleanup goes through similar reverse stages:
> >
> > - overwrite call opcode with breakpoint (int3)
> > - write last 4 bytes of the nop5 instruction
> > - write the nop5 first instruction byte
> >
> > We do not unmap and release uprobe trampoline when it's no longer needed,
> > because there's no easy way to make sure none of the threads is still
> > inside the trampoline. But we do not waste memory, because there's just
> > single page for all the uprobe trampoline mappings.
> >
> > We do waste frame on page mapping for every 4GB by keeping the uprobe
> > trampoline page mapped, but that seems ok.
> >
> > We take the benefit from the fact that set_swbp and set_orig_insn are
> > called under mmap_write_lock(mm), so we can use the current instruction
> > as the state the uprobe is in - nop5/breakpoint/call trampoline -
> > and decide the needed action (optimize/un-optimize) based on that.
> >
> > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> >
> > current:
> > usermode-count : 152.604 ± 0.044M/s
> > syscall-count : 13.359 ± 0.042M/s
> > --> uprobe-nop : 3.229 ± 0.002M/s
> > uprobe-push : 3.086 ± 0.004M/s
> > uprobe-ret : 1.114 ± 0.004M/s
> > uprobe-nop5 : 1.121 ± 0.005M/s
> > uretprobe-nop : 2.145 ± 0.002M/s
> > uretprobe-push : 2.070 ± 0.001M/s
> > uretprobe-ret : 0.931 ± 0.001M/s
> > uretprobe-nop5 : 0.957 ± 0.001M/s
> >
> > after the change:
> > usermode-count : 152.448 ± 0.244M/s
> > syscall-count : 14.321 ± 0.059M/s
> > uprobe-nop : 3.148 ± 0.007M/s
> > uprobe-push : 2.976 ± 0.004M/s
> > uprobe-ret : 1.068 ± 0.003M/s
> > --> uprobe-nop5 : 7.038 ± 0.007M/s
> > uretprobe-nop : 2.109 ± 0.004M/s
> > uretprobe-push : 2.035 ± 0.001M/s
> > uretprobe-ret : 0.908 ± 0.001M/s
> > uretprobe-nop5 : 3.377 ± 0.009M/s
> >
> > I see bit more speed up on Intel (above) compared to AMD. The big nop5
> > speed up is partly due to emulating nop5 and partly due to optimization.
> >
> > The key speed up we do this for is the USDT switch from nop to nop5:
> > uprobe-nop : 3.148 ± 0.007M/s
> > uprobe-nop5 : 7.038 ± 0.007M/s
> >
>
> This also looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
thanks!
Peter, do you have more comments?
thanks,
jirka
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-28 21:34 ` Jiri Olsa
@ 2025-08-08 17:44 ` Jiri Olsa
2025-08-19 19:17 ` Peter Zijlstra
1 sibling, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-08-08 17:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Masami Hiramatsu, Jiri Olsa, Oleg Nesterov, Andrii Nakryiko, bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
ping, thanks
On Mon, Jul 28, 2025 at 11:34:56PM +0200, Jiri Olsa wrote:
> On Fri, Jul 25, 2025 at 07:13:18PM +0900, Masami Hiramatsu wrote:
> > On Sun, 20 Jul 2025 13:21:20 +0200
> > Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > > Putting together all the previously added pieces to support optimized
> > > uprobes on top of 5-byte nop instruction.
> > >
> > > The current uprobe execution goes through following:
> > >
> > > - installs breakpoint instruction over original instruction
> > > - exception handler hit and calls related uprobe consumers
> > > - and either simulates original instruction or does out of line single step
> > > execution of it
> > > - returns to user space
> > >
> > > The optimized uprobe path does following:
> > >
> > > - checks the original instruction is 5-byte nop (plus other checks)
> > > - adds (or uses existing) user space trampoline with uprobe syscall
> > > - overwrites original instruction (5-byte nop) with call to user space
> > > trampoline
> > > - the user space trampoline executes uprobe syscall that calls related uprobe
> > > consumers
> > > - trampoline returns back to next instruction
> > >
> > > This approach won't speed up all uprobes as it's limited to using nop5 as
> > > original instruction, but we plan to use nop5 as USDT probe instruction
> > > (which currently uses single byte nop) and speed up the USDT probes.
> > >
> > > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > > first uprobe hit. I originally had it called on uprobe installation but then
> > > it clashed with elf loader, because the user space trampoline was added in a
> > > place where loader might need to put elf segments, so I decided to do it after
> > > first uprobe hit when loading is done.
> > >
> > > The uprobe is un-optimized in arch specific set_orig_insn call.
> > >
> > > The instruction overwrite is x86 arch specific and needs to go through 3 updates:
> > > (on top of nop5 instruction)
> > >
> > > - write int3 into 1st byte
> > > - write last 4 bytes of the call instruction
> > > - update the call instruction opcode
> > >
> > > And cleanup goes through similar reverse stages:
> > >
> > > - overwrite call opcode with breakpoint (int3)
> > > - write last 4 bytes of the nop5 instruction
> > > - write the nop5 first instruction byte
> > >
> > > We do not unmap and release uprobe trampoline when it's no longer needed,
> > > because there's no easy way to make sure none of the threads is still
> > > inside the trampoline. But we do not waste memory, because there's just
> > > single page for all the uprobe trampoline mappings.
> > >
> > > We do waste frame on page mapping for every 4GB by keeping the uprobe
> > > trampoline page mapped, but that seems ok.
> > >
> > > We take the benefit from the fact that set_swbp and set_orig_insn are
> > > called under mmap_write_lock(mm), so we can use the current instruction
> > > as the state the uprobe is in - nop5/breakpoint/call trampoline -
> > > and decide the needed action (optimize/un-optimize) based on that.
> > >
> > > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> > >
> > > current:
> > > usermode-count : 152.604 ± 0.044M/s
> > > syscall-count : 13.359 ± 0.042M/s
> > > --> uprobe-nop : 3.229 ± 0.002M/s
> > > uprobe-push : 3.086 ± 0.004M/s
> > > uprobe-ret : 1.114 ± 0.004M/s
> > > uprobe-nop5 : 1.121 ± 0.005M/s
> > > uretprobe-nop : 2.145 ± 0.002M/s
> > > uretprobe-push : 2.070 ± 0.001M/s
> > > uretprobe-ret : 0.931 ± 0.001M/s
> > > uretprobe-nop5 : 0.957 ± 0.001M/s
> > >
> > > after the change:
> > > usermode-count : 152.448 ± 0.244M/s
> > > syscall-count : 14.321 ± 0.059M/s
> > > uprobe-nop : 3.148 ± 0.007M/s
> > > uprobe-push : 2.976 ± 0.004M/s
> > > uprobe-ret : 1.068 ± 0.003M/s
> > > --> uprobe-nop5 : 7.038 ± 0.007M/s
> > > uretprobe-nop : 2.109 ± 0.004M/s
> > > uretprobe-push : 2.035 ± 0.001M/s
> > > uretprobe-ret : 0.908 ± 0.001M/s
> > > uretprobe-nop5 : 3.377 ± 0.009M/s
> > >
> > > I see bit more speed up on Intel (above) compared to AMD. The big nop5
> > > speed up is partly due to emulating nop5 and partly due to optimization.
> > >
> > > The key speed up we do this for is the USDT switch from nop to nop5:
> > > uprobe-nop : 3.148 ± 0.007M/s
> > > uprobe-nop5 : 7.038 ± 0.007M/s
> > >
> >
> > This also looks good to me.
> >
> > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> thanks!
>
> Peter, do you have more comments?
>
> thanks,
> jirka
* Re: [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines
2025-07-20 11:21 ` [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
@ 2025-08-19 14:53 ` Peter Zijlstra
2025-08-20 12:18 ` Jiri Olsa
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-19 14:53 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Andrii Nakryiko, Masami Hiramatsu (Google), bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Sun, Jul 20, 2025 at 01:21:18PM +0200, Jiri Olsa wrote:
> +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> +{
> + /*
> + * We do not unmap and release uprobe trampoline page itself,
> + * because there's no easy way to make sure none of the threads
> + * is still inside the trampoline.
> + */
> + hlist_del(&tramp->node);
> + kfree(tramp);
> +}
I am somewhat confused; isn't this called from
__mmput()->uprobe_clear_state()->arch_uprobe_clear_state ?
At that time we don't have threads anymore and mm is about to be
destroyed anyway.
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-20 11:21 ` [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2025-07-25 10:13 ` Masami Hiramatsu
@ 2025-08-19 19:15 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
` (2 more replies)
1 sibling, 3 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-19 19:15 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Sun, Jul 20, 2025 at 01:21:20PM +0200, Jiri Olsa wrote:
> +static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> +{
> + struct __packed __arch_relative_insn {
> + u8 op;
> + s32 raddr;
> + } *call = (struct __arch_relative_insn *) insn;
Not something you need to clean up now I suppose, but we could do with
unifying this thing. We have a bunch of instances around.
> +
> + if (!is_call_insn(insn))
> + return false;
> + return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> +}
> +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + struct mm_struct *mm = current->mm;
> + uprobe_opcode_t insn[5];
> +
> + /*
> + * Do not optimize if shadow stack is enabled, the return address hijack
> + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> + * the entry uprobe is optimized and the shadow stack crashes the app.
> + */
> + if (shstk_is_enabled())
> + return;
Kernel should be able to fix up userspace shadow stack just fine.
> + if (!should_optimize(auprobe))
> + return;
> +
> + mmap_write_lock(mm);
> +
> + /*
> + * Check if some other thread already optimized the uprobe for us,
> + * if it's the case just go away silently.
> + */
> + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> + goto unlock;
> + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> + goto unlock;
> +
> + /*
> + * If we fail to optimize the uprobe we set the fail bit so the
> + * above should_optimize will fail from now on.
> + */
> + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> +
> +unlock:
> + mmap_write_unlock(mm);
> +}
> +
> +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + if (memcmp(&auprobe->insn, x86_nops[5], 5))
> + return false;
> + /* We can't do cross page atomic writes yet. */
> + return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
> +}
This seems needlessly restrictive. Something like:
is_nop5(const char *buf)
{
struct insn insn;
ret = insn_decode_kernel(&insn, buf)
if (ret < 0)
return false;
if (insn.length != 5)
return false;
if (insn.opcode[0] != 0x0f ||
insn.opcode[1] != 0x1f)
return false;
return true;
}
Should do I suppose. Anyway, I think something like:
f0 0f 1f 44 00 00 lock nopl 0(%eax, %eax, 1)
is a valid NOP5 at +1 and will 'optimize' and result in:
f0 e8 disp32 lock call disp32
which will #UD.
But this is nearly unfixable. Just doing my best to find weirdo cases
;-)
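For illustration, the decode-based check above wired into can_optimize() could
look roughly like the sketch below. This is only a sketch of the suggestion,
not the queued patch; note that struct insn exposes the opcode bytes through
its opcode.bytes[] member:

static bool is_nop5(const u8 *buf)
{
        struct insn insn;

        if (insn_decode_kernel(&insn, buf) < 0)
                return false;
        if (insn.length != 5)
                return false;
        /* multi-byte NOP is 0f 1f /0 */
        return insn.opcode.bytes[0] == 0x0f && insn.opcode.bytes[1] == 0x1f;
}

static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
{
        if (!is_nop5(auprobe->insn))
                return false;
        /* We can't do cross page atomic writes yet. */
        return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
}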
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-07-28 21:34 ` Jiri Olsa
2025-08-08 17:44 ` Jiri Olsa
@ 2025-08-19 19:17 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-19 19:17 UTC (permalink / raw)
To: Jiri Olsa
Cc: Masami Hiramatsu, Oleg Nesterov, Andrii Nakryiko, bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Mon, Jul 28, 2025 at 11:34:56PM +0200, Jiri Olsa wrote:
> Peter, do you have more comments?
I'm not really a fan of this 'syscall is faster than exception' stuff. Yes
it is for current hardware, but I suspect much of this will be a
maintenance burden 'soon'.
Anyway, I'll queue the patches tomorrow. I think the shadow stack thing
wants fixing though. The rest we can prod at whenever.
* Re: [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines
2025-08-19 14:53 ` Peter Zijlstra
@ 2025-08-20 12:18 ` Jiri Olsa
0 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-08-20 12:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Andrii Nakryiko, Masami Hiramatsu (Google), bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Tue, Aug 19, 2025 at 04:53:45PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 20, 2025 at 01:21:18PM +0200, Jiri Olsa wrote:
> > +static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > +{
> > + /*
> > + * We do not unmap and release uprobe trampoline page itself,
> > + * because there's no easy way to make sure none of the threads
> > + * is still inside the trampoline.
> > + */
> > + hlist_del(&tramp->node);
> > + kfree(tramp);
> > +}
>
> I am somewhat confused; isn't this called from
> __mmput()->uprobe_clear_state()->arch_uprobe_clear_state ?
>
> At that time we don't have threads anymore and mm is about to be
> destroyed anyway.
ah, you mean the comment, right? we need to release the tramp objects;
the comment is a leftover from one of the previous versions, it can be
removed, sorry
jirka
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-19 19:17 ` Peter Zijlstra
@ 2025-08-20 12:19 ` Jiri Olsa
0 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-08-20 12:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jiri Olsa, Masami Hiramatsu, Oleg Nesterov, Andrii Nakryiko, bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Tue, Aug 19, 2025 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 28, 2025 at 11:34:56PM +0200, Jiri Olsa wrote:
>
> > Peter, do you have more comments?
>
> I'm not really a fan of this 'syscall is faster than exception' stuff. Yes
> it is for current hardware, but I suspect much of this will be a
> maintenance burden 'soon'.
>
> Anyway, I'll queue the patches tomorrow. I think the shadow stack thing
> wants fixing though. The rest we can prod at whenever.
ok, will send follow up
thanks,
jirka
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-19 19:15 ` Peter Zijlstra
@ 2025-08-20 12:19 ` Jiri Olsa
2025-08-20 13:01 ` Peter Zijlstra
2025-08-20 12:30 ` Peter Zijlstra
2025-09-03 6:48 ` Jiri Olsa
2 siblings, 1 reply; 44+ messages in thread
From: Jiri Olsa @ 2025-08-20 12:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Tue, Aug 19, 2025 at 09:15:15PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 20, 2025 at 01:21:20PM +0200, Jiri Olsa wrote:
>
> > +static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> > +{
> > + struct __packed __arch_relative_insn {
> > + u8 op;
> > + s32 raddr;
> > + } *call = (struct __arch_relative_insn *) insn;
>
> Not something you need to clean up now I suppose, but we could do with
> unifying this thing. We have a bunch of instances around.
ok, I noticed, will send patch for that
>
> > +
> > + if (!is_call_insn(insn))
> > + return false;
> > + return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> > +}
>
> > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + uprobe_opcode_t insn[5];
> > +
> > + /*
> > + * Do not optimize if shadow stack is enabled, the return address hijack
> > + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> > + * the entry uprobe is optimized and the shadow stack crashes the app.
> > + */
> > + if (shstk_is_enabled())
> > + return;
>
> Kernel should be able to fix up userspace shadow stack just fine.
ok, will send follow up fix
>
> > + if (!should_optimize(auprobe))
> > + return;
> > +
> > + mmap_write_lock(mm);
> > +
> > + /*
> > + * Check if some other thread already optimized the uprobe for us,
> > + * if it's the case just go away silently.
> > + */
> > + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> > + goto unlock;
> > + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> > + goto unlock;
> > +
> > + /*
> > + * If we fail to optimize the uprobe we set the fail bit so the
> > + * above should_optimize will fail from now on.
> > + */
> > + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> > + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> > +
> > +unlock:
> > + mmap_write_unlock(mm);
> > +}
> > +
> > +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> > +{
> > + if (memcmp(&auprobe->insn, x86_nops[5], 5))
> > + return false;
> > + /* We can't do cross page atomic writes yet. */
> > + return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
> > +}
>
> This seems needlessly restrictive. Something like:
>
> is_nop5(const char *buf)
> {
> struct insn insn;
>
> ret = insn_decode_kernel(&insn, buf)
> if (ret < 0)
> return false;
>
> if (insn.length != 5)
> return false;
>
> if (insn.opcode[0] != 0x0f ||
> insn.opcode[1] != 0x1f)
> return false;
>
> return true;
> }
>
> Should do I suppose.
ok, looks good, should I respin with this, or is a follow-up ok?
> Anyway, I think something like:
>
> f0 0f 1f 44 00 00 lock nopl 0(%eax, %eax, 1)
>
> is a valid NOP5 at +1 and will 'optimize' and result in:
>
> f0 e8 disp32 lock call disp32
>
> which will #UD.
>
> But this is nearly unfixable. Just doing my best to find weirdo cases
> ;-)
nice, but I think if the user puts a non-optimized uprobe in the middle of the
instruction, like at lock-nop5 + 1, the app would crash as well
thanks,
jirka
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-19 19:15 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
@ 2025-08-20 12:30 ` Peter Zijlstra
2025-08-20 15:58 ` Edgecombe, Rick P
2025-08-20 21:38 ` Jiri Olsa
2025-09-03 6:48 ` Jiri Olsa
2 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-20 12:30 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar,
rick.p.edgecombe
On Tue, Aug 19, 2025 at 09:15:15PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 20, 2025 at 01:21:20PM +0200, Jiri Olsa wrote:
> > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + uprobe_opcode_t insn[5];
> > +
> > + /*
> > + * Do not optimize if shadow stack is enabled, the return address hijack
> > + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> > + * the entry uprobe is optimized and the shadow stack crashes the app.
> > + */
> > + if (shstk_is_enabled())
> > + return;
>
> Kernel should be able to fix up userspace shadow stack just fine.
>
> > + if (!should_optimize(auprobe))
> > + return;
> > +
> > + mmap_write_lock(mm);
> > +
> > + /*
> > + * Check if some other thread already optimized the uprobe for us,
> > + * if it's the case just go away silently.
> > + */
> > + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> > + goto unlock;
> > + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> > + goto unlock;
> > +
> > + /*
> > + * If we fail to optimize the uprobe we set the fail bit so the
> > + * above should_optimize will fail from now on.
> > + */
> > + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> > + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> > +
> > +unlock:
> > + mmap_write_unlock(mm);
> > +}
Something a little like this should do I suppose...
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -23,6 +23,8 @@ int setup_signal_shadow_stack(struct ksi
int restore_signal_shadow_stack(void);
int shstk_update_last_frame(unsigned long val);
bool shstk_is_enabled(void);
+int shstk_pop(u64 *val);
+int shstk_push(u64 val);
#else
static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
@@ -35,6 +37,8 @@ static inline int setup_signal_shadow_st
static inline int restore_signal_shadow_stack(void) { return 0; }
static inline int shstk_update_last_frame(unsigned long val) { return 0; }
static inline bool shstk_is_enabled(void) { return false; }
+static inline int shstk_pop(u64 *val) { return -ENOTSUPP; }
+static inline int shstk_push(u64 val) { return -ENOTSUPP; }
#endif /* CONFIG_X86_USER_SHADOW_STACK */
#endif /* __ASSEMBLER__ */
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -246,6 +246,46 @@ static unsigned long get_user_shstk_addr
return ssp;
}
+int shstk_pop(u64 *val)
+{
+ int ret = 0;
+ u64 ssp;
+
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -ENOTSUPP;
+
+ fpregs_lock_and_load();
+
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+ if (val && get_user(*val, (__user u64 *)ssp))
+ ret = -EFAULT;
+ ssp += SS_FRAME_SIZE;
+ wrmsrq(MSR_IA32_PL3_SSP, ssp);
+
+ fpregs_unlock();
+
+ return ret;
+}
+
+int shstk_push(u64 val)
+{
+ u64 ssp;
+ int ret;
+
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -ENOTSUPP;
+
+ fpregs_lock_and_load();
+
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+ ssp -= SS_FRAME_SIZE;
+ wrmsrq(MSR_IA32_PL3_SSP, ssp);
+ ret = write_user_shstk_64((__user void *)ssp, val);
+ fpregs_unlock();
+
+ return ret;
+}
+
#define SHSTK_DATA_BIT BIT(63)
static int put_shstk_data(u64 __user *addr, u64 data)
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -804,7 +804,7 @@ SYSCALL_DEFINE0(uprobe)
{
struct pt_regs *regs = task_pt_regs(current);
struct uprobe_syscall_args args;
- unsigned long ip, sp;
+ unsigned long ip, sp, sret;
int err;
/* Allow execution only from uprobe trampolines. */
@@ -831,6 +831,9 @@ SYSCALL_DEFINE0(uprobe)
sp = regs->sp;
+ if (shstk_pop(&sret) == 0 && sret != args.retaddr)
+ goto sigill;
+
handle_syscall_uprobe(regs, regs->ip);
/*
@@ -855,6 +858,9 @@ SYSCALL_DEFINE0(uprobe)
if (args.retaddr - 5 != regs->ip)
args.retaddr = regs->ip;
+ if (shstk_push(args.retaddr) == -EFAULT)
+ goto sigill;
+
regs->ip = ip;
err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
@@ -1124,14 +1130,6 @@ void arch_uprobe_optimize(struct arch_up
struct mm_struct *mm = current->mm;
uprobe_opcode_t insn[5];
- /*
- * Do not optimize if shadow stack is enabled, the return address hijack
- * code in arch_uretprobe_hijack_return_addr updates wrong frame when
- * the entry uprobe is optimized and the shadow stack crashes the app.
- */
- if (shstk_is_enabled())
- return;
-
if (!should_optimize(auprobe))
return;
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 12:19 ` Jiri Olsa
@ 2025-08-20 13:01 ` Peter Zijlstra
0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-20 13:01 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Wed, Aug 20, 2025 at 02:19:15PM +0200, Jiri Olsa wrote:
> > This seems needlessly restrictive. Something like:
> >
> > is_nop5(const char *buf)
> > {
> > struct insn insn;
> >
> > ret = insn_decode_kernel(&insn, buf)
> > if (ret < 0)
> > return false;
> >
> > if (insn.length != 5)
> > return false;
> >
> > if (insn.opcode[0] != 0x0f ||
> > insn.opcode[1] != 0x1f)
> > return false;
> >
> > return true;
> > }
> >
> > Should do I suppose.
>
> ok, looks good, should I respin with this, or is follow up ok?
I cleaned up already; I pushed out these patches to queue/perf/core and
added a few of my own.
I will need to write better Changelogs, and post them, but I need to run
some errants first.
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 12:30 ` Peter Zijlstra
@ 2025-08-20 15:58 ` Edgecombe, Rick P
2025-08-20 17:12 ` Peter Zijlstra
2025-08-20 21:38 ` Jiri Olsa
1 sibling, 1 reply; 44+ messages in thread
From: Edgecombe, Rick P @ 2025-08-20 15:58 UTC (permalink / raw)
To: jolsa@kernel.org, peterz@infradead.org
Cc: songliubraving@fb.com, alan.maguire@oracle.com,
mhiramat@kernel.org, andrii@kernel.org, john.fastabend@gmail.com,
linux-kernel@vger.kernel.org, mingo@kernel.org,
rostedt@goodmis.org, David.Laight@aculab.com, yhs@fb.com,
oleg@redhat.com, linux-trace-kernel@vger.kernel.org,
bpf@vger.kernel.org, thomas@t-8ch.de, haoluo@google.com,
x86@kernel.org
I'm not sure we should optimize for shadow stack yet. Unless it's easy to think
about... (below)
On Wed, 2025-08-20 at 14:30 +0200, Peter Zijlstra wrote:
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -23,6 +23,8 @@ int setup_signal_shadow_stack(struct ksi
> int restore_signal_shadow_stack(void);
> int shstk_update_last_frame(unsigned long val);
> bool shstk_is_enabled(void);
> +int shstk_pop(u64 *val);
> +int shstk_push(u64 val);
> #else
> static inline long shstk_prctl(struct task_struct *task, int option,
> unsigned long arg2) { return -EINVAL; }
> @@ -35,6 +37,8 @@ static inline int setup_signal_shadow_st
> static inline int restore_signal_shadow_stack(void) { return 0; }
> static inline int shstk_update_last_frame(unsigned long val) { return 0; }
> static inline bool shstk_is_enabled(void) { return false; }
> +static inline int shstk_pop(u64 *val) { return -ENOTSUPP; }
> +static inline int shstk_push(u64 val) { return -ENOTSUPP; }
> #endif /* CONFIG_X86_USER_SHADOW_STACK */
>
> #endif /* __ASSEMBLER__ */
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -246,6 +246,46 @@ static unsigned long get_user_shstk_addr
> return ssp;
> }
>
> +int shstk_pop(u64 *val)
> +{
> + int ret = 0;
> + u64 ssp;
> +
> + if (!features_enabled(ARCH_SHSTK_SHSTK))
> + return -ENOTSUPP;
> +
> + fpregs_lock_and_load();
> +
> + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> + if (val && get_user(*val, (__user u64 *)ssp))
> + ret = -EFAULT;
> + ssp += SS_FRAME_SIZE;
> + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> +
> + fpregs_unlock();
> +
> + return ret;
> +}
> +
> +int shstk_push(u64 val)
> +{
> + u64 ssp;
> + int ret;
> +
> + if (!features_enabled(ARCH_SHSTK_SHSTK))
> + return -ENOTSUPP;
> +
> + fpregs_lock_and_load();
> +
> + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> + ssp -= SS_FRAME_SIZE;
> + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> + ret = write_user_shstk_64((__user void *)ssp, val);
Should we roll back ssp if there is a fault?
> + fpregs_unlock();
> +
> + return ret;
> +}
> +
> #define SHSTK_DATA_BIT BIT(63)
>
> static int put_shstk_data(u64 __user *addr, u64 data)
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -804,7 +804,7 @@ SYSCALL_DEFINE0(uprobe)
> {
> struct pt_regs *regs = task_pt_regs(current);
> struct uprobe_syscall_args args;
> - unsigned long ip, sp;
> + unsigned long ip, sp, sret;
> int err;
>
> /* Allow execution only from uprobe trampolines. */
> @@ -831,6 +831,9 @@ SYSCALL_DEFINE0(uprobe)
>
> sp = regs->sp;
>
> + if (shstk_pop(&sret) == 0 && sret != args.retaddr)
> + goto sigill;
> +
> handle_syscall_uprobe(regs, regs->ip);
>
> /*
> @@ -855,6 +858,9 @@ SYSCALL_DEFINE0(uprobe)
> if (args.retaddr - 5 != regs->ip)
> args.retaddr = regs->ip;
>
> + if (shstk_push(args.retaddr) == -EFAULT)
> + goto sigill;
> +
Are we effectively allowing arbitrary shadow stack push here? I see we need to
be in in_uprobe_trampoline(), but there is no mmap lock taken, so it's a racy
check. I'm questioning if the security posture tweak is worth thinking about for
whatever the level of intersection of uprobes usage and shadow stack is today.
> regs->ip = ip;
>
> err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> @@ -1124,14 +1130,6 @@ void arch_uprobe_optimize(struct arch_up
> struct mm_struct *mm = current->mm;
> uprobe_opcode_t insn[5];
>
> - /*
> - * Do not optimize if shadow stack is enabled, the return address hijack
> - * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> - * the entry uprobe is optimized and the shadow stack crashes the app.
> - */
> - if (shstk_is_enabled())
> - return;
> -
> if (!should_optimize(auprobe))
> return;
>
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 15:58 ` Edgecombe, Rick P
@ 2025-08-20 17:12 ` Peter Zijlstra
2025-08-20 17:26 ` Edgecombe, Rick P
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-20 17:12 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: jolsa@kernel.org, songliubraving@fb.com, alan.maguire@oracle.com,
mhiramat@kernel.org, andrii@kernel.org, john.fastabend@gmail.com,
linux-kernel@vger.kernel.org, mingo@kernel.org,
rostedt@goodmis.org, David.Laight@aculab.com, yhs@fb.com,
oleg@redhat.com, linux-trace-kernel@vger.kernel.org,
bpf@vger.kernel.org, thomas@t-8ch.de, haoluo@google.com,
x86@kernel.org
On Wed, Aug 20, 2025 at 03:58:14PM +0000, Edgecombe, Rick P wrote:
> I'm not sure we should optimize for shadow stack yet. Unless it's easy to think
> about... (below)
>
> On Wed, 2025-08-20 at 14:30 +0200, Peter Zijlstra wrote:
> > --- a/arch/x86/include/asm/shstk.h
> > +++ b/arch/x86/include/asm/shstk.h
> > @@ -23,6 +23,8 @@ int setup_signal_shadow_stack(struct ksi
> > int restore_signal_shadow_stack(void);
> > int shstk_update_last_frame(unsigned long val);
> > bool shstk_is_enabled(void);
> > +int shstk_pop(u64 *val);
> > +int shstk_push(u64 val);
> > #else
> > static inline long shstk_prctl(struct task_struct *task, int option,
> > unsigned long arg2) { return -EINVAL; }
> > @@ -35,6 +37,8 @@ static inline int setup_signal_shadow_st
> > static inline int restore_signal_shadow_stack(void) { return 0; }
> > static inline int shstk_update_last_frame(unsigned long val) { return 0; }
> > static inline bool shstk_is_enabled(void) { return false; }
> > +static inline int shstk_pop(u64 *val) { return -ENOTSUPP; }
> > +static inline int shstk_push(u64 val) { return -ENOTSUPP; }
> > #endif /* CONFIG_X86_USER_SHADOW_STACK */
> >
> > #endif /* __ASSEMBLER__ */
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -246,6 +246,46 @@ static unsigned long get_user_shstk_addr
> > return ssp;
> > }
> >
> > +int shstk_pop(u64 *val)
> > +{
> > + int ret = 0;
> > + u64 ssp;
> > +
> > + if (!features_enabled(ARCH_SHSTK_SHSTK))
> > + return -ENOTSUPP;
> > +
> > + fpregs_lock_and_load();
> > +
> > + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> > + if (val && get_user(*val, (__user u64 *)ssp))
> > + ret = -EFAULT;
> > + ssp += SS_FRAME_SIZE;
> > + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> > +
> > + fpregs_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +int shstk_push(u64 val)
> > +{
> > + u64 ssp;
> > + int ret;
> > +
> > + if (!features_enabled(ARCH_SHSTK_SHSTK))
> > + return -ENOTSUPP;
> > +
> > + fpregs_lock_and_load();
> > +
> > + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> > + ssp -= SS_FRAME_SIZE;
> > + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> > + ret = write_user_shstk_64((__user void *)ssp, val);
>
> Should we roll back ssp if there is a fault?
Ah, probably. And same with pop I suppose, don't adjust ssp if we can't
read it etc.
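A minimal sketch of that ordering for the push side, reusing the helpers from
the hunk above (illustration only), with the SSP update deferred until the
frame write has succeeded:

int shstk_push(u64 val)
{
        u64 ssp;
        int ret;

        if (!features_enabled(ARCH_SHSTK_SHSTK))
                return -ENOTSUPP;

        fpregs_lock_and_load();

        rdmsrq(MSR_IA32_PL3_SSP, ssp);
        ssp -= SS_FRAME_SIZE;
        ret = write_user_shstk_64((u64 __user *)ssp, val);
        /* Only move SSP once the new frame is in place. */
        if (!ret)
                wrmsrq(MSR_IA32_PL3_SSP, ssp);

        fpregs_unlock();

        return ret;
}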
> > + fpregs_unlock();
> > +
> > + return ret;
> > +}
> > +
> > #define SHSTK_DATA_BIT BIT(63)
> >
> > static int put_shstk_data(u64 __user *addr, u64 data)
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -804,7 +804,7 @@ SYSCALL_DEFINE0(uprobe)
> > {
> > struct pt_regs *regs = task_pt_regs(current);
> > struct uprobe_syscall_args args;
> > - unsigned long ip, sp;
> > + unsigned long ip, sp, sret;
> > int err;
> >
> > /* Allow execution only from uprobe trampolines. */
> > @@ -831,6 +831,9 @@ SYSCALL_DEFINE0(uprobe)
> >
> > sp = regs->sp;
> >
> > + if (shstk_pop(&sret) == 0 && sret != args.retaddr)
> > + goto sigill;
> > +
> > handle_syscall_uprobe(regs, regs->ip);
> >
> > /*
> > @@ -855,6 +858,9 @@ SYSCALL_DEFINE0(uprobe)
> > if (args.retaddr - 5 != regs->ip)
> > args.retaddr = regs->ip;
> >
> > + if (shstk_push(args.retaddr) == -EFAULT)
> > + goto sigill;
> > +
>
> Are we effectively allowing arbitrary shadow stack push here?
Yeah, why not? The userspace shadow stack does not, and cannot, protect
from the kernel being funneh. It fully relies on the kernel being
trusted. So the kernel doing a shstk_{pop,push}() to make things line up
properly shouldn't be a problem.
> I see we need to be in in_uprobe_trampoline(), but there is no mmap
> lock taken, so it's a racy check.
Racy how? Isn't this more or less equivalent to what a normal CALL
instruction would do?
> I'm questioning if the security posture tweak is worth thinking about for
> whatever the level of intersection of uprobes usage and shadow stack is today.
I have no idea how much code is built with shadow stack enabled today;
but I see no point in not supporting uprobes on it. The whole of
userspace shadow stacks only ever protects from userspace attacking
other userspace -- and that protection isn't changed by this.
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 17:12 ` Peter Zijlstra
@ 2025-08-20 17:26 ` Edgecombe, Rick P
2025-08-20 17:43 ` Peter Zijlstra
0 siblings, 1 reply; 44+ messages in thread
From: Edgecombe, Rick P @ 2025-08-20 17:26 UTC (permalink / raw)
To: peterz@infradead.org
Cc: songliubraving@fb.com, alan.maguire@oracle.com,
mhiramat@kernel.org, andrii@kernel.org, john.fastabend@gmail.com,
mingo@kernel.org, linux-kernel@vger.kernel.org,
rostedt@goodmis.org, David.Laight@aculab.com, yhs@fb.com,
oleg@redhat.com, bpf@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, thomas@t-8ch.de,
jolsa@kernel.org, haoluo@google.com, x86@kernel.org
On Wed, 2025-08-20 at 19:12 +0200, Peter Zijlstra wrote:
> > Are we effectively allowing arbitrary shadow stack push here?
>
> Yeah, why not? The userspace shadow stack does not, and cannot, protect
> from the kernel being funneh. It fully relies on the kernel being
> trusted. So the kernel doing a shstk_{pop,push}() to make things line up
> properly shouldn't be a problem.
Emulating a call/ret should be fine.
>
> > I see we need to be in in_uprobe_trampoline(), but there is no mmap
> > lock taken, so it's a racy check.
>
> Racy how? Isn't this more or less equivalent to what a normal CALL
> instruction would do?
Racy in terms of the "is trampoline" check happening before the write to the
shadow stack. I was thinking like a TOCTOU thing. The "Allow execution only from
uprobe trampolines" check is not very strong.
As for call equivalence, args.retaddr comes from userspace, right?
>
> > I'm questioning if the security posture tweak is worth thinking about for
> > whatever the level of intersection of uprobes usage and shadow stack is
> > today.
>
> I have no idea how much code is built with shadow stack enabled today;
> but I see no point in not supporting uprobes on it. The whole of
> userspace shadow stacks only ever protects from userspace attacking
> other userspace -- and that protection isn't changed by this.
Isn't this just about whether to support an optimization for uprobes?
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 17:26 ` Edgecombe, Rick P
@ 2025-08-20 17:43 ` Peter Zijlstra
2025-08-20 18:04 ` Edgecombe, Rick P
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-08-20 17:43 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: songliubraving@fb.com, alan.maguire@oracle.com,
mhiramat@kernel.org, andrii@kernel.org, john.fastabend@gmail.com,
mingo@kernel.org, linux-kernel@vger.kernel.org,
rostedt@goodmis.org, David.Laight@aculab.com, yhs@fb.com,
oleg@redhat.com, bpf@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, thomas@t-8ch.de,
jolsa@kernel.org, haoluo@google.com, x86@kernel.org
On Wed, Aug 20, 2025 at 05:26:38PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2025-08-20 at 19:12 +0200, Peter Zijlstra wrote:
> > > Are we effectively allowing arbitrary shadow stack push here?
> >
> > Yeah, why not? Userspace shadow stacks does not, and cannot, protect
> > from the kernel being funneh. It fully relies on the kernel being
> > trusted. So the kernel doing a shstk_{pop,push}() to make things line up
> > properly shouldn't be a problem.
>
> Emulating a call/ret should be fine.
>
> >
> > > I see we need to be in in_uprobe_trampoline(), but there is no mmap
> > > lock taken, so it's a racy check.
> >
> > Racy how? Isn't this more or less equivalent to what a normal CALL
> > instruction would do?
>
> Racy in terms of the "is trampoline" check happening before the write to the
> shadow stack. I was thinking like a TOCTOU thing. The "Allow execution only from
> uprobe trampolines" check is not very strong.
>
> As for call equivalence, args.retaddr comes from userspace, right?
Yeah. So this whole thing is your random code having a 5-byte nop. And
instead of using INT3 to turn it into #BP, we turn it into "CALL
uprobe_trampoline".
That trampoline looks like:
push %rcx
push %r11
push %rax;
mov $__NR_uprobe, %rax
syscall
pop %rax
pop %r11
pop %rcx
ret
Now, that syscall handler is the one doing shstk_pop/push. But it does
that right along with modifying the normal SP.
Basically the syscall copies the 4 (CALL,PUSH,PUSH,PUSH) words from the
stack into a local struct (args), adjusts SP, and adjusts IP to point to
the CALL instruction that got us here (retaddr-5).
This way, we get the same context as that #BP would've gotten. Then we
run uprobe crap, and on return:
- sp changed, we take the (slow) IRET path out, and can just jump
wherever -- typically right after the CALL that got us here, no need
to re-adjust the stack and take the trampoline tail.
- sp didn't change, we take the (fast) sysexit path out, and have to
re-apply the CALL,PUSH,PUSH,PUSH such that the trampoline tail can
undo them again.
The added shstk_pop/push() exactly match the above undo/redo of the CALL
(and other stack ops).
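Concretely, the user stack the syscall sees looks like this, derived from the
trampoline above (one 8-byte word per push; just an illustration, the actual
struct uprobe_syscall_args layout is in the patch):

        regs->sp + 24:  retaddr  (pushed by the CALL at the probe site)
        regs->sp + 16:  rcx      (push %rcx in the trampoline)
        regs->sp +  8:  r11      (push %r11 in the trampoline)
        regs->sp +  0:  rax      (push %rax in the trampoline)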
> > > I'm questioning if the security posture tweak is worth thinking about for
> > > whatever the level of intersection of uprobes usage and shadow stack is
> > > today.
> >
> > I have no idea how much code is built with shadow stack enabled today;
> > but I see no point in not supporting uprobes on it. The whole of
> > userspace shadow stacks only ever protects from userspace attacking
> > other userspace -- and that protection isn't changed by this.
>
> Isn't this just about whether to support an optimization for uprobes?
Yes. But supporting the shstk isn't hard (as per this patch), it exactly
matches what it already does to the normal stack. So I don't see a
reason not to do it.
Anyway, I'm not a huge fan of any of this. I suspect FRED will make all
this fancy code totally irrelevant. But until people have FRED enabled
hardware in large quantities, I suppose this has a use.
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 17:43 ` Peter Zijlstra
@ 2025-08-20 18:04 ` Edgecombe, Rick P
0 siblings, 0 replies; 44+ messages in thread
From: Edgecombe, Rick P @ 2025-08-20 18:04 UTC (permalink / raw)
To: peterz@infradead.org
Cc: songliubraving@fb.com, alan.maguire@oracle.com,
mhiramat@kernel.org, andrii@kernel.org, john.fastabend@gmail.com,
mingo@kernel.org, linux-kernel@vger.kernel.org,
rostedt@goodmis.org, David.Laight@aculab.com, yhs@fb.com,
oleg@redhat.com, bpf@vger.kernel.org,
linux-trace-kernel@vger.kernel.org, thomas@t-8ch.de,
jolsa@kernel.org, haoluo@google.com, x86@kernel.org
On Wed, 2025-08-20 at 19:43 +0200, Peter Zijlstra wrote:
> Yes. But supporting the shstk isn't hard (as per this patch), it exactly
> matches what it already does to the normal stack. So I don't see a
> reason not to do it.
Thanks for explaining, and sorry for being slow. Going to blame this head cold.
>
> Anyway, I'm not a huge fan of any of this. I suspect FRED will make all
> this fancy code totally irrelevant. But until people have FRED enabled
> hardware in large quantities, I suppose this has a use.
It doesn't sound too unbounded and I guess as long as it's just an optimization
we can always back it out if someone finds a way to abuse it.
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-20 12:30 ` Peter Zijlstra
2025-08-20 15:58 ` Edgecombe, Rick P
@ 2025-08-20 21:38 ` Jiri Olsa
1 sibling, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-08-20 21:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar,
rick.p.edgecombe
On Wed, Aug 20, 2025 at 02:30:33PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 19, 2025 at 09:15:15PM +0200, Peter Zijlstra wrote:
> > On Sun, Jul 20, 2025 at 01:21:20PM +0200, Jiri Olsa wrote:
>
> > > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> > > +{
> > > + struct mm_struct *mm = current->mm;
> > > + uprobe_opcode_t insn[5];
> > > +
> > > + /*
> > > + * Do not optimize if shadow stack is enabled, the return address hijack
> > > + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> > > + * the entry uprobe is optimized and the shadow stack crashes the app.
> > > + */
> > > + if (shstk_is_enabled())
> > > + return;
> >
> > Kernel should be able to fix up userspace shadow stack just fine.
> >
> > > + if (!should_optimize(auprobe))
> > > + return;
> > > +
> > > + mmap_write_lock(mm);
> > > +
> > > + /*
> > > + * Check if some other thread already optimized the uprobe for us,
> > > + * if it's the case just go away silently.
> > > + */
> > > + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> > > + goto unlock;
> > > + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> > > + goto unlock;
> > > +
> > > + /*
> > > + * If we fail to optimize the uprobe we set the fail bit so the
> > > + * above should_optimize will fail from now on.
> > > + */
> > > + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> > > + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> > > +
> > > +unlock:
> > > + mmap_write_unlock(mm);
> > > +}
>
> Something a little like this should do I suppose...
>
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -23,6 +23,8 @@ int setup_signal_shadow_stack(struct ksi
> int restore_signal_shadow_stack(void);
> int shstk_update_last_frame(unsigned long val);
> bool shstk_is_enabled(void);
> +int shstk_pop(u64 *val);
> +int shstk_push(u64 val);
> #else
> static inline long shstk_prctl(struct task_struct *task, int option,
> unsigned long arg2) { return -EINVAL; }
> @@ -35,6 +37,8 @@ static inline int setup_signal_shadow_st
> static inline int restore_signal_shadow_stack(void) { return 0; }
> static inline int shstk_update_last_frame(unsigned long val) { return 0; }
> static inline bool shstk_is_enabled(void) { return false; }
> +static inline int shstk_pop(u64 *val) { return -ENOTSUPP; }
> +static inline int shstk_push(u64 val) { return -ENOTSUPP; }
> #endif /* CONFIG_X86_USER_SHADOW_STACK */
>
> #endif /* __ASSEMBLER__ */
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -246,6 +246,46 @@ static unsigned long get_user_shstk_addr
> return ssp;
> }
>
> +int shstk_pop(u64 *val)
> +{
> + int ret = 0;
> + u64 ssp;
> +
> + if (!features_enabled(ARCH_SHSTK_SHSTK))
> + return -ENOTSUPP;
> +
> + fpregs_lock_and_load();
> +
> + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> + if (val && get_user(*val, (__user u64 *)ssp))
> + ret = -EFAULT;
> + ssp += SS_FRAME_SIZE;
> + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> +
> + fpregs_unlock();
> +
> + return ret;
> +}
> +
> +int shstk_push(u64 val)
> +{
> + u64 ssp;
> + int ret;
> +
> + if (!features_enabled(ARCH_SHSTK_SHSTK))
> + return -ENOTSUPP;
> +
> + fpregs_lock_and_load();
> +
> + rdmsrq(MSR_IA32_PL3_SSP, ssp);
> + ssp -= SS_FRAME_SIZE;
> + wrmsrq(MSR_IA32_PL3_SSP, ssp);
> + ret = write_user_shstk_64((__user void *)ssp, val);
> + fpregs_unlock();
> +
> + return ret;
> +}
> +
> #define SHSTK_DATA_BIT BIT(63)
>
> static int put_shstk_data(u64 __user *addr, u64 data)
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -804,7 +804,7 @@ SYSCALL_DEFINE0(uprobe)
> {
> struct pt_regs *regs = task_pt_regs(current);
> struct uprobe_syscall_args args;
> - unsigned long ip, sp;
> + unsigned long ip, sp, sret;
> int err;
>
> /* Allow execution only from uprobe trampolines. */
> @@ -831,6 +831,9 @@ SYSCALL_DEFINE0(uprobe)
>
> sp = regs->sp;
>
> + if (shstk_pop(&sret) == 0 && sret != args.retaddr)
> + goto sigill;
> +
> handle_syscall_uprobe(regs, regs->ip);
>
> /*
> @@ -855,6 +858,9 @@ SYSCALL_DEFINE0(uprobe)
> if (args.retaddr - 5 != regs->ip)
> args.retaddr = regs->ip;
>
> + if (shstk_push(args.retaddr) == -EFAULT)
> + goto sigill;
> +
> regs->ip = ip;
>
> err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> @@ -1124,14 +1130,6 @@ void arch_uprobe_optimize(struct arch_up
> struct mm_struct *mm = current->mm;
> uprobe_opcode_t insn[5];
>
> - /*
> - * Do not optimize if shadow stack is enabled, the return address hijack
> - * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> - * the entry uprobe is optimized and the shadow stack crashes the app.
> - */
> - if (shstk_is_enabled())
> - return;
> -
nice, we will need to adjust selftests for that; there's a shadow stack part
in prog_tests/uprobe_syscall.c that expects a non-optimized uprobe after enabling
shadow stack. I'll run it and send the change
thanks,
jirka
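(A rough sketch, not the actual selftest change: with the shstk check gone,
the shadow-stack part of prog_tests/uprobe_syscall.c would now expect the
probed nop5 to be rewritten into a call to the uprobe trampoline instead of
staying a breakpoint. The helper below only illustrates the assertion that
flips; its name and the direct read of the probed address are made up.)

#include <stdint.h>
#include <string.h>

/* hypothetical helper: has the nop5 site been optimized? */
static int nop5_is_optimized(const void *probed_addr)
{
        uint8_t insn[5];

        memcpy(insn, probed_addr, sizeof(insn));
        /* optimized:   5-byte call rel32 (opcode 0xe8) into the trampoline
         * unoptimized: int3 breakpoint (0xcc) left by the regular uprobe */
        return insn[0] == 0xe8;
}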
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes
2025-08-19 19:15 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
2025-08-20 12:30 ` Peter Zijlstra
@ 2025-09-03 6:48 ` Jiri Olsa
2 siblings, 0 replies; 44+ messages in thread
From: Jiri Olsa @ 2025-09-03 6:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Tue, Aug 19, 2025 at 09:15:15PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 20, 2025 at 01:21:20PM +0200, Jiri Olsa wrote:
>
> > +static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> > +{
> > + struct __packed __arch_relative_insn {
> > + u8 op;
> > + s32 raddr;
> > + } *call = (struct __arch_relative_insn *) insn;
>
> Not something you need to clean up now I suppose, but we could do with
> unifying this thing. we have a bunch of instances around.
found two instances below; maybe we could use 'union text_poke_insn' instead, like below?
jirka
---
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 6079d15dab8c..7fd03897d776 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -109,14 +109,10 @@ const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);
static nokprobe_inline void
__synthesize_relative_insn(void *dest, void *from, void *to, u8 op)
{
- struct __arch_relative_insn {
- u8 op;
- s32 raddr;
- } __packed *insn;
-
- insn = (struct __arch_relative_insn *)dest;
- insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
- insn->op = op;
+ union text_poke_insn *insn = dest;
+
+ insn->disp = (s32)((long)(to) - ((long)(from) + 5));
+ insn->opcode = op;
}
/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 0a8c0a4a5423..bac14f3165c3 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1046,14 +1046,11 @@ static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst,
static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
{
- struct __packed __arch_relative_insn {
- u8 op;
- s32 raddr;
- } *call = (struct __arch_relative_insn *) insn;
+ union text_poke_insn *call = (union text_poke_insn *) insn;
if (!is_call_insn(insn))
return false;
- return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+ return __in_uprobe_trampoline(vaddr + 5 + call->disp);
}
static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
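For reference, the union being reused above is (roughly, from
arch/x86/include/asm/text-patching.h) declared as below; the 'opcode' and
'disp' members are what the converted helpers rely on, with
POKE_MAX_OPCODE_SIZE being 5:

union text_poke_insn {
        u8 text[POKE_MAX_OPCODE_SIZE];
        struct {
                u8 opcode;
                s32 disp;
        } __attribute__((packed));
};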
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2025-07-20 11:38 ` Oleg Nesterov
2025-07-25 10:11 ` Masami Hiramatsu
@ 2025-09-03 18:24 ` Andrii Nakryiko
2 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-09-03 18:24 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh, Ingo Molnar
On Sun, Jul 20, 2025 at 4:23 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding a new uprobe syscall that calls uprobe handlers for a given
> 'breakpoint' address.
>
> The idea is that the 'breakpoint' address calls the user space
> trampoline which executes the uprobe syscall.
>
> The syscall handler reads the return address of the initial call
> to retrieve the original 'breakpoint' address. With this address
> we find the related uprobe object and call its consumers.
>
> Adding the arch_uprobe_trampoline_mapping function that provides
> uprobe trampoline mapping. This mapping is backed with one global
> page initialized at __init time and shared by all the mapping
> instances.
>
> We do not allow the uprobe syscall to be executed if the caller is
> not from the uprobe trampoline mapping.
>
> The uprobe syscall ensures the consumer (bpf program) sees register
> values in the state before the trampoline was called.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/kernel/uprobes.c | 139 +++++++++++++++++++++++++
> include/linux/syscalls.h | 2 +
> include/linux/uprobes.h | 1 +
> kernel/events/uprobes.c | 17 +++
> kernel/sys_ni.c | 1 +
> 6 files changed, 161 insertions(+)
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index cfb5ca41e30d..9fd1291e7bdf 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -345,6 +345,7 @@
> 333 common io_pgetevents sys_io_pgetevents
> 334 common rseq sys_rseq
> 335 common uretprobe sys_uretprobe
> +336 common uprobe sys_uprobe
> # don't use numbers 387 through 423, add new calls after the last
> # 'common' entry
> 424 common pidfd_send_signal sys_pidfd_send_signal
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 6c4dcbdd0c3c..d18e1ae59901 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -752,6 +752,145 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
> destroy_uprobe_trampoline(tramp);
> }
> +
> +static bool __in_uprobe_trampoline(unsigned long ip)
> +{
> + struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +
> + return vma && vma_is_special_mapping(vma, &tramp_mapping);
> +}
> +
> +static bool in_uprobe_trampoline(unsigned long ip)
> +{
> + struct mm_struct *mm = current->mm;
> + bool found, retry = true;
> + unsigned int seq;
> +
> + rcu_read_lock();
> + if (mmap_lock_speculate_try_begin(mm, &seq)) {
> + found = __in_uprobe_trampoline(ip);
> + retry = mmap_lock_speculate_retry(mm, seq);
> + }
> + rcu_read_unlock();
> +
> + if (retry) {
> + mmap_read_lock(mm);
> + found = __in_uprobe_trampoline(ip);
> + mmap_read_unlock(mm);
> + }
> + return found;
> +}
> +
> +/*
> + * See uprobe syscall trampoline; the call to the trampoline will push
> + * the return address on the stack, the trampoline itself then pushes
> + * cx, r11 and ax.
> + */
> +struct uprobe_syscall_args {
> + unsigned long ax;
> + unsigned long r11;
> + unsigned long cx;
> + unsigned long retaddr;
> +};
> +
> +SYSCALL_DEFINE0(uprobe)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> + struct uprobe_syscall_args args;
> + unsigned long ip, sp;
> + int err;
> +
> + /* Allow execution only from uprobe trampolines. */
> + if (!in_uprobe_trampoline(regs->ip))
> + goto sigill;
Hey Jiri,
So I've been thinking about the simplest and most reliable way to
feature-detect support for this sys_uprobe (e.g., for libbpf to know
whether we should attach at nop5 vs nop1), and clearly that would be
to try to call the uprobe() syscall not from a trampoline and expect
some error code.
How bad would it be to change this part to return some unique-enough
error code (-ENXIO, -EDOM, whatever)?
Is there any reason not to do this? Security-wise it will be just fine, right?
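A rough user-space sketch of that kind of probing (not libbpf code; the
syscall number 336 comes from the syscall_64.tbl hunk earlier in the thread,
and the check covers both the current SIGILL behaviour and a hypothetical
error return):

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_uprobe
#define __NR_uprobe 336         /* from the syscall_64.tbl hunk above */
#endif

static bool sys_uprobe_supported(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return false;
        if (pid == 0) {
                /* called outside any uprobe trampoline on purpose */
                long ret = syscall(__NR_uprobe);

                /* exit 1 only when the kernel clearly lacks the syscall */
                exit(ret < 0 && errno == ENOSYS ? 1 : 0);
        }
        if (waitpid(pid, &status, 0) < 0)
                return false;
        if (WIFSIGNALED(status))        /* SIGILL => syscall exists */
                return true;
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}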
> +
> + err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> + if (err)
> + goto sigill;
> +
> + ip = regs->ip;
> +
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2025-09-03 18:24 UTC | newest]
Thread overview: 44+ messages
2025-07-20 11:21 [PATCHv6 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 02/22] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 03/22] uprobes: Make copy_from_page global Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 04/22] uprobes: Add uprobe_write function Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 05/22] uprobes: Add nbytes argument to uprobe_write Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 06/22] uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 07/22] uprobes: Add do_ref_ctr argument to uprobe_write function Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
2025-08-19 14:53 ` Peter Zijlstra
2025-08-20 12:18 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2025-07-20 11:38 ` Oleg Nesterov
2025-07-25 10:11 ` Masami Hiramatsu
2025-09-03 18:24 ` Andrii Nakryiko
2025-07-20 11:21 ` [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2025-07-25 10:13 ` Masami Hiramatsu
2025-07-28 21:34 ` Jiri Olsa
2025-08-08 17:44 ` Jiri Olsa
2025-08-19 19:17 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
2025-08-19 19:15 ` Peter Zijlstra
2025-08-20 12:19 ` Jiri Olsa
2025-08-20 13:01 ` Peter Zijlstra
2025-08-20 12:30 ` Peter Zijlstra
2025-08-20 15:58 ` Edgecombe, Rick P
2025-08-20 17:12 ` Peter Zijlstra
2025-08-20 17:26 ` Edgecombe, Rick P
2025-08-20 17:43 ` Peter Zijlstra
2025-08-20 18:04 ` Edgecombe, Rick P
2025-08-20 21:38 ` Jiri Olsa
2025-09-03 6:48 ` Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 11/22] selftests/bpf: Import usdt.h from libbpf/usdt project Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 12/22] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 13/22] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 14/22] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 15/22] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 16/22] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 17/22] selftests/bpf: Add optimized usdt variant for basic usdt test Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 18/22] selftests/bpf: Add uprobe_regs_equal test Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 19/22] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 20/22] seccomp: passthrough uprobe systemcall without filtering Jiri Olsa
2025-07-20 11:21 ` [PATCHv6 perf/core 21/22] selftests/seccomp: validate uprobe syscall passes through seccomp Jiri Olsa
2025-07-20 11:21 ` [PATCHv5 22/22] man2: Add uprobe syscall page Jiri Olsa