* [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
hi,
this patchset adds support to optimize usdt probes that sit on top of a
5-byte nop instruction.
The generic approach (optimizing all uprobes) is hard, because it would
have to emulate many different original instructions, with all the
issues that brings. The usdt case, which stores a 5-byte nop, seems
much easier, so we are starting with that.
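
For context, a usdt probe site is just a nop instruction plus an ELF
note recording its address; a minimal sketch using the systemtap
<sys/sdt.h> macros (note the stock header emits a 1-byte nop on x86_64;
the selftests change later in the series switches their sdt.h copy to a
5-byte nop):

  #include <sys/sdt.h>

  int main(void)
  {
          /* emits a nop at this spot plus an ELF note ("myapp", "probe")
           * carrying its address, which tools use to attach a uprobe */
          STAP_PROBE(myapp, probe);
          return 0;
  }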
The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.
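
Conceptually, the probed site goes from trapping to calling; a sketch
based on the trampoline added in patch 8:

  /*
   * current uprobe on a nop5 site:
   *     <probe>:  int3                  # breakpoint exception (slow)
   *
   * optimized uprobe:
   *     <probe>:  call <trampoline>     # 5-byte rel32 call
   *     <trampoline>:
   *               endbr64
   *               push %rcx; push %r11; push %rax
   *               movq $__NR_uprobe, %rax
   *               syscall               # kernel runs the uprobe consumers
   *               pop %rax; pop %r11; pop %rcx
   *               ret
   */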
The run_bench_uprobes.sh benchmark triggers a uprobe (on top of
different original instructions) in a loop and counts how many triggers
happen per second (the unit below is millions of loops per second).
There's a big speedup if you compare the current usdt implementation
(uprobe-nop) with the proposed usdt (uprobe-nop5):
# ./benchs/run_bench_uprobes.sh
usermode-count : 818.386 ± 1.886M/s
syscall-count : 8.923 ± 0.003M/s
--> uprobe-nop : 3.086 ± 0.005M/s
uprobe-push : 2.751 ± 0.001M/s
uprobe-ret : 1.481 ± 0.000M/s
--> uprobe-nop5 : 4.016 ± 0.002M/s
uretprobe-nop : 1.712 ± 0.008M/s
uretprobe-push : 1.616 ± 0.001M/s
uretprobe-ret : 1.052 ± 0.000M/s
uretprobe-nop5 : 2.015 ± 0.000M/s
rfc v2 changes:
- make uretprobe work properly with optimized uprobe
- make the uprobe optimized code x86_64 specific [Peter]
- rework the verify function logic, using it now as callback
- fix find_nearest_page to include [PAGE_SIZE, ... ] area [Andrii]
- try lockless vma lookup in in_uprobe_trampoline [Peter]
- do the instruction updates piece by piece using int3, like in text_poke_bp_batch [David]
- map uprobe trampoline via single global page [Thomas]
- keep track of uprobes per mm_struct
pending todo (follow ups):
- use PROCMAP_QUERY in tests
- alloc 'struct uprobes_state' for mm_struct only when needed [Andrii]
- seccomp change for new uprobe syscall (same as for uretprobe)
thanks,
jirka
---
Jiri Olsa (18):
uprobes: Rename arch_uretprobe_trampoline function
uprobes: Make copy_from_page global
uprobes: Move ref_ctr_offset update out of uprobe_write_opcode
uprobes: Add uprobe_write function
uprobes: Add nbytes argument to uprobe_write_opcode
uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
uprobes: Add swbp argument to arch_uretprobe_hijack_return_addr
uprobes/x86: Add uprobe syscall to speed up uprobe
uprobes/x86: Add mapping for optimized uprobe trampolines
uprobes/x86: Add mm_uprobe objects to track uprobes within mm
uprobes/x86: Add support to emulate nop5 instruction
uprobes/x86: Add support to optimize uprobes
selftests/bpf: Reorg the uprobe_syscall test function
selftests/bpf: Use 5-byte nop for x86 usdt probes
selftests/bpf: Add uprobe/usdt syscall tests
selftests/bpf: Add hit/attach/detach race optimized uprobe test
selftests/bpf: Add uprobe syscall sigill signal test
selftests/bpf: Add 5-byte nop uprobe trigger bench
arch/arm/probes/uprobes/core.c | 4 +-
arch/arm64/kernel/probes/uprobes.c | 2 +-
arch/csky/kernel/probes/uprobes.c | 2 +-
arch/loongarch/kernel/uprobes.c | 2 +-
arch/mips/kernel/uprobes.c | 2 +-
arch/powerpc/kernel/uprobes.c | 2 +-
arch/riscv/kernel/probes/uprobes.c | 2 +-
arch/s390/kernel/uprobes.c | 2 +-
arch/sparc/kernel/uprobes.c | 2 +-
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/uprobes.h | 6 ++
arch/x86/kernel/uprobes.c | 530 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/syscalls.h | 2 +
include/linux/uprobes.h | 23 +++-
kernel/events/uprobes.c | 147 +++++++++++++++++--------
kernel/fork.c | 1 +
kernel/sys_ni.c | 1 +
tools/testing/selftests/bpf/bench.c | 12 +++
tools/testing/selftests/bpf/benchs/bench_trigger.c | 42 ++++++++
tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 342 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c | 34 +++++-
tools/testing/selftests/bpf/sdt.h | 9 +-
23 files changed, 1093 insertions(+), 79 deletions(-)
* [PATCH RFCv2 01/18] uprobes: Rename arch_uretprobe_trampoline function
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
We are about to add a uprobe trampoline, so we are cleaning up the namespace.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 2 +-
include/linux/uprobes.h | 2 +-
kernel/events/uprobes.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 5a952c5ea66b..22a17c149a55 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -338,7 +338,7 @@ extern u8 uretprobe_trampoline_entry[];
extern u8 uretprobe_trampoline_end[];
extern u8 uretprobe_syscall_check[];
-void *arch_uprobe_trampoline(unsigned long *psize)
+void *arch_uretprobe_trampoline(unsigned long *psize)
{
static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
struct pt_regs *regs = task_pt_regs(current);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a40efdda9052..e3ed0d5c5ffe 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -222,7 +222,7 @@ extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
extern void uprobe_handle_trampoline(struct pt_regs *regs);
-extern void *arch_uprobe_trampoline(unsigned long *psize);
+extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 870f69780900..cf4c4e730f0d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1699,7 +1699,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
return ret;
}
-void * __weak arch_uprobe_trampoline(unsigned long *psize)
+void * __weak arch_uretprobe_trampoline(unsigned long *psize)
{
static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
@@ -1731,7 +1731,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
init_waitqueue_head(&area->wq);
/* Reserve the 1st slot for get_trampoline_vaddr() */
set_bit(0, area->bitmap);
- insns = arch_uprobe_trampoline(&insns_size);
+ insns = arch_uretprobe_trampoline(&insns_size);
arch_uprobe_copy_ixol(area->page, 0, insns, insns_size);
if (!xol_add_vma(mm, area))
--
2.48.1
* [PATCH RFCv2 02/18] uprobes: Make copy_from_page global
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Making copy_from_page global and adding the uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 16 ++++++++--------
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e3ed0d5c5ffe..8b295c1f28bf 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -224,6 +224,7 @@ extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
extern void uprobe_handle_trampoline(struct pt_regs *regs);
extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
+extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cf4c4e730f0d..f07b6b7b199c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -250,14 +250,14 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
return is_swbp_insn(insn);
}
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
{
void *kaddr = kmap_atomic(page);
memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
kunmap_atomic(kaddr);
}
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
{
void *kaddr = kmap_atomic(page);
memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
@@ -278,7 +278,7 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
* is a trap variant; uprobes always wins over any other (gdb)
* breakpoint.
*/
- copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
is_swbp = is_swbp_insn(&old_opcode);
if (is_swbp_insn(new_opcode)) {
@@ -525,7 +525,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
__SetPageUptodate(new_page);
copy_highpage(new_page, old_page);
- copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
if (!is_register) {
struct page *orig_page;
@@ -1027,7 +1027,7 @@ static int __copy_insn(struct address_space *mapping, struct file *filp,
if (IS_ERR(page))
return PTR_ERR(page);
- copy_from_page(page, offset, insn, nbytes);
+ uprobe_copy_from_page(page, offset, insn, nbytes);
put_page(page);
return 0;
@@ -1371,7 +1371,7 @@ struct uprobe *uprobe_register(struct inode *inode,
return ERR_PTR(-EINVAL);
/*
- * This ensures that copy_from_page(), copy_to_page() and
+ * This ensures that uprobe_copy_from_page(), uprobe_copy_to_page() and
* __update_ref_ctr() can't cross page boundary.
*/
if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
@@ -1860,7 +1860,7 @@ void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len)
{
/* Initialize the slot */
- copy_to_page(page, vaddr, src, len);
+ uprobe_copy_to_page(page, vaddr, src, len);
/*
* We probably need flush_icache_user_page() but it needs vma.
@@ -2355,7 +2355,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
if (result < 0)
return result;
- copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_from_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
put_page(page);
out:
/* This needs to return true for any variant of the trap insn */
--
2.48.1
* [PATCH RFCv2 03/18] uprobes: Move ref_ctr_offset update out of uprobe_write_opcode
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
The uprobe_write_opcode function currently also updates the reference
counter if there's a ref_ctr_offset defined for the uprobe.

This is not handy for the following changes, which need to make several
updates (writes) to install or remove a uprobe, but must update the
reference counter just once.

Adding set_swbp_refctr/set_orig_refctr, which make sure the reference
counter is updated around the instruction write.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
kernel/events/uprobes.c | 50 ++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 18 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f07b6b7b199c..86c4eeda0ed9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -473,15 +473,13 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode)
{
- struct uprobe *uprobe;
struct page *old_page, *new_page;
struct vm_area_struct *vma;
- int ret, is_register, ref_ctr_updated = 0;
+ int ret, is_register;
bool orig_page_huge = false;
unsigned int gup_flags = FOLL_FORCE;
is_register = is_swbp_insn(&opcode);
- uprobe = container_of(auprobe, struct uprobe, arch);
retry:
if (is_register)
@@ -501,15 +499,6 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
goto put_old;
}
- /* We are going to replace instruction, update ref_ctr. */
- if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
- ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
- if (ret)
- goto put_old;
-
- ref_ctr_updated = 1;
- }
-
ret = 0;
if (!is_register && !PageAnon(old_page))
goto put_old;
@@ -560,10 +549,6 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (unlikely(ret == -EAGAIN))
goto retry;
- /* Revert back reference counter if instruction update failed. */
- if (ret && is_register && ref_ctr_updated)
- update_ref_ctr(uprobe, mm, -1);
-
/* try collapse pmd for compound page */
if (!ret && orig_page_huge)
collapse_pte_mapped_thp(mm, vaddr, false);
@@ -585,6 +570,25 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
}
+static int set_swbp_refctr(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+ int err;
+
+ /* We are going to replace instruction, update ref_ctr. */
+ if (uprobe->ref_ctr_offset) {
+ err = update_ref_ctr(uprobe, mm, 1);
+ if (err)
+ return err;
+ }
+
+ err = set_swbp(&uprobe->arch, mm, vaddr);
+
+ /* Revert back reference counter if instruction update failed. */
+ if (err && uprobe->ref_ctr_offset)
+ update_ref_ctr(uprobe, mm, -1);
+ return err;
+}
+
/**
* set_orig_insn - Restore the original instruction.
* @mm: the probed process address space.
@@ -601,6 +605,16 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
*(uprobe_opcode_t *)&auprobe->insn);
}
+static int set_orig_refctr(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+ int err = set_orig_insn(&uprobe->arch, mm, vaddr);
+
+ /* Revert back reference counter even if instruction update failed. */
+ if (uprobe->ref_ctr_offset)
+ update_ref_ctr(uprobe, mm, -1);
+ return err;
+}
+
/* uprobe should have guaranteed positive refcount */
static struct uprobe *get_uprobe(struct uprobe *uprobe)
{
@@ -1133,7 +1147,7 @@ install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
if (first_uprobe)
set_bit(MMF_HAS_UPROBES, &mm->flags);
- ret = set_swbp(&uprobe->arch, mm, vaddr);
+ ret = set_swbp_refctr(uprobe, mm, vaddr);
if (!ret)
clear_bit(MMF_RECALC_UPROBES, &mm->flags);
else if (first_uprobe)
@@ -1146,7 +1160,7 @@ static int
remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
{
set_bit(MMF_RECALC_UPROBES, &mm->flags);
- return set_orig_insn(&uprobe->arch, mm, vaddr);
+ return set_orig_refctr(uprobe, mm, vaddr);
}
struct map_info {
--
2.48.1
* [PATCH RFCv2 04/18] uprobes: Add uprobe_write function
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding a uprobe_write function that does what uprobe_write_opcode did
so far, but allows passing a verify callback function that checks the
memory location before writing the opcode.

It will be used in the following changes to simplify the checking logic.

uprobe_write_opcode now calls uprobe_write with verify_opcode as the
verify callback.
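
For illustration, a hypothetical verify callback matching the new
typedef could refuse to overwrite anything that is not currently a
breakpoint; uprobe_write proceeds with the write only when verify
returns a positive value, mirroring verify_opcode:

  static int verify_is_swbp(struct page *page, unsigned long vaddr,
                            uprobe_opcode_t *insn)
  {
          uprobe_opcode_t old;

          uprobe_copy_from_page(page, vaddr, &old, UPROBE_SWBP_INSN_SIZE);
          return is_swbp_insn(&old) ? 1 : -EINVAL;
  }

  /* used e.g. as: */
  err = uprobe_write(auprobe, mm, vaddr, &insn, verify_is_swbp);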
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 4 ++++
kernel/events/uprobes.c | 13 ++++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8b295c1f28bf..732de4b796ce 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -185,6 +185,8 @@ struct uprobes_state {
struct xol_area *xol_area;
};
+typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode);
+
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
@@ -193,6 +195,8 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
+ uprobe_opcode_t *opcode, uprobe_write_verify_t verify);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 86c4eeda0ed9..50560b307522 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -472,6 +472,13 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
*/
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode)
+{
+ return uprobe_write(auprobe, mm, vaddr, &opcode, verify_opcode);
+}
+
+int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
+ unsigned long vaddr, uprobe_opcode_t *opcode,
+ uprobe_write_verify_t verify)
{
struct page *old_page, *new_page;
struct vm_area_struct *vma;
@@ -479,7 +486,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
bool orig_page_huge = false;
unsigned int gup_flags = FOLL_FORCE;
- is_register = is_swbp_insn(&opcode);
+ is_register = is_swbp_insn(opcode);
retry:
if (is_register)
@@ -489,7 +496,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (IS_ERR(old_page))
return PTR_ERR(old_page);
- ret = verify_opcode(old_page, vaddr, &opcode);
+ ret = verify(old_page, vaddr, opcode);
if (ret <= 0)
goto put_old;
@@ -514,7 +521,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
__SetPageUptodate(new_page);
copy_highpage(new_page, old_page);
- uprobe_copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_to_page(new_page, vaddr, opcode, UPROBE_SWBP_INSN_SIZE);
if (!is_register) {
struct page *orig_page;
--
2.48.1
* [PATCH RFCv2 05/18] uprobes: Add nbytes argument to uprobe_write_opcode
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding an nbytes argument to uprobe_write_opcode in preparation for
writing whole instructions in the following changes.
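
A sketch of the two call shapes this enables (verify_nop5 and insn5 are
hypothetical, standing in for a callback and buffer from later patches):

  uprobe_opcode_t swbp = UPROBE_SWBP_INSN;

  /* single-byte breakpoint write, as before */
  uprobe_write(auprobe, mm, vaddr, &swbp, UPROBE_SWBP_INSN_SIZE, verify_opcode);

  /* whole 5-byte instruction write, enabled by this change */
  uprobe_write(auprobe, mm, vaddr, (uprobe_opcode_t *)insn5, 5, verify_nop5);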
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
include/linux/uprobes.h | 4 ++--
kernel/events/uprobes.c | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 732de4b796ce..8867b6a168b2 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -185,7 +185,7 @@ struct uprobes_state {
struct xol_area *xol_area;
};
-typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode);
+typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode, int nbytes);
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
@@ -196,7 +196,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
extern int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
- uprobe_opcode_t *opcode, uprobe_write_verify_t verify);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 50560b307522..ad5879fc2d26 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -264,7 +264,7 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
kunmap_atomic(kaddr);
}
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, int nbytes)
{
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -473,12 +473,12 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode)
{
- return uprobe_write(auprobe, mm, vaddr, &opcode, verify_opcode);
+ return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
}
int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
- unsigned long vaddr, uprobe_opcode_t *opcode,
- uprobe_write_verify_t verify)
+ unsigned long vaddr, uprobe_opcode_t *insn,
+ int nbytes, uprobe_write_verify_t verify)
{
struct page *old_page, *new_page;
struct vm_area_struct *vma;
@@ -486,7 +486,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
bool orig_page_huge = false;
unsigned int gup_flags = FOLL_FORCE;
- is_register = is_swbp_insn(opcode);
+ is_register = is_swbp_insn(insn);
retry:
if (is_register)
@@ -496,7 +496,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (IS_ERR(old_page))
return PTR_ERR(old_page);
- ret = verify(old_page, vaddr, opcode);
+ ret = verify(old_page, vaddr, insn, nbytes);
if (ret <= 0)
goto put_old;
@@ -521,7 +521,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
__SetPageUptodate(new_page);
copy_highpage(new_page, old_page);
- uprobe_copy_to_page(new_page, vaddr, opcode, UPROBE_SWBP_INSN_SIZE);
+ uprobe_copy_to_page(new_page, vaddr, insn, nbytes);
if (!is_register) {
struct page *orig_page;
--
2.48.1
* [PATCH RFCv2 06/18] uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
uprobe_write has a special path to restore the original page when we
write the original instruction back.

This happens when uprobe_write detects that we want to write anything
other than the breakpoint instruction.

In the following changes we want to use the uprobe_write function for
multiple updates, so we are adding a new function argument to denote
that this is an original-instruction update. This way uprobe_write can
make the appropriate checks and restore the original page when possible.
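
The two call shapes after this change, both visible in the diff below:

  /* install: write the breakpoint into an anonymous copy of the page */
  uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN, false);

  /* uninstall: write the original instruction back; with orig set,
   * uprobe_write may drop the anonymous page and restore the original
   * file-backed page */
  uprobe_write_opcode(auprobe, mm, vaddr,
                      *(uprobe_opcode_t *)&auprobe->insn, true);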
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/arm/probes/uprobes/core.c | 2 +-
include/linux/uprobes.h | 5 +++--
kernel/events/uprobes.c | 22 ++++++++++------------
3 files changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index f5f790c6e5f8..54a90b565285 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -30,7 +30,7 @@ int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr)
{
return uprobe_write_opcode(auprobe, mm, vaddr,
- __opcode_to_mem_arm(auprobe->bpinsn));
+ __opcode_to_mem_arm(auprobe->bpinsn), false);
}
bool arch_uprobe_ignore(struct arch_uprobe *auprobe, struct pt_regs *regs)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8867b6a168b2..1abcae9cde48 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -194,9 +194,10 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
extern bool is_trap_insn(uprobe_opcode_t *insn);
extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
+ uprobe_opcode_t, bool);
extern int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
- uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool orig);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ad5879fc2d26..2b542043089e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -471,25 +471,23 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
* Return 0 (success) or a negative errno.
*/
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
- unsigned long vaddr, uprobe_opcode_t opcode)
+ unsigned long vaddr, uprobe_opcode_t opcode, bool orig)
{
- return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
+ return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode, orig);
}
int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t *insn,
- int nbytes, uprobe_write_verify_t verify)
+ int nbytes, uprobe_write_verify_t verify, bool orig)
{
struct page *old_page, *new_page;
struct vm_area_struct *vma;
- int ret, is_register;
+ int ret;
bool orig_page_huge = false;
unsigned int gup_flags = FOLL_FORCE;
- is_register = is_swbp_insn(insn);
-
retry:
- if (is_register)
+ if (!orig)
gup_flags |= FOLL_SPLIT_PMD;
/* Read the page with vaddr into memory */
old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
@@ -500,14 +498,14 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (ret <= 0)
goto put_old;
- if (WARN(!is_register && PageCompound(old_page),
+ if (WARN(orig && PageCompound(old_page),
"uprobe unregister should never work on compound page\n")) {
ret = -EINVAL;
goto put_old;
}
ret = 0;
- if (!is_register && !PageAnon(old_page))
+ if (orig && !PageAnon(old_page))
goto put_old;
ret = anon_vma_prepare(vma);
@@ -523,7 +521,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
copy_highpage(new_page, old_page);
uprobe_copy_to_page(new_page, vaddr, insn, nbytes);
- if (!is_register) {
+ if (orig) {
struct page *orig_page;
pgoff_t index;
@@ -574,7 +572,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
*/
int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
{
- return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
+ return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN, false);
}
static int set_swbp_refctr(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
@@ -609,7 +607,7 @@ int __weak
set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
{
return uprobe_write_opcode(auprobe, mm, vaddr,
- *(uprobe_opcode_t *)&auprobe->insn);
+ *(uprobe_opcode_t *)&auprobe->insn, true);
}
static int set_orig_refctr(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vaddr)
--
2.48.1
* [PATCH RFCv2 07/18] uprobes: Add swbp argument to arch_uretprobe_hijack_return_addr
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding a swbp argument to arch_uretprobe_hijack_return_addr that is
passed all the way from the handle_swbp function, so we can add extra
logic when it's called from a syscall in the following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/arm/probes/uprobes/core.c | 2 +-
arch/arm64/kernel/probes/uprobes.c | 2 +-
arch/csky/kernel/probes/uprobes.c | 2 +-
arch/loongarch/kernel/uprobes.c | 2 +-
arch/mips/kernel/uprobes.c | 2 +-
arch/powerpc/kernel/uprobes.c | 2 +-
arch/riscv/kernel/probes/uprobes.c | 2 +-
arch/s390/kernel/uprobes.c | 2 +-
arch/sparc/kernel/uprobes.c | 2 +-
arch/x86/kernel/uprobes.c | 2 +-
include/linux/uprobes.h | 3 ++-
kernel/events/uprobes.c | 10 +++++-----
12 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index 54a90b565285..4d854a310155 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -59,7 +59,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
unsigned long
arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long orig_ret_vaddr;
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index cb3d05af36e3..a38bf900bca1 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -156,7 +156,7 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
unsigned long
arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long orig_ret_vaddr;
diff --git a/arch/csky/kernel/probes/uprobes.c b/arch/csky/kernel/probes/uprobes.c
index 936bea6fd32d..81a950a73cb6 100644
--- a/arch/csky/kernel/probes/uprobes.c
+++ b/arch/csky/kernel/probes/uprobes.c
@@ -124,7 +124,7 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
unsigned long
arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long ra;
diff --git a/arch/loongarch/kernel/uprobes.c b/arch/loongarch/kernel/uprobes.c
index 87abc7137b73..d8b221fafb71 100644
--- a/arch/loongarch/kernel/uprobes.c
+++ b/arch/loongarch/kernel/uprobes.c
@@ -96,7 +96,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
}
unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long ra = regs->regs[1];
diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c
index 401b148f8917..2727bcfd5331 100644
--- a/arch/mips/kernel/uprobes.c
+++ b/arch/mips/kernel/uprobes.c
@@ -196,7 +196,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *aup,
}
unsigned long arch_uretprobe_hijack_return_addr(
- unsigned long trampoline_vaddr, struct pt_regs *regs)
+ unsigned long trampoline_vaddr, struct pt_regs *regs, bool swbp)
{
unsigned long ra;
diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
index 95a41ae9dfa7..f3b3feb44586 100644
--- a/arch/powerpc/kernel/uprobes.c
+++ b/arch/powerpc/kernel/uprobes.c
@@ -195,7 +195,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
}
unsigned long
-arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs)
+arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs, bool swbp)
{
unsigned long orig_ret_vaddr;
diff --git a/arch/riscv/kernel/probes/uprobes.c b/arch/riscv/kernel/probes/uprobes.c
index 4b3dc8beaf77..ef8bc1f6a04e 100644
--- a/arch/riscv/kernel/probes/uprobes.c
+++ b/arch/riscv/kernel/probes/uprobes.c
@@ -128,7 +128,7 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
unsigned long
arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long ra;
diff --git a/arch/s390/kernel/uprobes.c b/arch/s390/kernel/uprobes.c
index 5b0633ea8d93..48f79d9a25e9 100644
--- a/arch/s390/kernel/uprobes.c
+++ b/arch/s390/kernel/uprobes.c
@@ -141,7 +141,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
}
unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long orig;
diff --git a/arch/sparc/kernel/uprobes.c b/arch/sparc/kernel/uprobes.c
index 305017bec164..aef4e0ff38f0 100644
--- a/arch/sparc/kernel/uprobes.c
+++ b/arch/sparc/kernel/uprobes.c
@@ -310,7 +310,7 @@ bool arch_uprobe_xol_was_trapped(struct task_struct *t)
unsigned long
arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
- struct pt_regs *regs)
+ struct pt_regs *regs, bool swbp)
{
unsigned long orig_ret_vaddr = regs->u_regs[UREG_I7];
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 22a17c149a55..b06f3cd7551a 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1180,7 +1180,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
}
unsigned long
-arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs)
+arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs, bool swbp)
{
int rasize = sizeof_long(regs), nleft;
unsigned long orig_ret_vaddr = 0; /* clear high bits for 32-bit apps */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 1abcae9cde48..6c3c90a0d110 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -221,7 +221,8 @@ extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
extern bool arch_uprobe_xol_was_trapped(struct task_struct *tsk);
extern int arch_uprobe_exception_notify(struct notifier_block *self, unsigned long val, void *data);
extern void arch_uprobe_abort_xol(struct arch_uprobe *aup, struct pt_regs *regs);
-extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs);
+extern unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs,
+ bool swbp);
extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx, struct pt_regs *regs);
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2b542043089e..cfcde7295e15 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2207,7 +2207,7 @@ static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
}
static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
- struct return_instance *ri)
+ struct return_instance *ri, bool swbp)
{
struct uprobe_task *utask = current->utask;
unsigned long orig_ret_vaddr, trampoline_vaddr;
@@ -2225,7 +2225,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
}
trampoline_vaddr = uprobe_get_trampoline_vaddr();
- orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs);
+ orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs, swbp);
if (orig_ret_vaddr == -1)
goto free;
@@ -2503,7 +2503,7 @@ static bool ignore_ret_handler(int rc)
return rc == UPROBE_HANDLER_REMOVE || rc == UPROBE_HANDLER_IGNORE;
}
-static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs, bool swbp)
{
struct uprobe_consumer *uc;
bool has_consumers = false, remove = true;
@@ -2538,7 +2538,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
utask->auprobe = NULL;
if (!ZERO_OR_NULL_PTR(ri))
- prepare_uretprobe(uprobe, regs, ri);
+ prepare_uretprobe(uprobe, regs, ri, swbp);
if (remove && has_consumers) {
down_read(&uprobe->register_rwsem);
@@ -2720,7 +2720,7 @@ static void handle_swbp(struct pt_regs *regs)
if (arch_uprobe_ignore(&uprobe->arch, regs))
goto out;
- handler_chain(uprobe, regs);
+ handler_chain(uprobe, regs, true);
if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
goto out;
--
2.48.1
* [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding a new uprobe syscall that calls the uprobe handlers for a given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline, which executes the uprobe syscall.

The syscall handler reads the return address of the initial call to
retrieve the original 'breakpoint' address. With this address we find
the related uprobe object and call its consumers.

Adding the tramp_mapping object that provides the uprobe trampoline
mapping. The mapping is backed by one global page initialized at
__init time and shared by all the mapping instances.

We do not allow the uprobe syscall to be executed if the caller is not
from the uprobe trampoline mapping.
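
For reference, the user stack at the time the uprobe syscall runs looks
like this (derived from the trampoline code below; it explains both the
regs->sp + 3*8 read and the '- 5' adjustment in the handler):

  /*
   *   regs->sp + 3*8:  return address pushed by the 5-byte call at
   *                    the probed site, i.e. bp_vaddr + 5
   *   regs->sp + 2*8:  saved %rcx
   *   regs->sp + 1*8:  saved %r11
   *   regs->sp + 0*8:  saved %rax
   *
   * so the handler loads the value at regs->sp + 3*8 and subtracts
   * the call length (5) to recover the original probe address.
   */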
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/kernel/uprobes.c | 87 ++++++++++++++++++++++++++
include/linux/syscalls.h | 2 +
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 22 +++++++
kernel/sys_ni.c | 1 +
6 files changed, 114 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5eb708bff1c7..88e388c7675b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
333 common io_pgetevents sys_io_pgetevents
334 common rseq sys_rseq
335 common uretprobe sys_uretprobe
+336 common uprobe sys_uprobe
# don't use numbers 387 through 423, add new calls after the last
# 'common' entry
424 common pidfd_send_signal sys_pidfd_send_signal
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index b06f3cd7551a..3ea682dbeb39 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -425,6 +425,93 @@ SYSCALL_DEFINE0(uretprobe)
return -1;
}
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+ return -EPERM;
+}
+
+static struct page *tramp_mapping_pages[2] __ro_after_init;
+
+static struct vm_special_mapping tramp_mapping = {
+ .name = "[uprobes-trampoline]",
+ .mremap = tramp_mremap,
+ .pages = tramp_mapping_pages,
+};
+
+static bool in_uprobe_trampoline(unsigned long ip)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ bool found, retry = true;
+ unsigned int seq;
+
+ rcu_read_lock();
+ if (mmap_lock_speculate_try_begin(mm, &seq)) {
+ vma = vma_lookup(current->mm, ip);
+ found = vma && vma_is_special_mapping(vma, &tramp_mapping);
+ retry = mmap_lock_speculate_retry(mm, seq);
+ }
+ rcu_read_unlock();
+
+ if (retry) {
+ mmap_read_lock(mm);
+ vma = vma_lookup(current->mm, ip);
+ found = vma && vma_is_special_mapping(vma, &tramp_mapping);
+ mmap_read_unlock(mm);
+ }
+ return found;
+}
+
+SYSCALL_DEFINE0(uprobe)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long bp_vaddr;
+ int err;
+
+ err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
+ if (err) {
+ force_sig(SIGILL);
+ return -1;
+ }
+
+ /* Allow execution only from uprobe trampolines. */
+ if (!in_uprobe_trampoline(regs->ip)) {
+ force_sig(SIGILL);
+ return -1;
+ }
+
+ handle_syscall_uprobe(regs, bp_vaddr - 5);
+ return 0;
+}
+
+asm (
+ ".pushsection .rodata\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ "uprobe_trampoline_entry:\n"
+ "endbr64\n"
+ "push %rcx\n"
+ "push %r11\n"
+ "push %rax\n"
+ "movq $" __stringify(__NR_uprobe) ", %rax\n"
+ "syscall\n"
+ "pop %rax\n"
+ "pop %r11\n"
+ "pop %rcx\n"
+ "ret\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ ".popsection\n"
+);
+
+extern u8 uprobe_trampoline_entry[];
+
+static int __init arch_uprobes_init(void)
+{
+ tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
+ return 0;
+}
+
+late_initcall(arch_uprobes_init);
+
/*
* If arch_uprobe->insn doesn't use rip-relative addressing, return
* immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c6333204d451..002f4e1debe5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -994,6 +994,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
asmlinkage long sys_uretprobe(void);
+asmlinkage long sys_uprobe(void);
+
/* pciconfig: alpha, arm, arm64, ia64, sparc */
asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6c3c90a0d110..de3631ae1746 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -231,6 +231,7 @@ extern void uprobe_handle_trampoline(struct pt_regs *regs);
extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
+extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cfcde7295e15..6ac691fe5682 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2733,6 +2733,28 @@ static void handle_swbp(struct pt_regs *regs)
rcu_read_unlock_trace();
}
+void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
+{
+ struct uprobe *uprobe;
+ int is_swbp;
+
+ rcu_read_lock_trace();
+ uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
+ if (!uprobe)
+ goto unlock;
+
+ if (!get_utask())
+ goto unlock;
+
+ if (arch_uprobe_ignore(&uprobe->arch, regs))
+ goto unlock;
+
+ handler_chain(uprobe, regs, false);
+
+ unlock:
+ rcu_read_unlock_trace();
+}
+
/*
* Perform required fix-ups and disable singlestep.
* Allow pending signals to take effect.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..bf5d05c635ff 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
COND_SYSCALL(rseq);
COND_SYSCALL(uretprobe);
+COND_SYSCALL(uprobe);
--
2.48.1
* [PATCH RFCv2 09/18] uprobes/x86: Add mapping for optimized uprobe trampolines
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding support to add a special mapping for the user space trampoline,
with the following functions:

  uprobe_trampoline_get - find or add a related uprobe_trampoline
  uprobe_trampoline_put - drop a reference or destroy the uprobe_trampoline

The user space trampoline is exported as an architecture specific user
space special mapping (the tramp_mapping added in the previous patch).

The uprobe trampoline needs to be callable/reachable from the probed
address, so while searching for an available address we use the
is_reachable_by_call function to decide whether the trampoline is
callable from the probed address.

All uprobe_trampoline objects are stored in the uprobes_state object
and are cleaned up when the process mm_struct goes down. Adding new
arch hooks for that, because this change is x86_64 specific.

Locking is provided by the callers in the following changes.
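
As a worked example of the reachability rule, here is a userspace
sketch mirroring the is_reachable_by_call check from the diff (a rel32
call encodes a signed 32-bit displacement relative to the instruction
following the 5-byte call):

  #include <limits.h>
  #include <stdbool.h>
  #include <stdio.h>

  static bool reachable(unsigned long vtramp, unsigned long vaddr)
  {
          /* same arithmetic as the kernel helper */
          long delta = (long)(vaddr + 5 - vtramp);

          return delta >= INT_MIN && delta <= INT_MAX;
  }

  int main(void)
  {
          unsigned long probe = 0x400000;

          /* trampoline page just below the probe: reachable (prints 1) */
          printf("%d\n", reachable(probe - 0x1000, probe));
          /* trampoline 3GB above the probe: rel32 cannot encode it (prints 0) */
          printf("%d\n", reachable(probe + (3UL << 30), probe));
          return 0;
  }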
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 121 ++++++++++++++++++++++++++++++++++++++
include/linux/uprobes.h | 6 ++
kernel/events/uprobes.c | 10 ++++
kernel/fork.c | 1 +
4 files changed, 138 insertions(+)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 3ea682dbeb39..e0c3fb01a43c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -691,6 +691,127 @@ static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
*sr = utask->autask.saved_scratch_register;
}
}
+
+struct uprobe_trampoline {
+ struct hlist_node node;
+ unsigned long vaddr;
+ atomic64_t ref;
+};
+
+static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
+{
+ long delta = (long)(vaddr + 5 - vtramp);
+
+ return delta >= INT_MIN && delta <= INT_MAX;
+}
+
+static unsigned long find_nearest_page(unsigned long vaddr)
+{
+ struct vm_area_struct *vma, *prev = NULL;
+ unsigned long prev_vm_end = PAGE_SIZE;
+ VMA_ITERATOR(vmi, current->mm, 0);
+
+ vma = vma_next(&vmi);
+ while (vma) {
+ if (prev)
+ prev_vm_end = prev->vm_end;
+ if (vma->vm_start - prev_vm_end >= PAGE_SIZE) {
+ if (is_reachable_by_call(prev_vm_end, vaddr))
+ return prev_vm_end;
+ if (is_reachable_by_call(vma->vm_start - PAGE_SIZE, vaddr))
+ return vma->vm_start - PAGE_SIZE;
+ }
+ prev = vma;
+ vma = vma_next(&vmi);
+ }
+
+ return 0;
+}
+
+static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ const struct vm_special_mapping *mapping;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ struct uprobe_trampoline *tramp;
+
+ mapping = user_64bit_mode(regs) ? &tramp_mapping : NULL;
+ if (!mapping)
+ return NULL;
+
+ vaddr = find_nearest_page(vaddr);
+ if (!vaddr)
+ return NULL;
+
+ tramp = kzalloc(sizeof(*tramp), GFP_KERNEL);
+ if (unlikely(!tramp))
+ return NULL;
+
+ atomic64_set(&tramp->ref, 1);
+ tramp->vaddr = vaddr;
+
+ vma = _install_special_mapping(mm, tramp->vaddr, PAGE_SIZE,
+ VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+ mapping);
+ if (IS_ERR(vma))
+ goto free_area;
+ return tramp;
+
+ free_area:
+ kfree(tramp);
+ return NULL;
+}
+
+static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
+{
+ struct uprobes_state *state = ¤t->mm->uprobes_state;
+ struct uprobe_trampoline *tramp = NULL;
+
+ hlist_for_each_entry(tramp, &state->head_tramps, node) {
+ if (is_reachable_by_call(tramp->vaddr, vaddr)) {
+ atomic64_inc(&tramp->ref);
+ return tramp;
+ }
+ }
+
+ tramp = create_uprobe_trampoline(vaddr);
+ if (!tramp)
+ return NULL;
+
+ hlist_add_head(&tramp->node, &state->head_tramps);
+ return tramp;
+}
+
+static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
+{
+ hlist_del(&tramp->node);
+ kfree(tramp);
+}
+
+static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
+{
+ if (tramp == NULL)
+ return;
+
+ if (atomic64_dec_and_test(&tramp->ref))
+ destroy_uprobe_trampoline(tramp);
+}
+
+void arch_uprobe_init_state(struct mm_struct *mm)
+{
+ INIT_HLIST_HEAD(&mm->uprobes_state.head_tramps);
+}
+
+void arch_uprobe_clear_state(struct mm_struct *mm)
+{
+ struct uprobes_state *state = &mm->uprobes_state;
+ struct uprobe_trampoline *tramp;
+ struct hlist_node *n;
+
+ hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
+ destroy_uprobe_trampoline(tramp);
+}
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index de3631ae1746..05a156750e8d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -17,6 +17,7 @@
#include <linux/wait.h>
#include <linux/timer.h>
#include <linux/seqlock.h>
+#include <linux/mutex.h>
struct uprobe;
struct vm_area_struct;
@@ -183,6 +184,9 @@ struct xol_area;
struct uprobes_state {
struct xol_area *xol_area;
+#ifdef CONFIG_X86_64
+ struct hlist_head head_tramps;
+#endif
};
typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode, int nbytes);
@@ -232,6 +236,8 @@ extern void *arch_uretprobe_trampoline(unsigned long *psize);
extern unsigned long uprobe_get_trampoline_vaddr(void);
extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
+extern void arch_uprobe_clear_state(struct mm_struct *mm);
+extern void arch_uprobe_init_state(struct mm_struct *mm);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6ac691fe5682..c690cde4442c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1784,6 +1784,14 @@ static struct xol_area *get_xol_area(void)
return area;
}
+void __weak arch_uprobe_clear_state(struct mm_struct *mm)
+{
+}
+
+void __weak arch_uprobe_init_state(struct mm_struct *mm)
+{
+}
+
/*
* uprobe_clear_state - Free the area allocated for slots.
*/
@@ -1795,6 +1803,8 @@ void uprobe_clear_state(struct mm_struct *mm)
delayed_uprobe_remove(NULL, mm);
mutex_unlock(&delayed_uprobe_lock);
+ arch_uprobe_clear_state(mm);
+
if (!area)
return;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..e79baa6dcce6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1260,6 +1260,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
{
#ifdef CONFIG_UPROBES
mm->uprobes_state.xol_area = NULL;
+ arch_uprobe_init_state(mm);
#endif
}
--
2.48.1
* [PATCH RFCv2 10/18] uprobes/x86: Add mm_uprobe objects to track uprobes within mm
From: Jiri Olsa @ 2025-02-24 14:01 UTC
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
We currently keep track only of global uprobe instances, because with
just two types of update - writing the breakpoint or the original
opcode - we don't need to track the state of a specific uprobe per
mm_struct.

With optimized uprobe support we will need to make several instruction
updates and make sure we keep the state of the update per mm_struct.

Adding the mm_uprobe object to keep track of installed uprobes per
mm_struct. It's kept in an rb_tree for fast lookups, and the tree is
cleaned up when the breakpoint is uninstalled or the mm_struct is
released.

The key is the uprobe object's address together with the virtual
address of the breakpoint. The reason for adding the latter to the key
is that we can have multiple virtual addresses for a single uprobe,
because the code (for a given offset) can be mapped multiple times.
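
For illustration, if the same probed code is mapped at two addresses in
one process, a single uprobe yields two mm_uprobe entries (the
addresses below are hypothetical):

  /*
   *   { auprobe = 0xffff888100aa10c0, vaddr = 0x00007f1122330123 }
   *   { auprobe = 0xffff888100aa10c0, vaddr = 0x00007f5566770123 }
   *
   * lookups use the (auprobe, vaddr) pair as the rb_tree key,
   * see find_mm_uprobe() below.
   */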
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 115 ++++++++++++++++++++++++++++++++++++++
include/linux/uprobes.h | 1 +
2 files changed, 116 insertions(+)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e0c3fb01a43c..8d4eb8133221 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -798,19 +798,134 @@ static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp
destroy_uprobe_trampoline(tramp);
}
+struct mm_uprobe {
+ struct rb_node rb_node;
+ unsigned long auprobe;
+ unsigned long vaddr;
+};
+
+#define __node_2_mm_uprobe(node) rb_entry((node), struct mm_uprobe, rb_node)
+
+struct __mm_uprobe_key {
+ unsigned long auprobe;
+ unsigned long vaddr;
+};
+
+static inline int mm_uprobe_cmp(unsigned long l_auprobe, unsigned long l_vaddr,
+ const struct mm_uprobe *r_mmu)
+{
+ if (l_auprobe < r_mmu->auprobe)
+ return -1;
+ if (l_auprobe > r_mmu->auprobe)
+ return 1;
+ if (l_vaddr < r_mmu->vaddr)
+ return -1;
+ if (l_vaddr > r_mmu->vaddr)
+ return 1;
+
+ return 0;
+}
+
+static inline int __mm_uprobe_cmp(struct rb_node *a, const struct rb_node *b)
+{
+ struct mm_uprobe *mmu_a = __node_2_mm_uprobe(a);
+
+ return mm_uprobe_cmp(mmu_a->auprobe, mmu_a->vaddr, __node_2_mm_uprobe(b));
+}
+
+static inline bool __mm_uprobe_less(struct rb_node *a, const struct rb_node *b)
+{
+ struct mm_uprobe *mmu_a = __node_2_mm_uprobe(a);
+
+ return mm_uprobe_cmp(mmu_a->auprobe, mmu_a->vaddr, __node_2_mm_uprobe(b)) < 0;
+}
+
+static inline int __mm_uprobe_cmp_key(const void *key, const struct rb_node *b)
+{
+ const struct __mm_uprobe_key *a = key;
+
+ return mm_uprobe_cmp(a->auprobe, a->vaddr, __node_2_mm_uprobe(b));
+}
+
+static struct mm_uprobe *find_mm_uprobe(struct mm_struct *mm, struct arch_uprobe *auprobe,
+ unsigned long vaddr)
+{
+ struct __mm_uprobe_key key = {
+ .auprobe = (unsigned long) auprobe,
+ .vaddr = vaddr,
+ };
+ struct rb_node *node;
+
+ node = rb_find(&key, &mm->uprobes_state.root_uprobes, __mm_uprobe_cmp_key);
+ return node ? __node_2_mm_uprobe(node) : NULL;
+}
+
+static struct mm_uprobe *insert_mm_uprobe(struct mm_struct *mm, struct arch_uprobe *auprobe,
+ unsigned long vaddr)
+{
+ struct mm_uprobe *mmu;
+
+ mmu = kmalloc(sizeof(*mmu), GFP_KERNEL);
+ if (mmu) {
+ mmu->auprobe = (unsigned long) auprobe;
+ mmu->vaddr = vaddr;
+ RB_CLEAR_NODE(&mmu->rb_node);
+ rb_add(&mmu->rb_node, &mm->uprobes_state.root_uprobes, __mm_uprobe_less);
+ }
+ return mmu;
+}
+
+static void destroy_mm_uprobe(struct mm_uprobe *mmu, struct rb_root *root)
+{
+ rb_erase(&mmu->rb_node, root);
+ kfree(mmu);
+}
+
+int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+ struct mm_uprobe *mmu;
+
+ if (find_mm_uprobe(mm, auprobe, vaddr))
+ return 0;
+ mmu = insert_mm_uprobe(mm, auprobe, vaddr);
+ if (!mmu)
+ return -ENOMEM;
+ return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN, false);
+}
+
+int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+ struct mm_uprobe *mmu;
+
+ mmu = find_mm_uprobe(mm, auprobe, vaddr);
+ if (!mmu)
+ return 0;
+ destroy_mm_uprobe(mmu, &mm->uprobes_state.root_uprobes);
+ return uprobe_write_opcode(auprobe, mm, vaddr, *(uprobe_opcode_t *)&auprobe->insn, true);
+}
+
void arch_uprobe_init_state(struct mm_struct *mm)
{
INIT_HLIST_HEAD(&mm->uprobes_state.head_tramps);
+ mm->uprobes_state.root_uprobes = RB_ROOT;
}
void arch_uprobe_clear_state(struct mm_struct *mm)
{
struct uprobes_state *state = &mm->uprobes_state;
struct uprobe_trampoline *tramp;
+ struct rb_node *node, *next;
struct hlist_node *n;
hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
destroy_uprobe_trampoline(tramp);
+
+ node = rb_first(&state->root_uprobes);
+ while (node) {
+ next = rb_next(node);
+ destroy_mm_uprobe(__node_2_mm_uprobe(node), &state->root_uprobes);
+ node = next;
+ }
}
#else /* 32-bit: */
/*
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 05a156750e8d..bd726daa4428 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -186,6 +186,7 @@ struct uprobes_state {
struct xol_area *xol_area;
#ifdef CONFIG_X86_64
struct hlist_head head_tramps;
+ struct rb_root root_uprobes;
#endif
};
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 11/18] uprobes/x86: Add support to emulate nop5 instruction
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (9 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 10/18] uprobes/x86: Add mm_uprobe objects to track uprobes within mm Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes Jiri Olsa
` (7 subsequent siblings)
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding support to emulate nop5 as the original uprobe instruction.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/uprobes.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 8d4eb8133221..e8aebbda83bc 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -308,6 +308,11 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
return -ENOTSUPP;
}
+static int is_nop5_insn(uprobe_opcode_t *insn)
+{
+ return !memcmp(insn, x86_nops[5], 5);
+}
+
#ifdef CONFIG_X86_64
asm (
@@ -927,6 +932,11 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
node = next;
}
}
+
+static bool emulate_nop5_insn(struct arch_uprobe *auprobe)
+{
+ return is_nop5_insn((uprobe_opcode_t *) &auprobe->insn);
+}
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
@@ -940,6 +950,10 @@ static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
{
}
+static bool emulate_nop5_insn(struct arch_uprobe *auprobe)
+{
+ return false;
+}
#endif /* CONFIG_X86_64 */
struct uprobe_xol_ops {
@@ -1171,6 +1185,8 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
break;
case 0x0f:
+ if (emulate_nop5_insn(auprobe))
+ goto setup;
if (insn->opcode.nbytes != 2)
return -ENOSYS;
/*
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (10 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 11/18] uprobes/x86: Add support to emulate nop5 instruction Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-28 18:55 ` Andrii Nakryiko
2025-02-28 23:00 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 13/18] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
` (6 subsequent siblings)
18 siblings, 2 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.
The current uprobe execution goes through the following steps:
- install the breakpoint instruction over the original instruction
- the exception handler is hit and calls the related uprobe consumers
- either simulate the original instruction or do out of line single step
execution of it
- return to user space
The optimized uprobe path:
- check that the original instruction is a 5-byte nop (plus other checks)
- add (or reuse an existing) user space trampoline and overwrite the original
instruction (5-byte nop) with a call to the user space trampoline
- the user space trampoline executes the uprobe syscall that calls the
related uprobe consumers
- the trampoline returns back to the next instruction
This approach won't speed up all uprobes as it's limited to using nop5 as
the original instruction, but we could use nop5 as the USDT probe instruction
(which uses a single byte nop ATM) and speed up the USDT probes.
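For illustration, a minimal standalone sketch of the rewrite this patch
performs (the addresses are made up; the displacement math mirrors
relative_call() below):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/*
	 * Sketch only: the probed 5-byte nop
	 *     0f 1f 44 00 00        nopl 0x0(%rax,%rax,1)
	 * is replaced with a relative call to the trampoline
	 *     e8 <s32>              call tramp
	 * where the s32 displacement is tramp - (vaddr + 5).
	 */
	unsigned long vaddr = 0x401000, tramp = 0x404000; /* made up */
	int disp = (int)(tramp - (vaddr + 5));
	unsigned char call[5] = { 0xe8, };

	memcpy(call + 1, &disp, 4);
	printf("e8 %02x %02x %02x %02x\n", call[1], call[2], call[3], call[4]);
	return 0;
}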
This patch overloads the related arch functions in uprobe_write_opcode and
set_orig_insn so they can install the call instruction if needed.
The arch_uprobe_optimize function triggers the uprobe optimization and is
called after the first uprobe hit. I originally had it called on uprobe
installation, but then it clashed with the elf loader, because the user space
trampoline was added in a place where the loader might need to put elf
segments, so I decided to do it after the first uprobe hit, when loading is
done.
We do not unmap and release the uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just a
single page for all the uprobe trampoline mappings.
We do waste a frame on the page mapping for every 4GB by keeping the uprobe
trampoline page mapped, but that seems ok.
Attaching the speedup from the benchs/run_bench_uprobes.sh script:
current:
usermode-count : 818.836 ± 2.842M/s
syscall-count : 8.917 ± 0.003M/s
uprobe-nop : 3.056 ± 0.013M/s
uprobe-push : 2.903 ± 0.002M/s
uprobe-ret : 1.533 ± 0.001M/s
--> uprobe-nop5 : 1.492 ± 0.000M/s
uretprobe-nop : 1.783 ± 0.000M/s
uretprobe-push : 1.672 ± 0.001M/s
uretprobe-ret : 1.067 ± 0.002M/s
--> uretprobe-nop5 : 1.052 ± 0.000M/s
after the change:
usermode-count : 818.386 ± 1.886M/s
syscall-count : 8.923 ± 0.003M/s
uprobe-nop : 3.086 ± 0.005M/s
uprobe-push : 2.751 ± 0.001M/s
uprobe-ret : 1.481 ± 0.000M/s
--> uprobe-nop5 : 4.016 ± 0.002M/s
uretprobe-nop : 1.712 ± 0.008M/s
uretprobe-push : 1.616 ± 0.001M/s
uretprobe-ret : 1.052 ± 0.000M/s
--> uretprobe-nop5 : 2.015 ± 0.000M/s
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/include/asm/uprobes.h | 6 ++
arch/x86/kernel/uprobes.c | 191 ++++++++++++++++++++++++++++++++-
include/linux/uprobes.h | 6 +-
kernel/events/uprobes.c | 16 ++-
4 files changed, 209 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 678fb546f0a7..7d4df920bb59 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
#define UPROBE_SWBP_INSN 0xcc
#define UPROBE_SWBP_INSN_SIZE 1
+enum {
+ ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
+};
+
struct uprobe_xol_ops;
struct arch_uprobe {
@@ -45,6 +49,8 @@ struct arch_uprobe {
u8 ilen;
} push;
};
+
+ unsigned long flags;
};
struct arch_uprobe_task {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e8aebbda83bc..73ddff823904 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -18,6 +18,7 @@
#include <asm/processor.h>
#include <asm/insn.h>
#include <asm/mmu_context.h>
+#include <asm/nops.h>
/* Post-execution fixups. */
@@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
return NULL;
}
-static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
+static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
{
struct uprobes_state *state = &current->mm->uprobes_state;
struct uprobe_trampoline *tramp = NULL;
@@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
kfree(tramp);
}
-static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
+static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
{
if (tramp == NULL)
return;
@@ -807,6 +808,7 @@ struct mm_uprobe {
struct rb_node rb_node;
unsigned long auprobe;
unsigned long vaddr;
+ bool optimized;
};
#define __node_2_mm_uprobe(node) rb_entry((node), struct mm_uprobe, rb_node)
@@ -874,6 +876,7 @@ static struct mm_uprobe *insert_mm_uprobe(struct mm_struct *mm, struct arch_upro
if (mmu) {
mmu->auprobe = (unsigned long) auprobe;
mmu->vaddr = vaddr;
+ mmu->optimized = false;
RB_CLEAR_NODE(&mmu->rb_node);
rb_add(&mmu->rb_node, &mm->uprobes_state.root_uprobes, __mm_uprobe_less);
}
@@ -886,6 +889,134 @@ static void destroy_mm_uprobe(struct mm_uprobe *mmu, struct rb_root *root)
kfree(mmu);
}
+enum {
+ OPT_PART,
+ OPT_INSN,
+ UNOPT_INT3,
+ UNOPT_PART,
+};
+
+struct write_opcode_ctx {
+ unsigned long base;
+ int update;
+};
+
+static int is_call_insn(uprobe_opcode_t *insn)
+{
+ return *insn == CALL_INSN_OPCODE;
+}
+
+static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
+ int nbytes, void *data)
+{
+ struct write_opcode_ctx *ctx = data;
+ uprobe_opcode_t old_opcode[5];
+
+ uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+
+ switch (ctx->update) {
+ case OPT_PART:
+ case OPT_INSN:
+ if (is_swbp_insn(&old_opcode[0]))
+ return 1;
+ break;
+ case UNOPT_INT3:
+ if (is_call_insn(&old_opcode[0]))
+ return 1;
+ break;
+ case UNOPT_PART:
+ if (is_swbp_insn(&old_opcode[0]))
+ return 1;
+ break;
+ }
+
+ return -1;
+}
+
+static int write_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
+ uprobe_opcode_t *insn, int nbytes, void *ctx)
+{
+ return uprobe_write(auprobe, mm, vaddr, insn, nbytes, verify_insn, false, ctx);
+}
+
+static void relative_call(void *dest, long from, long to)
+{
+ struct __packed __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } *insn;
+
+ insn = (struct __arch_relative_insn *)dest;
+ insn->raddr = (s32)(to - (from + 5));
+ insn->op = CALL_INSN_OPCODE;
+}
+
+static int swbp_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
+ unsigned long tramp)
+{
+ struct write_opcode_ctx ctx = {
+ .base = vaddr,
+ };
+ char call[5];
+ int err;
+
+ relative_call(call, vaddr, tramp);
+
+ /*
+ * We are in a state where the breakpoint (int3) is installed on top of the
+ * first byte of the nop5 instruction. We will do the following steps to
+ * overwrite it with the call instruction:
+ *
+ * - sync cores
+ * - write the last 4 bytes of the call instruction
+ * - sync cores
+ * - update the call instruction opcode
+ */
+ text_poke_sync();
+
+ ctx.update = OPT_PART;
+ err = write_insn(auprobe, mm, vaddr + 1, call + 1, 4, &ctx);
+ if (err)
+ return err;
+
+ text_poke_sync();
+
+ ctx.update = OPT_INSN;
+ return write_insn(auprobe, mm, vaddr, call, 1, &ctx);
+}
+
+static int swbp_unoptimize(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
+{
+ uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
+ struct write_opcode_ctx ctx = {
+ .base = vaddr,
+ };
+ int err;
+
+ /*
+ * We need to overwrite the call instruction back into the nop5 instruction,
+ * with the breakpoint (int3) installed on top of its first byte. We will:
+ *
+ * - overwrite the call opcode with the breakpoint (int3)
+ * - sync cores
+ * - write the last 4 bytes of the nop5 instruction
+ * - sync cores
+ */
+
+ ctx.update = UNOPT_INT3;
+ err = write_insn(auprobe, mm, vaddr, &int3, 1, &ctx);
+ if (err)
+ return err;
+
+ text_poke_sync();
+
+ ctx.update = UNOPT_PART;
+ err = write_insn(auprobe, mm, vaddr + 1, (uprobe_opcode_t *) auprobe->insn + 1, 4, &ctx);
+
+ text_poke_sync();
+ return err;
+}
+
int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
{
struct mm_uprobe *mmu;
@@ -905,6 +1036,8 @@ int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned lo
mmu = find_mm_uprobe(mm, auprobe, vaddr);
if (!mmu)
return 0;
+ if (mmu->optimized)
+ WARN_ON_ONCE(swbp_unoptimize(auprobe, mm, vaddr));
destroy_mm_uprobe(mmu, &mm->uprobes_state.root_uprobes);
return uprobe_write_opcode(auprobe, mm, vaddr, *(uprobe_opcode_t *)&auprobe->insn, true);
}
@@ -937,6 +1070,41 @@ static bool emulate_nop5_insn(struct arch_uprobe *auprobe)
{
return is_nop5_insn((uprobe_opcode_t *) &auprobe->insn);
}
+
+void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ struct mm_struct *mm = current->mm;
+ struct uprobe_trampoline *tramp;
+ struct mm_uprobe *mmu;
+
+ if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
+ return;
+
+ mmap_write_lock(mm);
+ mmu = find_mm_uprobe(mm, auprobe, vaddr);
+ if (!mmu || mmu->optimized)
+ goto unlock;
+
+ tramp = uprobe_trampoline_get(vaddr);
+ if (!tramp)
+ goto unlock;
+
+ if (WARN_ON_ONCE(swbp_optimize(auprobe, mm, vaddr, tramp->vaddr)))
+ uprobe_trampoline_put(tramp);
+ else
+ mmu->optimized = true;
+
+unlock:
+ mmap_write_unlock(mm);
+}
+
+static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ if (!is_nop5_insn((uprobe_opcode_t *) &auprobe->insn))
+ return false;
+ /* We can't do cross page atomic writes yet. */
+ return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+}
#else /* 32-bit: */
/*
* No RIP-relative addressing on 32-bit
@@ -954,6 +1122,10 @@ static bool emulate_nop5_insn(struct arch_uprobe *auprobe)
{
return false;
}
+static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ return false;
+}
#endif /* CONFIG_X86_64 */
struct uprobe_xol_ops {
@@ -1317,6 +1489,9 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (ret)
return ret;
+ if (can_optimize(auprobe, addr))
+ set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+
ret = branch_setup_xol_ops(auprobe, &insn);
if (ret != -ENOSYS)
return ret;
@@ -1523,15 +1698,23 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
{
int rasize = sizeof_long(regs), nleft;
unsigned long orig_ret_vaddr = 0; /* clear high bits for 32-bit apps */
+ unsigned long off = 0;
+
+ /*
+ * An optimized uprobe goes through the uprobe trampoline, which adds 4
+ * 8-byte values on the stack; check uprobe_trampoline_entry for details.
+ */
+ if (!swbp)
+ off = 4*8;
- if (copy_from_user(&orig_ret_vaddr, (void __user *)regs->sp, rasize))
+ if (copy_from_user(&orig_ret_vaddr, (void __user *)regs->sp + off, rasize))
return -1;
/* check whether address has been already hijacked */
if (orig_ret_vaddr == trampoline_vaddr)
return orig_ret_vaddr;
- nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
+ nleft = copy_to_user((void __user *)regs->sp + off, &trampoline_vaddr, rasize);
if (likely(!nleft)) {
if (shstk_update_last_frame(trampoline_vaddr)) {
force_sig(SIGSEGV);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bd726daa4428..22c79905754c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -190,7 +190,8 @@ struct uprobes_state {
#endif
};
-typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode, int nbytes);
+typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr, uprobe_opcode_t *opcode,
+ int nbytes, void *data);
extern void __init uprobes_init(void);
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
@@ -202,7 +203,7 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
uprobe_opcode_t, bool);
extern int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr,
- uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool orig);
+ uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool orig, void *data);
extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
@@ -239,6 +240,7 @@ extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *
extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
extern void arch_uprobe_clear_state(struct mm_struct *mm);
extern void arch_uprobe_init_state(struct mm_struct *mm);
+extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c690cde4442c..ba4f18bac73e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -264,7 +264,8 @@ static void uprobe_copy_to_page(struct page *page, unsigned long vaddr, const vo
kunmap_atomic(kaddr);
}
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, int nbytes)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
+ int nbytes, void *data)
{
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -473,12 +474,12 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode, bool orig)
{
- return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode, orig);
+ return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode, orig, NULL);
}
int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t *insn,
- int nbytes, uprobe_write_verify_t verify, bool orig)
+ int nbytes, uprobe_write_verify_t verify, bool orig, void *data)
{
struct page *old_page, *new_page;
struct vm_area_struct *vma;
@@ -494,7 +495,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
if (IS_ERR(old_page))
return PTR_ERR(old_page);
- ret = verify(old_page, vaddr, insn, nbytes);
+ ret = verify(old_page, vaddr, insn, nbytes, data);
if (ret <= 0)
goto put_old;
@@ -2668,6 +2669,10 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
return true;
}
+void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+}
+
/*
* Run handler and ask thread to singlestep.
* Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
@@ -2732,6 +2737,9 @@ static void handle_swbp(struct pt_regs *regs)
handler_chain(uprobe, regs, true);
+ /* Try to optimize after first hit. */
+ arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
+
if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
goto out;
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 13/18] selftests/bpf: Reorg the uprobe_syscall test function
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (11 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 14/18] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
` (5 subsequent siblings)
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding __test_uprobe_syscall with a non-x86_64 stub to execute all the tests,
so we don't need to keep adding non-x86_64 stub functions for new tests.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 34 +++++++------------
1 file changed, 12 insertions(+), 22 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index c397336fe1ed..2b00f16406c8 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -350,29 +350,8 @@ static void test_uretprobe_shadow_stack(void)
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
}
-#else
-static void test_uretprobe_regs_equal(void)
-{
- test__skip();
-}
-
-static void test_uretprobe_regs_change(void)
-{
- test__skip();
-}
-
-static void test_uretprobe_syscall_call(void)
-{
- test__skip();
-}
-static void test_uretprobe_shadow_stack(void)
-{
- test__skip();
-}
-#endif
-
-void test_uprobe_syscall(void)
+static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
test_uretprobe_regs_equal();
@@ -383,3 +362,14 @@ void test_uprobe_syscall(void)
if (test__start_subtest("uretprobe_shadow_stack"))
test_uretprobe_shadow_stack();
}
+#else
+static void __test_uprobe_syscall(void)
+{
+ test__skip();
+}
+#endif
+
+void test_uprobe_syscall(void)
+{
+ __test_uprobe_syscall();
+}
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 14/18] selftests/bpf: Use 5-byte nop for x86 usdt probes
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (12 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 13/18] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 15/18] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
` (4 subsequent siblings)
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Using a 5-byte nop (0x0f 0x1f 0x44 0x00 0x00, i.e. nopl 0x0(%rax,%rax,1)) for
x86 usdt probes so we can switch them to optimized uprobes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/sdt.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/sdt.h b/tools/testing/selftests/bpf/sdt.h
index 1fcfa5160231..1d62c06f5ddc 100644
--- a/tools/testing/selftests/bpf/sdt.h
+++ b/tools/testing/selftests/bpf/sdt.h
@@ -236,6 +236,13 @@ __extension__ extern unsigned long long __sdt_unsp;
#define _SDT_NOP nop
#endif
+/* Use 5 byte nop for x86_64 to allow optimizing uprobes. */
+#if defined(__x86_64__)
+# define _SDT_DEF_NOP _SDT_ASM_5(990: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00)
+#else
+# define _SDT_DEF_NOP _SDT_ASM_1(990: _SDT_NOP)
+#endif
+
#define _SDT_NOTE_NAME "stapsdt"
#define _SDT_NOTE_TYPE 3
@@ -288,7 +295,7 @@ __extension__ extern unsigned long long __sdt_unsp;
#define _SDT_ASM_BODY(provider, name, pack_args, args, ...) \
_SDT_DEF_MACROS \
- _SDT_ASM_1(990: _SDT_NOP) \
+ _SDT_DEF_NOP \
_SDT_ASM_3( .pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
_SDT_ASM_1( .balign 4) \
_SDT_ASM_3( .4byte 992f-991f, 994f-993f, _SDT_NOTE_TYPE) \
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 15/18] selftests/bpf: Add uprobe/usdt syscall tests
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (13 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 14/18] selftests/bpf: Use 5-byte nop for x86 usdt probes Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 16/18] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
` (3 subsequent siblings)
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding tests for optimized uprobe/usdt probes, checking that we get the
expected trampoline and that the attached bpf programs get executed
properly.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 221 +++++++++++++++++-
.../bpf/progs/uprobe_syscall_executed.c | 34 ++-
2 files changed, 249 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 2b00f16406c8..b337db6e12be 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -14,6 +14,7 @@
#include <asm/prctl.h>
#include "uprobe_syscall.skel.h"
#include "uprobe_syscall_executed.skel.h"
+#include "sdt.h"
__naked unsigned long uretprobe_regs_trigger(void)
{
@@ -277,10 +278,10 @@ static void test_uretprobe_syscall_call(void)
_exit(0);
}
- skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, pid,
- "/proc/self/exe",
- "uretprobe_syscall_call", &opts);
- if (!ASSERT_OK_PTR(skel->links.test, "bpf_program__attach_uprobe_multi"))
+ skel->links.test_uretprobe_multi = bpf_program__attach_uprobe_multi(skel->progs.test_uretprobe_multi,
+ pid, "/proc/self/exe",
+ "uretprobe_syscall_call", &opts);
+ if (!ASSERT_OK_PTR(skel->links.test_uretprobe_multi, "bpf_program__attach_uprobe_multi"))
goto cleanup;
/* kick the child */
@@ -351,6 +352,212 @@ static void test_uretprobe_shadow_stack(void)
ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
}
+#define TRAMP "[uprobes-trampoline]"
+
+static unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+
+__naked noinline void uprobe_test(void)
+{
+ asm volatile (" \n"
+ ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00 \n"
+ "ret \n"
+ );
+}
+
+noinline void usdt_test(void)
+{
+ STAP_PROBE(optimized_uprobe, usdt);
+}
+
+static void *find_nop5(void *fn)
+{
+ int i;
+
+ for (i = 0; i < 10; i++) {
+ if (!memcmp(nop5, fn + i, 5))
+ return fn + i;
+ }
+ return NULL;
+}
+
+static int find_uprobes_trampoline(void **start, void **end)
+{
+ char line[128];
+ int ret = -1;
+ FILE *maps;
+
+ maps = fopen("/proc/self/maps", "r");
+ if (!maps) {
+ fprintf(stderr, "cannot open maps\n");
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), maps)) {
+ int m = -1;
+
+ /* We care only about private r-x mappings. */
+ if (sscanf(line, "%p-%p r-xp %*x %*x:%*x %*u %n", start, end, &m) != 2)
+ continue;
+ if (m < 0)
+ continue;
+ if (!strncmp(&line[m], TRAMP, sizeof(TRAMP)-1)) {
+ ret = 0;
+ break;
+ }
+ }
+
+ fclose(maps);
+ return ret;
+}
+
+static void check_attach(struct uprobe_syscall_executed *skel, void (*trigger)(void), void *addr)
+{
+ void *tramp_start, *tramp_end;
+ struct __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } __packed *call;
+
+ s32 delta;
+
+ /* Uprobe gets optimized after first trigger, so let's press twice. */
+ trigger();
+ trigger();
+
+ if (!ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline"))
+ return;
+
+ /* Make sure bpf program got executed.. */
+ ASSERT_EQ(skel->bss->executed, 2, "executed");
+
+ /* .. and check the trampoline is as expected. */
+ call = (struct __arch_relative_insn *) addr;
+ delta = (unsigned long) tramp_start - ((unsigned long) addr + 5);
+
+ ASSERT_EQ(call->op, 0xe8, "call");
+ ASSERT_EQ(call->raddr, delta, "delta");
+ ASSERT_EQ(tramp_end - tramp_start, 4096, "size");
+}
+
+static void check_detach(struct uprobe_syscall_executed *skel, void (*trigger)(void), void *addr)
+{
+ void *tramp_start, *tramp_end;
+
+ /* [uprobes_trampoline] stays after detach */
+ ASSERT_OK(find_uprobes_trampoline(&tramp_start, &tramp_end), "uprobes_trampoline");
+ ASSERT_OK(memcmp(addr, nop5, 5), "nop5");
+}
+
+static void check(struct uprobe_syscall_executed *skel, struct bpf_link *link,
+ void (*trigger)(void), void *addr)
+{
+ check_attach(skel, trigger, addr);
+ bpf_link__destroy(link);
+ check_detach(skel, trigger, addr);
+}
+
+static void test_uprobe_legacy(void)
+{
+ LIBBPF_OPTS(bpf_uprobe_opts, opts,
+ .retprobe = true,
+ );
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+ unsigned long offset;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ /* uprobe */
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ goto cleanup;
+
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test);
+
+ /* uretprobe */
+ skel->bss->executed = 0;
+ offset = get_uprobe_offset(&uprobe_test);
+ if (!ASSERT_GE(offset, 0, "get_uprobe_offset"))
+ goto cleanup;
+
+ link = bpf_program__attach_uprobe_opts(skel->progs.test_uretprobe,
+ 0, "/proc/self/exe", offset, NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_opts"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
+static void test_uprobe_multi(void)
+{
+ LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+ .retprobe = true,
+ );
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ /* uprobe.multi */
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+ 0, "/proc/self/exe", "uprobe_test", NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test);
+
+ /* uretprobe.multi */
+ skel->bss->executed = 0;
+ link = bpf_program__attach_uprobe_multi(skel->progs.test_uretprobe_multi,
+ 0, "/proc/self/exe", "uprobe_test", &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_uprobe_multi"))
+ goto cleanup;
+
+ check(skel, link, uprobe_test, uprobe_test);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
+static void test_uprobe_usdt(void)
+{
+ struct uprobe_syscall_executed *skel;
+ struct bpf_link *link;
+ void *addr;
+
+ errno = 0;
+ addr = find_nop5(usdt_test);
+ if (!ASSERT_OK_PTR(addr, "find_nop5"))
+ return;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return;
+
+ link = bpf_program__attach_usdt(skel->progs.test_usdt,
+ -1 /* all PIDs */, "/proc/self/exe",
+ "optimized_uprobe", "usdt", NULL);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_usdt"))
+ goto cleanup;
+
+ check(skel, link, usdt_test, addr);
+
+cleanup:
+ uprobe_syscall_executed__destroy(skel);
+}
+
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
@@ -361,6 +568,12 @@ static void __test_uprobe_syscall(void)
test_uretprobe_syscall_call();
if (test__start_subtest("uretprobe_shadow_stack"))
test_uretprobe_shadow_stack();
+ if (test__start_subtest("uprobe_legacy"))
+ test_uprobe_legacy();
+ if (test__start_subtest("uprobe_multi"))
+ test_uprobe_multi();
+ if (test__start_subtest("uprobe_usdt"))
+ test_uprobe_usdt();
}
#else
static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
index 0d7f1a7db2e2..802fd562ce4c 100644
--- a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/usdt.bpf.h>
#include <string.h>
struct pt_regs regs;
@@ -9,9 +11,37 @@ char _license[] SEC("license") = "GPL";
int executed = 0;
+SEC("uprobe")
+int BPF_UPROBE(test_uprobe)
+{
+ executed++;
+ return 0;
+}
+
+SEC("uretprobe")
+int BPF_URETPROBE(test_uretprobe)
+{
+ executed++;
+ return 0;
+}
+
+SEC("uprobe.multi")
+int test_uprobe_multi(struct pt_regs *ctx)
+{
+ executed++;
+ return 0;
+}
+
SEC("uretprobe.multi")
-int test(struct pt_regs *regs)
+int test_uretprobe_multi(struct pt_regs *ctx)
+{
+ executed++;
+ return 0;
+}
+
+SEC("usdt")
+int test_usdt(struct pt_regs *ctx)
{
- executed = 1;
+ executed++;
return 0;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 16/18] selftests/bpf: Add hit/attach/detach race optimized uprobe test
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (14 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 15/18] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 17/18] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
` (2 subsequent siblings)
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Adding a test that makes sure that parallel execution of the uprobe and
attach/detach of the optimized uprobe on top of it works properly.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 75 +++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index b337db6e12be..3f320da4ac46 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -558,6 +558,79 @@ static void test_uprobe_usdt(void)
uprobe_syscall_executed__destroy(skel);
}
+static volatile bool race_stop;
+
+static void *worker_trigger(void *arg)
+{
+ unsigned long rounds = 0;
+
+ while (!race_stop) {
+ uprobe_test();
+ rounds++;
+ }
+
+ printf("tid %d trigger rounds: %lu\n", gettid(), rounds);
+ return NULL;
+}
+
+static void *worker_attach(void *arg)
+{
+ struct uprobe_syscall_executed *skel;
+ unsigned long rounds = 0;
+
+ skel = uprobe_syscall_executed__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+ return NULL;
+
+ while (!race_stop) {
+ skel->links.test_uprobe_multi = bpf_program__attach_uprobe_multi(skel->progs.test_uprobe_multi,
+ -1, "/proc/self/exe", "uprobe_test", NULL);
+ if (!ASSERT_OK_PTR(skel->links.test_uprobe_multi, "bpf_program__attach_uprobe_multi"))
+ break;
+ bpf_link__destroy(skel->links.test_uprobe_multi);
+ skel->links.test_uprobe_multi = NULL;
+ rounds++;
+ }
+
+ printf("tid %d attach rounds: %lu hits: %d\n", gettid(), rounds, skel->bss->executed);
+ uprobe_syscall_executed__destroy(skel);
+ return NULL;
+}
+
+static void test_uprobe_race(void)
+{
+ int err, i, nr_cpus, nr;
+ pthread_t *threads;
+
+ nr_cpus = libbpf_num_possible_cpus();
+ if (!ASSERT_GE(nr_cpus, 0, "nr_cpus"))
+ return;
+
+ nr = nr_cpus * 2;
+ threads = malloc(sizeof(*threads) * nr);
+ if (!ASSERT_OK_PTR(threads, "malloc"))
+ return;
+
+ for (i = 0; i < nr_cpus; i++) {
+ err = pthread_create(&threads[i], NULL, worker_trigger, NULL);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto cleanup;
+ }
+
+ for (; i < nr; i++) {
+ err = pthread_create(&threads[i], NULL, worker_attach, NULL);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto cleanup;
+ }
+
+ sleep(4);
+
+cleanup:
+ race_stop = true;
+ for (nr = i, i = 0; i < nr; i++)
+ pthread_join(threads[i], NULL);
+}
+
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
@@ -574,6 +647,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_multi();
if (test__start_subtest("uprobe_usdt"))
test_uprobe_usdt();
+ if (test__start_subtest("uprobe_race"))
+ test_uprobe_race();
}
#else
static void __test_uprobe_syscall(void)
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 17/18] selftests/bpf: Add uprobe syscall sigill signal test
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (15 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 16/18] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 14:01 ` [PATCH RFCv2 18/18] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
2025-02-24 18:46 ` [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Ingo Molnar
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Make sure that calling the uprobe syscall from outside the uprobe trampoline
results in a SIGILL signal.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
.../selftests/bpf/prog_tests/uprobe_syscall.c | 36 +++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 3f320da4ac46..c5894ddc5b5e 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -631,6 +631,40 @@ static void test_uprobe_race(void)
pthread_join(threads[i], NULL);
}
+#ifndef __NR_uprobe
+#define __NR_uprobe 336
+#endif
+
+static void test_uprobe_sigill(void)
+{
+ int status, err, pid;
+
+ pid = fork();
+ if (!ASSERT_GE(pid, 0, "fork"))
+ return;
+ /* child */
+ if (pid == 0) {
+ asm volatile (
+ "pushq %rax\n"
+ "pushq %rcx\n"
+ "pushq %r11\n"
+ "movq $" __stringify(__NR_uprobe) ", %rax\n"
+ "syscall\n"
+ "popq %r11\n"
+ "popq %rcx\n"
+ "retq\n"
+ );
+ exit(0);
+ }
+
+ err = waitpid(pid, &status, 0);
+ ASSERT_EQ(err, pid, "waitpid");
+
+ /* verify the child got killed with SIGILL */
+ ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED");
+ ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG");
+}
+
static void __test_uprobe_syscall(void)
{
if (test__start_subtest("uretprobe_regs_equal"))
@@ -649,6 +683,8 @@ static void __test_uprobe_syscall(void)
test_uprobe_usdt();
if (test__start_subtest("uprobe_race"))
test_uprobe_race();
+ if (test__start_subtest("uprobe_sigill"))
+ test_uprobe_sigill();
}
#else
static void __test_uprobe_syscall(void)
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH RFCv2 18/18] selftests/bpf: Add 5-byte nop uprobe trigger bench
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (16 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 17/18] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
@ 2025-02-24 14:01 ` Jiri Olsa
2025-02-24 18:46 ` [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Ingo Molnar
18 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-24 14:01 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
Add a 5-byte nop uprobe trigger bench (x86_64 specific) to measure
uprobes/uretprobes on top of the nop5 instruction.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/bench.c | 12 ++++++
.../selftests/bpf/benchs/bench_trigger.c | 42 +++++++++++++++++++
.../selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
3 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b..0fd8c9b0d38f 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -526,6 +526,12 @@ extern const struct bench bench_trig_uprobe_multi_push;
extern const struct bench bench_trig_uretprobe_multi_push;
extern const struct bench bench_trig_uprobe_multi_ret;
extern const struct bench bench_trig_uretprobe_multi_ret;
+#ifdef __x86_64__
+extern const struct bench bench_trig_uprobe_nop5;
+extern const struct bench bench_trig_uretprobe_nop5;
+extern const struct bench bench_trig_uprobe_multi_nop5;
+extern const struct bench bench_trig_uretprobe_multi_nop5;
+#endif
extern const struct bench bench_rb_libbpf;
extern const struct bench bench_rb_custom;
@@ -586,6 +592,12 @@ static const struct bench *benchs[] = {
&bench_trig_uretprobe_multi_push,
&bench_trig_uprobe_multi_ret,
&bench_trig_uretprobe_multi_ret,
+#ifdef __x86_64__
+ &bench_trig_uprobe_nop5,
+ &bench_trig_uretprobe_nop5,
+ &bench_trig_uprobe_multi_nop5,
+ &bench_trig_uretprobe_multi_nop5,
+#endif
/* ringbuf/perfbuf benchmarks */
&bench_rb_libbpf,
&bench_rb_custom,
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 32e9f194d449..82327657846e 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -333,6 +333,20 @@ static void *uprobe_producer_ret(void *input)
return NULL;
}
+#ifdef __x86_64__
+__nocf_check __weak void uprobe_target_nop5(void)
+{
+ asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
+}
+
+static void *uprobe_producer_nop5(void *input)
+{
+ while (true)
+ uprobe_target_nop5();
+ return NULL;
+}
+#endif
+
static void usetup(bool use_retprobe, bool use_multi, void *target_addr)
{
size_t uprobe_offset;
@@ -448,6 +462,28 @@ static void uretprobe_multi_ret_setup(void)
usetup(true, true /* use_multi */, &uprobe_target_ret);
}
+#ifdef __x86_64__
+static void uprobe_nop5_setup(void)
+{
+ usetup(false, false /* !use_multi */, &uprobe_target_nop5);
+}
+
+static void uretprobe_nop5_setup(void)
+{
+ usetup(true, false /* !use_multi */, &uprobe_target_nop5);
+}
+
+static void uprobe_multi_nop5_setup(void)
+{
+ usetup(false, true /* use_multi */, &uprobe_target_nop5);
+}
+
+static void uretprobe_multi_nop5_setup(void)
+{
+ usetup(true, true /* use_multi */, &uprobe_target_nop5);
+}
+#endif
+
const struct bench bench_trig_syscall_count = {
.name = "trig-syscall-count",
.validate = trigger_validate,
@@ -506,3 +542,9 @@ BENCH_TRIG_USERMODE(uprobe_multi_ret, ret, "uprobe-multi-ret");
BENCH_TRIG_USERMODE(uretprobe_multi_nop, nop, "uretprobe-multi-nop");
BENCH_TRIG_USERMODE(uretprobe_multi_push, push, "uretprobe-multi-push");
BENCH_TRIG_USERMODE(uretprobe_multi_ret, ret, "uretprobe-multi-ret");
+#ifdef __x86_64__
+BENCH_TRIG_USERMODE(uprobe_nop5, nop5, "uprobe-nop5");
+BENCH_TRIG_USERMODE(uretprobe_nop5, nop5, "uretprobe-nop5");
+BENCH_TRIG_USERMODE(uprobe_multi_nop5, nop5, "uprobe-multi-nop5");
+BENCH_TRIG_USERMODE(uretprobe_multi_nop5, nop5, "uretprobe-multi-nop5");
+#endif
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
index af169f831f2f..03f55405484b 100755
--- a/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
+++ b/tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh
@@ -2,7 +2,7 @@
set -eufo pipefail
-for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret}
+for i in usermode-count syscall-count {uprobe,uretprobe}-{nop,push,ret,nop5}
do
summary=$(sudo ./bench -w2 -d5 -a trig-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
printf "%-15s: %s\n" $i "$summary"
--
2.48.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64
2025-02-24 14:01 [PATCH RFCv2 00/18] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
` (17 preceding siblings ...)
2025-02-24 14:01 ` [PATCH RFCv2 18/18] selftests/bpf: Add 5-byte nop uprobe trigger bench Jiri Olsa
@ 2025-02-24 18:46 ` Ingo Molnar
18 siblings, 0 replies; 33+ messages in thread
From: Ingo Molnar @ 2025-02-24 18:46 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
* Jiri Olsa <jolsa@kernel.org> wrote:
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
>
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
>
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.
>
> The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> original instructions) in a loop and counts how many of those happened
> per second (the unit below is million loops).
>
> There's big speed up if you consider current usdt implementation
> (uprobe-nop) compared to proposed usdt (uprobe-nop5):
>
> # ./benchs/run_bench_uprobes.sh
>
> usermode-count : 818.386 ± 1.886M/s
> syscall-count : 8.923 ± 0.003M/s
> --> uprobe-nop : 3.086 ± 0.005M/s
> uprobe-push : 2.751 ± 0.001M/s
> uprobe-ret : 1.481 ± 0.000M/s
> --> uprobe-nop5 : 4.016 ± 0.002M/s
> uretprobe-nop : 1.712 ± 0.008M/s
> uretprobe-push : 1.616 ± 0.001M/s
> uretprobe-ret : 1.052 ± 0.000M/s
> uretprobe-nop5 : 2.015 ± 0.000M/s
So I had to dig into patch #12 to see the magnitude of the speedup:
# current:
# usermode-count : 818.836 ± 2.842M/s
# syscall-count : 8.917 ± 0.003M/s
# uprobe-nop : 3.056 ± 0.013M/s
# uprobe-push : 2.903 ± 0.002M/s
# uprobe-ret : 1.533 ± 0.001M/s
# --> uprobe-nop5 : 1.492 ± 0.000M/s
# uretprobe-nop : 1.783 ± 0.000M/s
# uretprobe-push : 1.672 ± 0.001M/s
# uretprobe-ret : 1.067 ± 0.002M/s
# --> uretprobe-nop5 : 1.052 ± 0.000M/s
#
# after the change:
#
# usermode-count : 818.386 ± 1.886M/s
# syscall-count : 8.923 ± 0.003M/s
# uprobe-nop : 3.086 ± 0.005M/s
# uprobe-push : 2.751 ± 0.001M/s
# uprobe-ret : 1.481 ± 0.000M/s
# --> uprobe-nop5 : 4.016 ± 0.002M/s
# uretprobe-nop : 1.712 ± 0.008M/s
# uretprobe-push : 1.616 ± 0.001M/s
# uretprobe-ret : 1.052 ± 0.000M/s
# --> uretprobe-nop5 : 2.015 ± 0.000M/s
That's a +169% (4.016 vs 1.492 M/s) and a +91% (2.015 vs 1.052 M/s)
speedup - pretty darn impressive!
Thanks,
Ingo
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-02-24 14:01 ` [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
@ 2025-02-24 19:22 ` Alexei Starovoitov
2025-02-25 13:35 ` Jiri Olsa
0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2025-02-24 19:22 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, LKML,
linux-trace-kernel, X86 ML, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
Alan Maguire, David Laight, Thomas Weißschuh
On Mon, Feb 24, 2025 at 6:08 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> +SYSCALL_DEFINE0(uprobe)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> + unsigned long bp_vaddr;
> + int err;
> +
> + err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
> + if (err) {
> + force_sig(SIGILL);
> + return -1;
> + }
> +
> + /* Allow execution only from uprobe trampolines. */
> + if (!in_uprobe_trampoline(regs->ip)) {
> + force_sig(SIGILL);
> + return -1;
> + }
> +
> + handle_syscall_uprobe(regs, bp_vaddr - 5);
> + return 0;
> +}
> +
> +asm (
> + ".pushsection .rodata\n"
> + ".balign " __stringify(PAGE_SIZE) "\n"
> + "uprobe_trampoline_entry:\n"
> + "endbr64\n"
why is endbr there?
The trampoline is called with a direct call.
> + "push %rcx\n"
> + "push %r11\n"
> + "push %rax\n"
> + "movq $" __stringify(__NR_uprobe) ", %rax\n"
To avoid introducing a new syscall for a very similar operation
can we disambiguate uprobe vs uretprobe via %rdi or
some other way?
imo not too late to change uretprobe api.
Maybe it was discussed already.
> + "syscall\n"
> + "pop %rax\n"
> + "pop %r11\n"
> + "pop %rcx\n"
> + "ret\n"
In later patches I see nop5 is replaced with a call to
uprobe_trampoline_entry, but which part saves
rdi and other regs?
Compiler doesn't automatically spill/fill around USDT's nop/nop5.
Selftest is doing:
+__naked noinline void uprobe_test(void)
so just lucky ?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-02-24 19:22 ` Alexei Starovoitov
@ 2025-02-25 13:35 ` Jiri Olsa
2025-02-25 17:10 ` Andrii Nakryiko
2025-02-25 18:06 ` Alexei Starovoitov
0 siblings, 2 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-25 13:35 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, LKML,
linux-trace-kernel, X86 ML, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
Alan Maguire, David Laight, Thomas Weißschuh
On Mon, Feb 24, 2025 at 11:22:42AM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 24, 2025 at 6:08 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > +SYSCALL_DEFINE0(uprobe)
> > +{
> > + struct pt_regs *regs = task_pt_regs(current);
> > + unsigned long bp_vaddr;
> > + int err;
> > +
> > + err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
> > + if (err) {
> > + force_sig(SIGILL);
> > + return -1;
> > + }
> > +
> > + /* Allow execution only from uprobe trampolines. */
> > + if (!in_uprobe_trampoline(regs->ip)) {
> > + force_sig(SIGILL);
> > + return -1;
> > + }
> > +
> > + handle_syscall_uprobe(regs, bp_vaddr - 5);
> > + return 0;
> > +}
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".balign " __stringify(PAGE_SIZE) "\n"
> > + "uprobe_trampoline_entry:\n"
> > + "endbr64\n"
>
> why is endbr there?
> The trampoline is called with a direct call.
ok, that's wrong, will remove that
>
> > + "push %rcx\n"
> > + "push %r11\n"
> > + "push %rax\n"
> > + "movq $" __stringify(__NR_uprobe) ", %rax\n"
>
> To avoid introducing a new syscall for a very similar operation
> can we disambiguate uprobe vs uretprobe via %rdi or
> some other way?
> imo not too late to change uretprobe api.
> Maybe it was discussed already.
yes, I recall discussing that early during the uretprobe work, with the
decision to have separate syscalls for uprobe and uretprobe.. however wrt the
recent seccomp changes, it might be easier to just add an argument to the
uretprobe syscall to handle uprobe
too bad it's not the other way around.. a uprobe syscall with an argument to
do uretprobe would sound better
>
> > + "syscall\n"
> > + "pop %rax\n"
> > + "pop %r11\n"
> > + "pop %rcx\n"
> > + "ret\n"
>
> In later patches I see nop5 is replaced with a call to
> uprobe_trampoline_entry, but which part saves
> rdi and other regs?
> Compiler doesn't automatically spill/fill around USDT's nop/nop5.
> Selftest is doing:
> +__naked noinline void uprobe_test(void)
> so just lucky ?
if you mean the registers that would carry the usdt arguments, ebpf programs
access those based on the assembler operand string stored in the usdt record:
stapsdt 0x00000048 NT_STAPSDT (SystemTap probe descriptors)
Provider: test
Name: usdt3
Location: 0x0000000000712f2f, Base: 0x0000000002f516b0, Semaphore: 0x0000000003348ec2
-> Arguments: -4@-1192(%rbp) -8@-1200(%rbp) 8@%rax
it's up to the bpf program to know which register(+offset) to access; libbpf
has all this logic hidden behind usdt_manager_attach_usdt and the bpf_usdt_arg
bpf call
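fwiw, a minimal sketch of the bpf side (provider/arg indexes made up),
using libbpf's usdt.bpf.h helpers:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/usdt.bpf.h>

char _license[] SEC("license") = "GPL";

SEC("usdt")
int read_usdt_args(struct pt_regs *ctx)
{
	long a0, a2;

	/* libbpf resolves the "-4@-1192(%rbp)" style spec behind this call */
	bpf_usdt_arg(ctx, 0, &a0);
	bpf_usdt_arg(ctx, 2, &a2);
	bpf_printk("arg0 %ld arg2 %ld", a0, a2);
	return 0;
}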
the trampoline only saves rcx/r11/rax, because they are changed by the syscall
instruction
but actually I forgot to load these saved values (of rcx/r11/rax) and rsp into
the regs that are passed to the ebpf program (like we do in the uretprobe
syscall), will fix that in the next version
I'll add tests for optimized usdt with more arguments
thanks,
jirka
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-02-25 13:35 ` Jiri Olsa
@ 2025-02-25 17:10 ` Andrii Nakryiko
2025-02-25 18:06 ` Alexei Starovoitov
1 sibling, 0 replies; 33+ messages in thread
From: Andrii Nakryiko @ 2025-02-25 17:10 UTC (permalink / raw)
To: Jiri Olsa
Cc: Alexei Starovoitov, Oleg Nesterov, Peter Zijlstra,
Andrii Nakryiko, bpf, LKML, linux-trace-kernel, X86 ML, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
On Tue, Feb 25, 2025 at 5:35 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Feb 24, 2025 at 11:22:42AM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 24, 2025 at 6:08 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > +SYSCALL_DEFINE0(uprobe)
> > > +{
> > > + struct pt_regs *regs = task_pt_regs(current);
> > > + unsigned long bp_vaddr;
> > > + int err;
> > > +
> > > + err = copy_from_user(&bp_vaddr, (void __user *)regs->sp + 3*8, sizeof(bp_vaddr));
> > > + if (err) {
> > > + force_sig(SIGILL);
> > > + return -1;
> > > + }
> > > +
> > > + /* Allow execution only from uprobe trampolines. */
> > > + if (!in_uprobe_trampoline(regs->ip)) {
> > > + force_sig(SIGILL);
> > > + return -1;
> > > + }
> > > +
> > > + handle_syscall_uprobe(regs, bp_vaddr - 5);
> > > + return 0;
> > > +}
> > > +
> > > +asm (
> > > + ".pushsection .rodata\n"
> > > + ".balign " __stringify(PAGE_SIZE) "\n"
> > > + "uprobe_trampoline_entry:\n"
> > > + "endbr64\n"
> >
> > why is endbr there?
> > The trampoline is called with a direct call.
>
> ok, that's wrong, will remove that
>
> >
> > > + "push %rcx\n"
> > > + "push %r11\n"
> > > + "push %rax\n"
> > > + "movq $" __stringify(__NR_uprobe) ", %rax\n"
> >
> > To avoid introducing a new syscall for a very similar operation
> > can we disambiguate uprobe vs uretprobe via %rdi or
> > some other way?
> > imo not too late to change uretprobe api.
> > Maybe it was discussed already.
>
> yes, I recall discussing that early during the uretprobe work, with the
> decision to have separate syscalls for uprobe and uretprobe.. however, wrt
> the recent seccomp changes it might be easier to just add an argument to the
> uretprobe syscall to handle uprobe as well
>
> too bad it's not the other way around.. a uprobe syscall with an argument to
> do uretprobe would sound better
It's an "internal" syscall, why can't we rename it, if we want to?
Though I'm not sure I see the problem having both sys_uprobe and
sys_uretprobe, tbh. We just add sys_uprobe to the same list(s) that
sys_uretprobe is in for seccomp.
* Re: [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-02-25 13:35 ` Jiri Olsa
2025-02-25 17:10 ` Andrii Nakryiko
@ 2025-02-25 18:06 ` Alexei Starovoitov
2025-02-26 2:36 ` Alexei Starovoitov
1 sibling, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2025-02-25 18:06 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, LKML,
linux-trace-kernel, X86 ML, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
Alan Maguire, David Laight, Thomas Weißschuh
On Tue, Feb 25, 2025 at 5:35 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > In later patches I see nop5 is replaced with a call to
> > uprobe_trampoline_entry, but which part saves
> > rdi and other regs?
> > Compiler doesn't automatically spill/fill around USDT's nop/nop5.
> > Selftest is doing:
> > +__naked noinline void uprobe_test(void)
> > so just lucky ?
>
> if you mean the registers that carry the usdt arguments, ebpf programs
> access those based on the assembler operand string stored in the usdt record:
No. I'm talking about all the normal registers that the trap-style uprobe
preserves, but that this nop5->call will scratch.
Instead of void uprobe_test(void), add some arguments to it, and read them
before and after the nop5 uprobe. They must remain the same.
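
something along these lines, purely illustrative:

  /* values in argument registers must survive the nop5 -> call rewrite */
  noinline long uprobe_regs_test(long a, long b, long c)
  {
          /* 5-byte nop, the probed instruction */
          asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
          return a + b + c;
  }

  ...
  ASSERT_EQ(uprobe_regs_test(1, 2, 3), 6, "regs_preserved");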
* Re: [PATCH RFCv2 08/18] uprobes/x86: Add uprobe syscall to speed up uprobe
2025-02-25 18:06 ` Alexei Starovoitov
@ 2025-02-26 2:36 ` Alexei Starovoitov
0 siblings, 0 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2025-02-26 2:36 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, LKML,
linux-trace-kernel, X86 ML, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
Alan Maguire, David Laight, Thomas Weißschuh
On Tue, Feb 25, 2025 at 10:06 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> No. I'm talking about all the normal registers that the trap-style uprobe
> preserves, but that this nop5->call will scratch.
> Instead of void uprobe_test(void), add some arguments to it, and read them
> before and after the nop5 uprobe. They must remain the same.
Ignore me. It's a syscall insn. All fine.
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-24 14:01 ` [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes Jiri Olsa
@ 2025-02-28 18:55 ` Andrii Nakryiko
2025-02-28 22:55 ` Jiri Olsa
2025-02-28 23:00 ` Jiri Olsa
1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2025-02-28 18:55 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
>
> The current uprobe execution goes through the following:
> - installs breakpoint instruction over original instruction
> - exception handler hit and calls related uprobe consumers
> - and either simulates original instruction or does out of line single step
> execution of it
> - returns to user space
>
> The optimized uprobe path
>
> - checks the original instruction is 5-byte nop (plus other checks)
> - adds (or uses existing) user space trampoline and overwrites original
> instruction (5-byte nop) with call to user space trampoline
> - the user space trampoline executes uprobe syscall that calls related uprobe
> consumers
> - trampoline returns back to next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as the
> original instruction, but we could use nop5 as the USDT probe instruction
> (which uses a single-byte nop ATM) and speed up the USDT probes.
>
> This patch overloads related arch functions in uprobe_write_opcode and
> set_orig_insn so they can install call instruction if needed.
>
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> the first uprobe hit. I originally had it called on uprobe installation, but
> then it clashed with the elf loader, because the user space trampoline was
> added in a place where the loader might need to put elf segments, so I
> decided to do it after the first uprobe hit when loading is done.
>
> We do not unmap and release the uprobe trampoline when it's no longer needed,
> because there's no easy way to make sure none of the threads is still
> inside the trampoline. But we do not waste memory, because there's just a
> single page for all the uprobe trampoline mappings.
>
> We do waste a page frame on the page mapping for every 4GB by keeping the
> uprobe trampoline page mapped, but that seems ok.
>
> Attaching the speed up from benchs/run_bench_uprobes.sh script:
>
> current:
> usermode-count : 818.836 ± 2.842M/s
> syscall-count : 8.917 ± 0.003M/s
> uprobe-nop : 3.056 ± 0.013M/s
> uprobe-push : 2.903 ± 0.002M/s
> uprobe-ret : 1.533 ± 0.001M/s
> --> uprobe-nop5 : 1.492 ± 0.000M/s
> uretprobe-nop : 1.783 ± 0.000M/s
> uretprobe-push : 1.672 ± 0.001M/s
> uretprobe-ret : 1.067 ± 0.002M/s
> --> uretprobe-nop5 : 1.052 ± 0.000M/s
>
> after the change:
>
> usermode-count : 818.386 ± 1.886M/s
> syscall-count : 8.923 ± 0.003M/s
> uprobe-nop : 3.086 ± 0.005M/s
> uprobe-push : 2.751 ± 0.001M/s
> uprobe-ret : 1.481 ± 0.000M/s
> --> uprobe-nop5 : 4.016 ± 0.002M/s
> uretprobe-nop : 1.712 ± 0.008M/s
> uretprobe-push : 1.616 ± 0.001M/s
> uretprobe-ret : 1.052 ± 0.000M/s
> --> uretprobe-nop5 : 2.015 ± 0.000M/s
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> arch/x86/include/asm/uprobes.h | 6 ++
> arch/x86/kernel/uprobes.c | 191 ++++++++++++++++++++++++++++++++-
> include/linux/uprobes.h | 6 +-
> kernel/events/uprobes.c | 16 ++-
> 4 files changed, 209 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..7d4df920bb59 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> #define UPROBE_SWBP_INSN 0xcc
> #define UPROBE_SWBP_INSN_SIZE 1
>
> +enum {
> + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> +};
> +
> struct uprobe_xol_ops;
>
> struct arch_uprobe {
> @@ -45,6 +49,8 @@ struct arch_uprobe {
> u8 ilen;
> } push;
> };
> +
> + unsigned long flags;
> };
>
> struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index e8aebbda83bc..73ddff823904 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
> #include <asm/processor.h>
> #include <asm/insn.h>
> #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>
> /* Post-execution fixups. */
>
> @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> return NULL;
> }
>
> -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> {
> struct uprobes_state *state = &current->mm->uprobes_state;
> struct uprobe_trampoline *tramp = NULL;
> @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> kfree(tramp);
> }
>
> -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> {
> if (tramp == NULL)
> return;
> @@ -807,6 +808,7 @@ struct mm_uprobe {
> struct rb_node rb_node;
> unsigned long auprobe;
> unsigned long vaddr;
> + bool optimized;
> };
>
I'm trying to understand if this RB-tree based mm_uprobe is strictly
necessary. Is it? Sure, we keep the optimized flag, but that's more for
defensive checks, no? Is there any other reason we need this separate
lookup data structure?
> #define __node_2_mm_uprobe(node) rb_entry((node), struct mm_uprobe, rb_node)
> @@ -874,6 +876,7 @@ static struct mm_uprobe *insert_mm_uprobe(struct mm_struct *mm, struct arch_upro
> if (mmu) {
> mmu->auprobe = (unsigned long) auprobe;
> mmu->vaddr = vaddr;
> + mmu->optimized = false;
> RB_CLEAR_NODE(&mmu->rb_node);
> rb_add(&mmu->rb_node, &mm->uprobes_state.root_uprobes, __mm_uprobe_less);
> }
[...]
* Re: [PATCH RFCv2 06/18] uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
2025-02-24 14:01 ` [PATCH RFCv2 06/18] uprobes: Add orig argument to uprobe_write and uprobe_write_opcode Jiri Olsa
@ 2025-02-28 19:07 ` Andrii Nakryiko
2025-02-28 23:12 ` Jiri Olsa
0 siblings, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2025-02-28 19:07 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Mon, Feb 24, 2025 at 6:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> uprobe_write has a special path to restore the original page when
> we write the original instruction back.
>
> This happens when uprobe_write detects that we want to write anything
> other than the breakpoint instruction.
>
> In the following changes we want to use the uprobe_write function for
> multiple updates, so add a new function argument to denote that this is an
> update back to the original instruction. This way uprobe_write can make the
> appropriate checks and restore the original page when possible.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> arch/arm/probes/uprobes/core.c | 2 +-
> include/linux/uprobes.h | 5 +++--
> kernel/events/uprobes.c | 22 ++++++++++------------
> 3 files changed, 14 insertions(+), 15 deletions(-)
>
[...]
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index ad5879fc2d26..2b542043089e 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -471,25 +471,23 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> * Return 0 (success) or a negative errno.
> */
> int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> - unsigned long vaddr, uprobe_opcode_t opcode)
> + unsigned long vaddr, uprobe_opcode_t opcode, bool orig)
> {
> - return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
> + return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode, orig);
> }
>
> int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
> unsigned long vaddr, uprobe_opcode_t *insn,
> - int nbytes, uprobe_write_verify_t verify)
> + int nbytes, uprobe_write_verify_t verify, bool orig)
why not call orig -> is_register and avoid a bunch of code churn?...
(and while "is_register" is not an awesome name, it's still a bit clearer
than "orig", IMO)
> {
> struct page *old_page, *new_page;
> struct vm_area_struct *vma;
> - int ret, is_register;
> + int ret;
> bool orig_page_huge = false;
> unsigned int gup_flags = FOLL_FORCE;
>
> - is_register = is_swbp_insn(insn);
> -
[...]
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-28 18:55 ` Andrii Nakryiko
@ 2025-02-28 22:55 ` Jiri Olsa
2025-02-28 23:00 ` Andrii Nakryiko
0 siblings, 1 reply; 33+ messages in thread
From: Jiri Olsa @ 2025-02-28 22:55 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > SNIP
>
> I'm trying to understand if this RB-tree based mm_uprobe is strictly
> necessary. Is it? Sure, we keep the optimized flag, but that's more for
> defensive checks, no? Is there any other reason we need this separate
> lookup data structure?
so the call instruction update is done in 2 locked steps:
- first we write the breakpoint as part of normal uprobe registration
- then the uprobe is hit and we overwrite the breakpoint with the call
  instruction

in between we could race with another thread that could either unregister the
uprobe or try to optimize the uprobe as well

I think we either need to keep the state of the uprobe per process (mm_struct),
or we would need to read the probed instruction each time we need to make a
decision based on what state we are at (nop5, breakpoint, call)
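
e.g. the read-back variant could look something like this
(uprobe_read_insn is a made-up name for a copy_from_page based helper):

  enum probe_state { PS_NOP5, PS_SWBP, PS_CALL, PS_OTHER };

  static enum probe_state probe_state(struct mm_struct *mm, unsigned long vaddr)
  {
          u8 insn[5];

          if (uprobe_read_insn(mm, vaddr, insn, sizeof(insn)))
                  return PS_OTHER;
          if (insn[0] == UPROBE_SWBP_INSN)        /* int3 */
                  return PS_SWBP;
          if (insn[0] == CALL_INSN_OPCODE)        /* call rel32 */
                  return PS_CALL;
          if (!memcmp(insn, x86_nops[5], 5))
                  return PS_NOP5;
          return PS_OTHER;
  }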
jirka
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-28 22:55 ` Jiri Olsa
@ 2025-02-28 23:00 ` Andrii Nakryiko
2025-02-28 23:18 ` Jiri Olsa
0 siblings, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2025-02-28 23:00 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> [...]
>
> so the call instruction update is done in 2 locked steps:
> - first we write the breakpoint as part of normal uprobe registration
> - then the uprobe is hit and we overwrite the breakpoint with the call
>   instruction
>
> in between we could race with another thread that could either unregister the
> uprobe or try to optimize the uprobe as well
>
> I think we either need to keep the state of the uprobe per process
> (mm_struct), or we would need to read the probed instruction each time we
> need to make a decision based on what state we are at (nop5, breakpoint,
> call)
This decision is only done in "slow path", right? Only when
registering/unregistering. And those operations are done under lock.
So reading those 5 bytes every time we register/unregister seems
completely acceptable, rather than now *also* having a per-mm uprobe
lookup tree.
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-24 14:01 ` [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2025-02-28 18:55 ` Andrii Nakryiko
@ 2025-02-28 23:00 ` Jiri Olsa
1 sibling, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-28 23:00 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko
Cc: bpf, linux-kernel, linux-trace-kernel, x86, Song Liu,
Yonghong Song, John Fastabend, Hao Luo, Steven Rostedt,
Masami Hiramatsu, Alan Maguire, David Laight,
Thomas Weißschuh
On Mon, Feb 24, 2025 at 03:01:44PM +0100, Jiri Olsa wrote:
SNIP
> @@ -1523,15 +1698,23 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
> {
> int rasize = sizeof_long(regs), nleft;
> unsigned long orig_ret_vaddr = 0; /* clear high bits for 32-bit apps */
> + unsigned long off = 0;
> +
> + /*
> + * Optimized uprobe goes through uprobe trampoline which adds 4 8-byte
> + * values on stack, check uprobe_trampoline_entry for details.
> + */
> + if (!swbp)
> + off = 4*8;
ok, now that I've started to add the missing register modifications in the
uprobe syscall, I realized we will already modify regs->sp appropriately in
the uprobe syscall (before the code above is hit)

so we don't need this code and we can get rid of the swbp flag and patch#7
completely
jirka
>
> - if (copy_from_user(&orig_ret_vaddr, (void __user *)regs->sp, rasize))
> + if (copy_from_user(&orig_ret_vaddr, (void __user *)regs->sp + off, rasize))
> return -1;
>
> /* check whether address has been already hijacked */
> if (orig_ret_vaddr == trampoline_vaddr)
> return orig_ret_vaddr;
>
> - nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
> + nleft = copy_to_user((void __user *)regs->sp + off, &trampoline_vaddr, rasize);
> if (likely(!nleft)) {
> if (shstk_update_last_frame(trampoline_vaddr)) {
> force_sig(SIGSEGV);
SNIP
* Re: [PATCH RFCv2 06/18] uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
2025-02-28 19:07 ` Andrii Nakryiko
@ 2025-02-28 23:12 ` Jiri Olsa
0 siblings, 0 replies; 33+ messages in thread
From: Jiri Olsa @ 2025-02-28 23:12 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Fri, Feb 28, 2025 at 11:07:38AM -0800, Andrii Nakryiko wrote:
> On Mon, Feb 24, 2025 at 6:03 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > [...]
>
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index ad5879fc2d26..2b542043089e 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -471,25 +471,23 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> > * Return 0 (success) or a negative errno.
> > */
> > int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > - unsigned long vaddr, uprobe_opcode_t opcode)
> > + unsigned long vaddr, uprobe_opcode_t opcode, bool orig)
> > {
> > - return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode);
> > + return uprobe_write(auprobe, mm, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE, verify_opcode, orig);
> > }
> >
> > int uprobe_write(struct arch_uprobe *auprobe, struct mm_struct *mm,
> > unsigned long vaddr, uprobe_opcode_t *insn,
> > - int nbytes, uprobe_write_verify_t verify)
> > + int nbytes, uprobe_write_verify_t verify, bool orig)
>
> why not call orig -> is_register and avoid a bunch of code churn?...
> (and while "is_register" is not an awesome name, it's still a bit clearer
> than "orig", IMO)
I see the logic in the function the other way around: if you want to try
and load the original page as part of your update, then you pass true to the
orig argument

also, is_register makes sense to me in the old code where you have just 2
states: breakpoint or original instruction ... now that we've added the call
instruction, is_register would need to cover that as well
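
e.g. with the orig semantics the two main call sites would read as
(a sketch against the uprobe_write_opcode signature quoted above):

  /* install: write the breakpoint, don't try to restore the original page */
  uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN, false);

  /* uninstall: write the original instruction back, restoring the
   * original page when possible */
  uprobe_write_opcode(auprobe, mm, vaddr,
                      *(uprobe_opcode_t *)&auprobe->insn, true);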
jirka
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-28 23:00 ` Andrii Nakryiko
@ 2025-02-28 23:18 ` Jiri Olsa
2025-02-28 23:27 ` Andrii Nakryiko
0 siblings, 1 reply; 33+ messages in thread
From: Jiri Olsa @ 2025-02-28 23:18 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf,
linux-kernel, linux-trace-kernel, x86, Song Liu, Yonghong Song,
John Fastabend, Hao Luo, Steven Rostedt, Masami Hiramatsu,
Alan Maguire, David Laight, Thomas Weißschuh
On Fri, Feb 28, 2025 at 03:00:22PM -0800, Andrii Nakryiko wrote:
> [...]
>
> This decision is only done in "slow path", right? Only when
> registering/unregistering. And those operations are done under lock.
> So reading those 5 bytes every time we register/unregister seems
> completely acceptable, rather than now *also* having a per-mm uprobe
> lookup tree.
true.. I was also thinking about having another flag in that tree for when we
fail to optimize the uprobe for some other reason than it being on page
alignment.. without such a flag we'd need to read the 5 bytes each time we hit
that uprobe.. but that might be a rare case
jirka
* Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
2025-02-28 23:18 ` Jiri Olsa
@ 2025-02-28 23:27 ` Andrii Nakryiko
0 siblings, 0 replies; 33+ messages in thread
From: Andrii Nakryiko @ 2025-02-28 23:27 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Andrii Nakryiko, bpf, linux-kernel,
linux-trace-kernel, x86, Song Liu, Yonghong Song, John Fastabend,
Hao Luo, Steven Rostedt, Masami Hiramatsu, Alan Maguire,
David Laight, Thomas Weißschuh
On Fri, Feb 28, 2025 at 3:18 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> [...]
>
> true.. I was also thinking about having another flag in that tree for when
> we fail to optimize the uprobe for some other reason than it being on page
> alignment.. without such a flag we'd need to read the 5 bytes each time we
> hit that uprobe.. but that might be a rare case
that's another problematic part, IMO, that we'll be doing an upgrade from int3
to call (for nop5) when the uprobe is actually hit. So now you need to do this
mm_uprobe lookup in the hot path, right? We spent tons of effort to optimize
the hot path and now we'll be adding another lookup there :(
but I also don't have an alternative... I was thinking about how we choose the
[vdso] address and whether we can use a similar approach for
[uprobe-trampoline], but [vdso] doesn't have the constraint of being within
the +/-2GB range, so that probably won't work.

anyways, just a bit unfortunate
P.S. Maybe we should keep the flag of whether we tried to optimize (and
failed) in the global uprobe information, not per-mm? I.e., if we failed to
find a place for the uprobe trampoline for any mm, no subsequent ones would
try this? It's likely that either all or none of the processes will
fail/succeed the nop5 optimization, right?
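
something like this, extending the flags field from patch 12 (the FAILED
bit and the __arch_uprobe_optimize helper are hypothetical):

  enum {
          ARCH_UPROBE_FLAG_CAN_OPTIMIZE    = 0,
          ARCH_UPROBE_FLAG_OPTIMIZE_FAILED = 1,   /* hypothetical */
  };

  void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
  {
          if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags))
                  return;
          /* some mm already failed, don't retry for any other mm */
          if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAILED, &auprobe->flags))
                  return;
          if (__arch_uprobe_optimize(auprobe, vaddr))
                  set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAILED, &auprobe->flags);
  }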