From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
To: Jiri Olsa <jolsa@kernel.org>
Cc: "Oleg Nesterov" <oleg@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org, x86@kernel.org,
	"Song Liu" <songliubraving@fb.com>, "Yonghong Song" <yhs@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Hao Luo" <haoluo@google.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Masami Hiramatsu" <mhiramat@kernel.org>,
	"Alan Maguire" <alan.maguire@oracle.com>,
	"David Laight" <David.Laight@ACULAB.COM>,
	"Thomas Weißschuh" <thomas@t-8ch.de>,
	"Ingo Molnar" <mingo@kernel.org>
Subject: Re: [PATCHv5 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe
Date: Mon, 14 Jul 2025 17:39:15 +0900
Message-ID: <20250714173915.b9edd474742de46bcbe9c617@kernel.org>
In-Reply-To: <20250711082931.3398027-10-jolsa@kernel.org>

On Fri, 11 Jul 2025 10:29:17 +0200
Jiri Olsa <jolsa@kernel.org> wrote:

> Adding new uprobe syscall that calls uprobe handlers for given
> 'breakpoint' address.
> 
> The idea is that the 'breakpoint' address calls the user space
> trampoline which executes the uprobe syscall.
> 
> The syscall handler reads the return address of the initial call
> to retrieve the original 'breakpoint' address. With this address
> we find the related uprobe object and call its consumers.
> 
> Adding the arch_uprobe_trampoline_mapping function that provides
> uprobe trampoline mapping. This mapping is backed with one global
> page initialized at __init time and shared by the all the mapping
> instances.
> 
> We do not allow to execute uprobe syscall if the caller is not
> from uprobe trampoline mapping.
> 
> The uprobe syscall ensures the consumer (bpf program) sees registers
> values in the state before the trampoline was called.
> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/kernel/uprobes.c              | 122 +++++++++++++++++++++++++
>  include/linux/syscalls.h               |   2 +
>  include/linux/uprobes.h                |   1 +
>  kernel/events/uprobes.c                |  17 ++++
>  kernel/sys_ni.c                        |   1 +
>  6 files changed, 144 insertions(+)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index cfb5ca41e30d..9fd1291e7bdf 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -345,6 +345,7 @@
>  333	common	io_pgetevents		sys_io_pgetevents
>  334	common	rseq			sys_rseq
>  335	common	uretprobe		sys_uretprobe
> +336	common	uprobe			sys_uprobe
>  # don't use numbers 387 through 423, add new calls after the last
>  # 'common' entry
>  424	common	pidfd_send_signal	sys_pidfd_send_signal
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 6c4dcbdd0c3c..5eecab712376 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -752,6 +752,128 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
>  	hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
>  		destroy_uprobe_trampoline(tramp);
>  }
> +
> +static bool __in_uprobe_trampoline(unsigned long ip)
> +{
> +	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +
> +	return vma && vma_is_special_mapping(vma, &tramp_mapping);
> +}
> +
> +static bool in_uprobe_trampoline(unsigned long ip)
> +{
> +	struct mm_struct *mm = current->mm;
> +	bool found, retry = true;
> +	unsigned int seq;
> +
> +	rcu_read_lock();
> +	if (mmap_lock_speculate_try_begin(mm, &seq)) {
> +		found = __in_uprobe_trampoline(ip);
> +		retry = mmap_lock_speculate_retry(mm, seq);
> +	}
> +	rcu_read_unlock();
> +
> +	if (retry) {
> +		mmap_read_lock(mm);
> +		found = __in_uprobe_trampoline(ip);
> +		mmap_read_unlock(mm);
> +	}
> +	return found;
> +}
> +
> +SYSCALL_DEFINE0(uprobe)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +	unsigned long ip, sp, ax_r11_cx_ip[4];
> +	int err;
> +
> +	/* Allow execution only from uprobe trampolines. */
> +	if (!in_uprobe_trampoline(regs->ip))
> +		goto sigill;
> +

/*
 * When the syscall is made from the trampoline (which is itself entered
 * by a call from the probed address), the stack looks like this:
 *  regs->sp[0]: [rax]
 *          [1]: [r11]
 *          [2]: [rcx]
 *          [3]: [return-address] (probed address + sizeof(call-instruction))
 *
 * And `&regs->sp[4]` should be the `sp` value at the time the probe was hit.
 */
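
The layout above can be sketched as a small userspace model. This is only
an illustration, not code from the patch: `struct probe_ctx`,
`decode_trampoline_frame()` and `CALL_INSN_SIZE` are hypothetical names,
and the sketch assumes the probe site was rewritten to a 5-byte
`call rel32`, which is what the `- 5` in the syscall handler implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names come from the patch. */
#define CALL_INSN_SIZE 5 /* assumed: 5-byte x86-64 "call rel32" */

struct probe_ctx {
	uint64_t ax, r11, cx; /* register values saved by the trampoline */
	uint64_t ip;          /* address of the probed instruction */
	uint64_t sp;          /* stack pointer when the probe was hit */
};

/*
 * frame models the four slots found at regs->sp on syscall entry:
 * [0]=rax, [1]=r11, [2]=rcx, [3]=return address of the call.
 * frame_addr is the user-space address of frame[0], i.e. regs->sp.
 */
static struct probe_ctx decode_trampoline_frame(const uint64_t frame[4],
						uint64_t frame_addr)
{
	struct probe_ctx ctx;

	/* A return address must lie past the call instruction itself. */
	assert(frame[3] >= CALL_INSN_SIZE);

	ctx.ax  = frame[0];
	ctx.r11 = frame[1];
	ctx.cx  = frame[2];
	ctx.ip  = frame[3] - CALL_INSN_SIZE;           /* probed address */
	ctx.sp  = frame_addr + 4 * sizeof(uint64_t);   /* &regs->sp[4] */
	return ctx;
}
```

For a frame at address 0x7ffc0000 whose last slot holds 0x401005, the
model reports a probed address of 0x401000 and an `sp` of 0x7ffc0020.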

> +	err = copy_from_user(ax_r11_cx_ip, (void __user *)regs->sp, sizeof(ax_r11_cx_ip));
> +	if (err)
> +		goto sigill;
> +
> +	ip = regs->ip;
> +
> +	/*
> +	 * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> +	 * - adjust ip to the probe address, call saved next instruction address
> +	 * - adjust sp to the probe's stack frame (check trampoline code)
> +	 */
> +	regs->ax  = ax_r11_cx_ip[0];
> +	regs->r11 = ax_r11_cx_ip[1];
> +	regs->cx  = ax_r11_cx_ip[2];
> +	regs->ip  = ax_r11_cx_ip[3] - 5;
> +	regs->sp += sizeof(ax_r11_cx_ip);
> +	regs->orig_ax = -1;
> +
> +	sp = regs->sp;
> +
> +	handle_syscall_uprobe(regs, regs->ip);
> +
> +	/*
> +	 * Some of the uprobe consumers has changed sp, we can do nothing,
> +	 * just return via iret.
> +	 */

Do we allow consumers to change the `sp`? It seems dangerous
because the consumer would need to know whether it was called from
the breakpoint path or the syscall path. Note that it has to set up
ax, r11 and cx on the stack correctly only when it is called from the
syscall path, which is not compatible with breakpoint mode.

> +	if (regs->sp != sp)
> +		return regs->ax;

Shouldn't we recover regs->ip? Or in this case does the consumer have
to change ip (== return address from the trampoline) too?

IMHO, consumers should not be allowed to change `sp` and `ip` directly
in syscall mode. In the case of kprobes, kprobe jump optimization
must be disabled explicitly (e.g. by setting a dummy post_handler)
if the handler changes `ip`.

Or, even if modifying `sp` and `ip` is allowed, this function should
help with it, e.g. by pushing the dummy regs->ax/r11/cx onto the
new stack at the new `regs->sp`. This would allow modifying those
registers transparently, the same as in breakpoint mode.
In that case, I think we just need to remove the above 2 lines.
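
To make the second option concrete, here is a rough userspace model of
the write-back step with the early return dropped. It is only a sketch
under stated assumptions, not kernel code: `struct model_regs`,
`write_back_frame()` and the flat `mem`/`mem_base` view of user memory are
all hypothetical, and it models only the part of the handler that follows
the consumer call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names come from the patch. */
#define CALL_INSN_SIZE 5 /* assumed: 5-byte x86-64 "call rel32" */
#define FRAME_SLOTS    4 /* rax, r11, rcx, return address */

struct model_regs {
	uint64_t ax, r11, cx, ip, sp;
};

/*
 * Rebuild the trampoline frame below whatever sp the consumer left in
 * regs, so the trampoline's pop/pop/pop/ret sequence still works.
 *
 * mem models user memory starting at address mem_base; orig_ret is the
 * return address read from the stack on entry; tramp_ip is the address
 * inside the trampoline where execution resumes after the syscall.
 */
static void write_back_frame(struct model_regs *regs, uint64_t *mem,
			     uint64_t mem_base, uint64_t orig_ret,
			     uint64_t tramp_ip)
{
	uint64_t *frame;

	/* The new frame must fit inside the modelled memory. */
	assert(regs->sp >= mem_base + FRAME_SLOTS * sizeof(uint64_t));

	regs->sp -= FRAME_SLOTS * sizeof(uint64_t);
	frame = mem + (regs->sp - mem_base) / sizeof(uint64_t);

	/* Pick up any changes the consumer made to ax/r11/cx. */
	frame[0] = regs->ax;
	frame[1] = regs->r11;
	frame[2] = regs->cx;

	/* Keep the return address unless the consumer changed ip. */
	frame[3] = (regs->ip == orig_ret - CALL_INSN_SIZE) ? orig_ret
							    : regs->ip;

	/* Resume in the trampoline, as the handler does for regs->ip. */
	regs->ip = tramp_ip;
}
```

With this shape the consumer never has to know it ran in syscall mode:
whether or not it moved `sp`, the four slots are rebuilt directly below
the final `sp` value.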

> +
> +	regs->sp -= sizeof(ax_r11_cx_ip);
> +
> +	/* for the case uprobe_consumer has changed ax/r11/cx */
> +	ax_r11_cx_ip[0] = regs->ax;
> +	ax_r11_cx_ip[1] = regs->r11;
> +	ax_r11_cx_ip[2] = regs->cx;
> +
> +	/* keep return address unless we are instructed otherwise */
> +	if (ax_r11_cx_ip[3] - 5 != regs->ip)
> +		ax_r11_cx_ip[3] = regs->ip;
> +
> +	regs->ip = ip;
> +
> +	err = copy_to_user((void __user *)regs->sp, ax_r11_cx_ip, sizeof(ax_r11_cx_ip));
> +	if (err)
> +		goto sigill;

... because the code above does everything we need.

Thank you,

> +
> +	/* ensure sysret, see do_syscall_64() */
> +	regs->r11 = regs->flags;
> +	regs->cx  = regs->ip;
> +	return 0;
> +
> +sigill:
> +	force_sig(SIGILL);
> +	return -1;
> +}
> +
> +asm (
> +	".pushsection .rodata\n"
> +	".balign " __stringify(PAGE_SIZE) "\n"
> +	"uprobe_trampoline_entry:\n"
> +	"push %rcx\n"
> +	"push %r11\n"
> +	"push %rax\n"
> +	"movq $" __stringify(__NR_uprobe) ", %rax\n"
> +	"syscall\n"
> +	"pop %rax\n"
> +	"pop %r11\n"
> +	"pop %rcx\n"
> +	"ret\n"
> +	".balign " __stringify(PAGE_SIZE) "\n"
> +	".popsection\n"
> +);
> +
> +extern u8 uprobe_trampoline_entry[];
> +
> +static int __init arch_uprobes_init(void)
> +{
> +	tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> +	return 0;
> +}
> +
> +late_initcall(arch_uprobes_init);
> +
>  #else /* 32-bit: */
>  /*
>   * No RIP-relative addressing on 32-bit
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e5603cc91963..b0cc60f1c458 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -998,6 +998,8 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
>  
>  asmlinkage long sys_uretprobe(void);
>  
> +asmlinkage long sys_uprobe(void);
> +
>  /* pciconfig: alpha, arm, arm64, ia64, sparc */
>  asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
>  				unsigned long off, unsigned long len,
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b40d33aae016..b6b077cc7d0f 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -239,6 +239,7 @@ extern unsigned long uprobe_get_trampoline_vaddr(void);
>  extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
>  extern void arch_uprobe_clear_state(struct mm_struct *mm);
>  extern void arch_uprobe_init_state(struct mm_struct *mm);
> +extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index acec91a676b7..cbba31c0495f 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -2772,6 +2772,23 @@ static void handle_swbp(struct pt_regs *regs)
>  	rcu_read_unlock_trace();
>  }
>  
> +void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr)
> +{
> +	struct uprobe *uprobe;
> +	int is_swbp;
> +
> +	guard(rcu_tasks_trace)();
> +
> +	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
> +	if (!uprobe)
> +		return;
> +	if (!get_utask())
> +		return;
> +	if (arch_uprobe_ignore(&uprobe->arch, regs))
> +		return;
> +	handler_chain(uprobe, regs);
> +}
> +
>  /*
>   * Perform required fix-ups and disable singlestep.
>   * Allow pending signals to take effect.
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index c00a86931f8c..bf5d05c635ff 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -392,3 +392,4 @@ COND_SYSCALL(setuid16);
>  COND_SYSCALL(rseq);
>  
>  COND_SYSCALL(uretprobe);
> +COND_SYSCALL(uprobe);
> -- 
> 2.50.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


Thread overview: 42+ messages
2025-07-11  8:29 [PATCHv5 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 01/22] uprobes: Remove breakpoint in unapply_uprobe under mmap_write_lock Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 02/22] uprobes: Rename arch_uretprobe_trampoline function Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 03/22] uprobes: Make copy_from_page global Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 04/22] uprobes: Add uprobe_write function Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 05/22] uprobes: Add nbytes argument to uprobe_write Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 06/22] uprobes: Add is_register argument to uprobe_write and uprobe_write_opcode Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 07/22] uprobes: Add do_ref_ctr argument to uprobe_write function Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 08/22] uprobes/x86: Add mapping for optimized uprobe trampolines Jiri Olsa
2025-07-11 17:46   ` Oleg Nesterov
2025-07-11 19:36     ` Jiri Olsa
2025-07-14  7:23   ` Masami Hiramatsu
2025-07-11  8:29 ` [PATCHv5 perf/core 09/22] uprobes/x86: Add uprobe syscall to speed up uprobe Jiri Olsa
2025-07-14  8:39   ` Masami Hiramatsu [this message]
2025-07-14  9:28     ` Peter Zijlstra
2025-07-14 21:29       ` Jiri Olsa
2025-07-14  9:39     ` Peter Zijlstra
2025-07-14 10:19       ` Masami Hiramatsu
2025-07-14 21:28         ` Jiri Olsa
2025-07-14 23:54           ` Masami Hiramatsu
2025-07-15 12:16             ` Jiri Olsa
2025-07-16  2:39               ` Masami Hiramatsu
2025-07-11  8:29 ` [PATCHv5 perf/core 10/22] uprobes/x86: Add support to optimize uprobes Jiri Olsa
2025-07-14  9:48   ` Peter Zijlstra
2025-07-14 21:29     ` Jiri Olsa
2025-07-17 15:29       ` Jiri Olsa
2025-07-14 10:13   ` Masami Hiramatsu
2025-07-14 21:29     ` Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 11/22] selftests/bpf: Import usdt.h from libbpf/usdt project Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 12/22] selftests/bpf: Reorg the uprobe_syscall test function Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 13/22] selftests/bpf: Rename uprobe_syscall_executed prog to test_uretprobe_multi Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 14/22] selftests/bpf: Add uprobe/usdt syscall tests Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 15/22] selftests/bpf: Add hit/attach/detach race optimized uprobe test Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 16/22] selftests/bpf: Add uprobe syscall sigill signal test Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 17/22] selftests/bpf: Add optimized usdt variant for basic usdt test Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 18/22] selftests/bpf: Add uprobe_regs_equal test Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 19/22] selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 20/22] seccomp: passthrough uprobe systemcall without filtering Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 perf/core 21/22] selftests/seccomp: validate uprobe syscall passes through seccomp Jiri Olsa
2025-07-11  8:29 ` [PATCHv5 22/22] man2: Add uprobe syscall page Jiri Olsa
2025-07-14 14:04   ` Masami Hiramatsu
2025-07-11 17:17 ` [PATCHv5 perf/core 00/22] uprobes: Add support to optimize usdt probes on x86_64 Andrii Nakryiko
