[PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization

BPF List
 help / color / mirror / Atom feed

* [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
@ 2026-05-09  0:30 Andrii Nakryiko
  2026-05-09  0:30 ` [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-09  0:30 UTC (permalink / raw)
  To: bpf
  Cc: linux-trace-kernel, jolsa, oleg, peterz, mingo, mhiramat,
	Andrii Nakryiko

The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
probe site with a CALL into a uprobe trampoline. CALL pushes a return
address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
user code may keep temporary data without adjusting rsp.

Use a 5-byte JMP instead. JMP does not write to the user stack, but it
also does not provide a return address. Replace the single trampoline
entry with a page of 16-byte slots. Each optimized probe jumps to its
assigned slot, the slot moves rsp below the red zone, saves the registers
clobbered by syscall, and invokes the uprobe syscall:

  Probe site:   jmp slot_N              (5B, replaces nop5)

  Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
                push %rcx               (1B)  save (syscall clobbers)
                push %r11               (2B)  save (syscall clobbers)
                push %rax               (1B)  save (syscall uses for nr)
                mov  $336, %eax         (5B)  uprobe syscall number
                syscall                 (2B)

All slots contain identical code at different offsets, so the trampoline
page is generated once at boot and mapped read-execute into each process.
The syscall handler identifies the slot from regs->ip, which points just
after the syscall instruction, and uses a per-mm slot table to recover the
original probe address.

The uprobe syscall does not return to the trampoline slot. The handler
restores the probe-site register state, runs the uprobe consumers, sets
pt_regs to continue at probe_addr + 5 unless a consumer redirected
execution, and returns directly through the IRET path. This preserves
general purpose registers, including rcx and r11, without requiring any
post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
shadow stack concerns.

Protect the per-mm trampoline list with RCU and free trampoline metadata
with kfree_rcu(). This lets the syscall path resolve trampoline slots
without taking mmap_lock. The optimized-instruction detection path also
walks the trampoline list under an RCU read-side lock. Since that path
starts from the JMP target, it translates the slot start to the post-syscall
IP expected by the shared resolver before checking the trampoline mapping.

Each trampoline page provides 256 slots. Slots stay permanently assigned
to their first probe address and are reused only when the same address is
probed again. Reassigning detached slots is deliberately avoided because a
thread can remain in a trampoline for an unbounded time due to ptrace,
interrupts, or scheduling delays. If a reachable trampoline page runs out
of slots, probes that cannot allocate a slot fall back to the slower INT3
path.

Require the entire trampoline page to be reachable by a rel32 JMP before
reusing it for a probe. This keeps every slot in the page within the range
that can be encoded at the probe site.

Change the error code returned when the uprobe syscall is invoked outside
a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
similar libraries distinguish fixed kernels from kernels with the
red-zone-clobbering implementation and enable nop5 optimization only on
fixed kernels.

Performance (usdt single-thread, M/s):

                  usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
  Skylake          3.149        6.422          4.865         -24.3%     39.1%
  Milan            2.910        3.443          3.820         +11.0%     24.3%
  Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
  Bergamo          3.393        3.895          3.849          -1.2%     24.5%

The fixed nop5 path remains faster than the non-optimized INT3 path on all
measured systems. The regression relative to the old CALL-based trampoline
comes from IRET being more expensive than SYSRET, most noticeably on older
Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
and AMD Milan improves because removing mmap_lock from the hot path more
than offsets the IRET cost.

Multi-threaded throughput scales nearly linearly with the number of CPUs, like
it used to, thanks to lockless RCU-protected uprobe trampoline lookup.

Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 arch/x86/include/asm/uprobes.h                |  18 ++
 arch/x86/kernel/uprobes.c                     | 262 ++++++++++--------
 tools/lib/bpf/features.c                      |   8 +-
 .../selftests/bpf/prog_tests/uprobe_syscall.c |   5 +-
 tools/testing/selftests/bpf/prog_tests/usdt.c |   2 +-
 5 files changed, 181 insertions(+), 114 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 362210c79998..a7cf5c92d95a 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -25,6 +25,24 @@ enum {
 	ARCH_UPROBE_FLAG_OPTIMIZE_FAIL  = 1,
 };
 
+/*
+ * Trampoline page layout: identical 16-byte slots, each containing:
+ *   lea  -128(%rsp), %rsp (5B)  skip red zone
+ *   push %rcx             (1B)  save (syscall clobbers)
+ *   push %r11             (2B)  save (syscall clobbers)
+ *   push %rax             (1B)  save (syscall uses for nr)
+ *   mov  $336, %eax       (5B)  uprobe syscall number
+ *   syscall               (2B)
+ *                        = 16B, no padding needed
+ *
+ * The handler identifies which probe fired from regs->ip (each
+ * slot is at a unique offset), looks up the probe address from a
+ * per-process table, and returns directly to probe_addr+5 via iret
+ * with all registers restored.
+ */
+#define UPROBE_TRAMP_SLOT_SIZE	16
+#define UPROBE_TRAMP_MAX_SLOTS	(PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
+
 struct uprobe_xol_ops;
 
 struct arch_uprobe {
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..7e1f14200bbb 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
 
 struct uprobe_trampoline {
 	struct hlist_node	node;
+	struct rcu_head		rcu;
 	unsigned long		vaddr;
+	unsigned long		probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
 };
 
-static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
+
+static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
 {
-	long delta = (long)(vaddr + 5 - vtramp);
+	long delta = (long)(dst - (src + JMP32_INSN_SIZE));
 
 	return delta >= INT_MIN && delta <= INT_MAX;
 }
 
+static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
+{
+	return is_reachable_by_jmp(vtramp, vaddr) &&
+	       is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
+}
+
 static unsigned long find_nearest_trampoline(unsigned long vaddr)
 {
 	struct vm_unmapped_area_info info = {
@@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
 	return tramp;
 }
 
+static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
+{
+	int i;
+
+	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
+		if (tramp->probe_addrs[i] == probe_addr)
+			return i;
+		if (tramp->probe_addrs[i] == 0) {
+			tramp->probe_addrs[i] = probe_addr;
+			return i;
+		}
+	}
+	return -ENOSPC;
+}
+
 static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
 {
 	struct uprobes_state *state = &current->mm->uprobes_state;
@@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
 		return NULL;
 
 	hlist_for_each_entry(tramp, &state->head_tramps, node) {
-		if (is_reachable_by_call(tramp->vaddr, vaddr)) {
+		if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
 			*new = false;
 			return tramp;
 		}
@@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
 		return NULL;
 
 	*new = true;
-	hlist_add_head(&tramp->node, &state->head_tramps);
+	hlist_add_head_rcu(&tramp->node, &state->head_tramps);
 	return tramp;
 }
 
@@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
 	 * because there's no easy way to make sure none of the threads
 	 * is still inside the trampoline.
 	 */
-	hlist_del(&tramp->node);
-	kfree(tramp);
+	hlist_del_rcu(&tramp->node);
+	kfree_rcu(tramp, rcu);
 }
 
 void arch_uprobe_init_state(struct mm_struct *mm)
@@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
 		destroy_uprobe_trampoline(tramp);
 }
 
-static bool __in_uprobe_trampoline(unsigned long ip)
+/*
+ * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
+ * resolve the slot index from @ip and return the probe address.
+ *
+ * @ip is expected to point right after the syscall instruction, i.e.,
+ * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
+ */
+static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
 {
-	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+	struct uprobes_state *state = &current->mm->uprobes_state;
+	struct uprobe_trampoline *tramp;
 
-	return vma && vma_is_special_mapping(vma, &tramp_mapping);
-}
+	hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
+		/*
+		 * ip points to after syscall, so it's on 16 byte boundary,
+		 * which means that valid ip can point right after the page
+		 * and should never be at zero offset within the page
+		 */
+		if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
+			continue;
 
-static bool in_uprobe_trampoline(unsigned long ip)
-{
-	struct mm_struct *mm = current->mm;
-	bool found, retry = true;
-	unsigned int seq;
+		if (probe_addr) {
+			/* we already validated ip is within expected range */
+			unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
+			unsigned long addr = tramp->probe_addrs[slot];
 
-	rcu_read_lock();
-	if (mmap_lock_speculate_try_begin(mm, &seq)) {
-		found = __in_uprobe_trampoline(ip);
-		retry = mmap_lock_speculate_retry(mm, seq);
-	}
-	rcu_read_unlock();
+			*probe_addr = addr;
+			if (addr == 0)
+				return false;
+		}
 
-	if (retry) {
-		mmap_read_lock(mm);
-		found = __in_uprobe_trampoline(ip);
-		mmap_read_unlock(mm);
+		return true;
 	}
-	return found;
+	return false;
+}
+
+static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
+{
+	guard(rcu)();
+	return resolve_uprobe_addr(ip, probe_addr);
 }
 
 /*
- * See uprobe syscall trampoline; the call to the trampoline will push
- * the return address on the stack, the trampoline itself then pushes
- * cx, r11 and ax.
+ * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
+ * before doing the uprobe syscall. No return address is pushed — the
+ * probe site uses jmp, not call.
  */
 struct uprobe_syscall_args {
 	unsigned long ax;
 	unsigned long r11;
 	unsigned long cx;
-	unsigned long retaddr;
 };
 
+#define UPROBE_TRAMP_REDZONE 128
+
 SYSCALL_DEFINE0(uprobe)
 {
 	struct pt_regs *regs = task_pt_regs(current);
 	struct uprobe_syscall_args args;
-	unsigned long ip, sp, sret;
+	unsigned long probe_addr;
 	int err;
 
 	/* Allow execution only from uprobe trampolines. */
-	if (!in_uprobe_trampoline(regs->ip))
-		return -ENXIO;
+	if (!in_uprobe_trampoline(regs->ip, &probe_addr))
+		return -EPROTO;
 
 	err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
 	if (err)
 		goto sigill;
 
-	ip = regs->ip;
-
 	/*
-	 * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
-	 * - adjust ip to the probe address, call saved next instruction address
-	 * - adjust sp to the probe's stack frame (check trampoline code)
+	 * Restore the register state as it was at the probe site:
+	 * - ax/r11/cx from the trampoline-saved copies on user stack
+	 * - adjust ip to the probe address based on matching slot
+	 * - adjust sp to skip red zone and pushed args
 	 */
 	regs->ax  = args.ax;
 	regs->r11 = args.r11;
 	regs->cx  = args.cx;
-	regs->ip  = args.retaddr - 5;
-	regs->sp += sizeof(args);
+	regs->ip  = probe_addr;
+	regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
 	regs->orig_ax = -1;
 
-	sp = regs->sp;
-
-	err = shstk_pop((u64 *)&sret);
-	if (err == -EFAULT || (!err && sret != args.retaddr))
-		goto sigill;
-
-	handle_syscall_uprobe(regs, regs->ip);
+	handle_syscall_uprobe(regs, probe_addr);
 
 	/*
-	 * Some of the uprobe consumers has changed sp, we can do nothing,
-	 * just return via iret.
+	 * Skip the jmp instruction at the probe site (5 bytes) unless
+	 * a consumer redirected execution elsewhere.
 	 */
-	if (regs->sp != sp) {
-		/* skip the trampoline call */
-		if (args.retaddr - 5 == regs->ip)
-			regs->ip += 5;
-		return regs->ax;
-	}
+	if (regs->ip == probe_addr)
+		regs->ip = probe_addr + 5;
 
-	regs->sp -= sizeof(args);
-
-	/* for the case uprobe_consumer has changed ax/r11/cx */
-	args.ax  = regs->ax;
-	args.r11 = regs->r11;
-	args.cx  = regs->cx;
-
-	/* keep return address unless we are instructed otherwise */
-	if (args.retaddr - 5 != regs->ip)
-		args.retaddr = regs->ip;
-
-	if (shstk_push(args.retaddr) == -EFAULT)
-		goto sigill;
-
-	regs->ip = ip;
-
-	err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
-	if (err)
-		goto sigill;
-
-	/* ensure sysret, see do_syscall_64() */
-	regs->r11 = regs->flags;
-	regs->cx  = regs->ip;
-	return 0;
+	/*
+	 * Return via iret by returning regs->ax. This preserves all
+	 * GP registers (including cx and r11) without needing any
+	 * user-space cleanup code. The iret path is used because we
+	 * don't set up cx/r11 for sysret.
+	 */
+	return regs->ax;
 
 sigill:
 	force_sig(SIGILL);
 	return -1;
 }
 
+/*
+ * All uprobe trampoline slots are identical: skip the red zone,
+ * save the three registers that syscall clobbers, then invoke
+ * the uprobe syscall. The handler returns directly to the probe
+ * caller via iret. Execution never returns to the trampoline.
+ */
 asm (
 	".pushsection .rodata\n"
-	".balign " __stringify(PAGE_SIZE) "\n"
-	"uprobe_trampoline_entry:\n"
+	".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
+	"uprobe_trampoline_slot:\n"
+	"lea -128(%rsp), %rsp\n"
 	"push %rcx\n"
 	"push %r11\n"
 	"push %rax\n"
-	"mov $" __stringify(__NR_uprobe) ", %rax\n"
+	"mov $" __stringify(__NR_uprobe) ", %eax\n"
 	"syscall\n"
-	"pop %rax\n"
-	"pop %r11\n"
-	"pop %rcx\n"
-	"ret\n"
-	"int3\n"
-	".balign " __stringify(PAGE_SIZE) "\n"
+	"uprobe_trampoline_slot_end:\n"
 	".popsection\n"
 );
 
-extern u8 uprobe_trampoline_entry[];
+extern u8 uprobe_trampoline_slot[];
+extern u8 uprobe_trampoline_slot_end[];
 
 static int __init arch_uprobes_init(void)
 {
-	tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
+	unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
+	struct page *page;
+	u8 *page_addr;
+	int i;
+
+	BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
+	WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	page_addr = page_address(page);
+	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
+		memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
+
+	tramp_mapping_pages[0] = page;
 	return 0;
 }
 
@@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
 
 enum {
 	EXPECT_SWBP,
-	EXPECT_CALL,
+	EXPECT_JMP,
 };
 
 struct write_opcode_ctx {
@@ -917,14 +947,14 @@ struct write_opcode_ctx {
 	int expect;
 };
 
-static int is_call_insn(uprobe_opcode_t *insn)
+static int is_jmp_insn(uprobe_opcode_t *insn)
 {
-	return *insn == CALL_INSN_OPCODE;
+	return *insn == JMP32_INSN_OPCODE;
 }
 
 /*
  * Verification callback used by int3_update uprobe_write calls to make sure
- * the underlying instruction is as expected - either int3 or call.
+ * the underlying instruction is as expected - either int3 or jmp.
  */
 static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
 		       int nbytes, void *data)
@@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 		if (is_swbp_insn(&old_opcode[0]))
 			return 1;
 		break;
-	case EXPECT_CALL:
-		if (is_call_insn(&old_opcode[0]))
+	case EXPECT_JMP:
+		if (is_jmp_insn(&old_opcode[0]))
 			return 1;
 		break;
 	}
@@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 	 * so we can skip this step for optimize == true.
 	 */
 	if (!optimize) {
-		ctx.expect = EXPECT_CALL;
+		ctx.expect = EXPECT_JMP;
 		err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
 				   true /* is_register */, false /* do_update_ref_ctr */,
 				   &ctx);
@@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 }
 
 static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
-			 unsigned long vaddr, unsigned long tramp)
+			 unsigned long vaddr, unsigned long slot_vaddr)
 {
-	u8 call[5];
+	u8 jmp[5];
 
-	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
-			(const void *) tramp, CALL_INSN_SIZE);
-	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+	__text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
+			(const void *) slot_vaddr, JMP32_INSN_SIZE);
+	return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
 }
 
 static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
@@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
 	struct __packed __arch_relative_insn {
 		u8 op;
 		s32 raddr;
-	} *call = (struct __arch_relative_insn *) insn;
+	} *jmp = (struct __arch_relative_insn *) insn;
 
-	if (!is_call_insn(insn))
+	if (!is_jmp_insn(&jmp->op))
 		return false;
-	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+
+	guard(rcu)();
+	/*
+	 * resolve_uprobe_addr() expects IP pointing after syscall instruction
+	 * (after the slot, basically), so adjust jump target address accordingly
+	 */
+	return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
 }
 
 static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
@@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
 {
 	struct uprobe_trampoline *tramp;
 	struct vm_area_struct *vma;
+	unsigned long slot_vaddr;
 	bool new = false;
-	int err = 0;
+	int slot, err;
 
 	vma = find_vma(mm, vaddr);
 	if (!vma)
@@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
 	tramp = get_uprobe_trampoline(vaddr, &new);
 	if (!tramp)
 		return -EINVAL;
-	err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
-	if (WARN_ON_ONCE(err) && new)
+
+	slot = tramp_alloc_slot(tramp, vaddr);
+	if (slot < 0) {
+		if (new)
+			destroy_uprobe_trampoline(tramp);
+		return slot;
+	}
+
+	slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
+	err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
+	if (err && new)
 		destroy_uprobe_trampoline(tramp);
 	return err;
 }
diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
index 4f19a0d79b0c..1b6c113357b2 100644
--- a/tools/lib/bpf/features.c
+++ b/tools/lib/bpf/features.c
@@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
 static int probe_uprobe_syscall(int token_fd)
 {
 	/*
-	 * If kernel supports uprobe() syscall, it will return -ENXIO when called
-	 * from the outside of a kernel-generated uprobe trampoline.
+	 * If kernel supports uprobe() syscall, it will return -EPROTO when
+	 * called from outside a kernel-generated uprobe trampoline.
+	 * Older kernels with the red-zone-clobbering bug return -ENXIO;
+	 * we only enable the nop5 optimization on fixed kernels.
 	 */
-	return syscall(__NR_uprobe) < 0 && errno == ENXIO;
+	return syscall(__NR_uprobe) < 0 && errno == EPROTO;
 }
 #else
 static int probe_uprobe_syscall(int token_fd)
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 955a37751b52..0d5eb4cd1ddf 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
 	/* .. and check the trampoline is as expected. */
 	call = (struct __arch_relative_insn *) addr;
 	tramp = (void *) (call + 1) + call->raddr;
-	ASSERT_EQ(call->op, 0xe8, "call");
+	tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
+	ASSERT_EQ(call->op, 0xe9, "jmp");
 	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
 
 	return tramp;
@@ -762,7 +763,7 @@ static void test_uprobe_error(void)
 	long err = syscall(__NR_uprobe);
 
 	ASSERT_EQ(err, -1, "error");
-	ASSERT_EQ(errno, ENXIO, "errno");
+	ASSERT_EQ(errno, EPROTO, "errno");
 }
 
 static void __test_uprobe_syscall(void)
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 69759b27794d..9d3744d4e936 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
 	ASSERT_EQ(*addr_2, 0x90, "nop");
 
 	/* call is on addr_2 + 1 address */
-	ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
+	ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
 	ASSERT_EQ(skel->bss->executed, 4, "executed");
 
 cleanup:
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering
  2026-05-09  0:30 [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization Andrii Nakryiko
@ 2026-05-09  0:30 ` Andrii Nakryiko
  2026-05-09  2:12   ` sashiko-bot
  2026-05-09  2:02 ` [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization sashiko-bot
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-09  0:30 UTC (permalink / raw)
  To: bpf
  Cc: linux-trace-kernel, jolsa, oleg, peterz, mingo, mhiramat,
	Andrii Nakryiko

The uprobe nop5 optimization used to replace a 5-byte NOP with a 5-byte
CALL to a trampoline. The CALL pushes a return address onto the stack at
[rsp-8], clobbering whatever was stored there.

On x86-64, the red zone is the 128 bytes below rsp that user code may use
for temporary storage without adjusting rsp. Compilers can place USDT
argument operands there, generating specs like "8@-8(%rbp)" when rbp ==
rsp. With the CALL-based optimization, the return address overwrites that
argument before the BPF-side USDT argument fetch runs.

Add two tests for this case. The uprobe_syscall subtest stores known values
at -8(%rsp), -16(%rsp), and -24(%rsp), executes an optimized nop5 uprobe,
and verifies the red-zone data is still intact. The USDT subtest triggers a
probe in a function where the compiler places three USDT operands in the
red zone and verifies that all 10 optimized invocations deliver the expected
argument values to BPF.

On an unfixed kernel, the first hit goes through the INT3 path and later
hits use the optimized CALL path, so the red-zone checks fail after
optimization.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 75 ++++++++++++++++++-
 tools/testing/selftests/bpf/prog_tests/usdt.c | 46 ++++++++++++
 tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++++
 tools/testing/selftests/bpf/usdt_2.c          | 13 ++++
 4 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 0d5eb4cd1ddf..6c651e4ff49a 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -357,6 +357,46 @@ __nocf_check __weak void usdt_test(void)
 	USDT(optimized_uprobe, usdt);
 }
 
+/*
+ * Assembly-level red zone clobbering test. Stores known values in the
+ * red zone (below RSP), executes a nop5 (uprobe site), and checks that
+ * the values survived. Returns 0 if intact, 1 if clobbered.
+ *
+ * If the nop5 optimization uses CALL (which pushes a return address to
+ * [rsp-8]), the value at -8(%rsp) gets overwritten.
+ */
+__attribute__((aligned(16)))
+__nocf_check __weak __naked unsigned long uprobe_red_zone_test(void)
+{
+	asm volatile (
+		"movabs $0x1111111111111111, %%rax\n"
+		"movq   %%rax, -8(%%rsp)\n"
+		"movabs $0x2222222222222222, %%rax\n"
+		"movq   %%rax, -16(%%rsp)\n"
+		"movabs $0x3333333333333333, %%rax\n"
+		"movq   %%rax, -24(%%rsp)\n"
+
+		".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n" /* nop5: uprobe site */
+
+		"movabs $0x1111111111111111, %%rax\n"
+		"cmpq   %%rax, -8(%%rsp)\n"
+		"jne    1f\n"
+		"movabs $0x2222222222222222, %%rax\n"
+		"cmpq   %%rax, -16(%%rsp)\n"
+		"jne    1f\n"
+		"movabs $0x3333333333333333, %%rax\n"
+		"cmpq   %%rax, -24(%%rsp)\n"
+		"jne    1f\n"
+
+		"xorl   %%eax, %%eax\n"
+		"retq\n"
+		"1:\n"
+		"movl   $1, %%eax\n"
+		"retq\n"
+		::: "rax", "memory"
+	);
+}
+
 static int find_uprobes_trampoline(void *tramp_addr)
 {
 	void *start, *end;
@@ -394,7 +434,7 @@ static void *find_nop5(void *fn)
 {
 	int i;
 
-	for (i = 0; i < 10; i++) {
+	for (i = 0; i < 128; i++) {
 		if (!memcmp(nop5, fn + i, 5))
 			return fn + i;
 	}
@@ -758,6 +798,37 @@ static void test_uprobe_race(void)
 #define __NR_uprobe 336
 #endif
 
+static void test_uprobe_red_zone(void)
+{
+	struct uprobe_syscall_executed *skel;
+	struct bpf_link *link;
+	void *nop5_addr;
+	size_t offset;
+	int i;
+
+	nop5_addr = find_nop5(uprobe_red_zone_test);
+	if (!ASSERT_NEQ(nop5_addr, NULL, "find_nop5"))
+		return;
+
+	skel = uprobe_syscall_executed__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	offset = get_uprobe_offset(nop5_addr);
+	link = bpf_program__attach_uprobe_opts(skel->progs.test_uprobe,
+			0, "/proc/self/exe", offset, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_uprobe"))
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		ASSERT_EQ(uprobe_red_zone_test(), 0, "red_zone_intact");
+
+	bpf_link__destroy(link);
+
+cleanup:
+	uprobe_syscall_executed__destroy(skel);
+}
+
 static void test_uprobe_error(void)
 {
 	long err = syscall(__NR_uprobe);
@@ -784,6 +855,8 @@ static void __test_uprobe_syscall(void)
 		test_uprobe_usdt();
 	if (test__start_subtest("uprobe_race"))
 		test_uprobe_race();
+	if (test__start_subtest("uprobe_red_zone"))
+		test_uprobe_red_zone();
 	if (test__start_subtest("uprobe_error"))
 		test_uprobe_error();
 	if (test__start_subtest("uprobe_regs_equal"))
diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
index 9d3744d4e936..5e607773d5cc 100644
--- a/tools/testing/selftests/bpf/prog_tests/usdt.c
+++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
@@ -250,6 +250,7 @@ static void subtest_basic_usdt(bool optimized)
 #ifdef __x86_64__
 extern void usdt_1(void);
 extern void usdt_2(void);
+extern void usdt_red_zone_trigger(void);
 
 static unsigned char nop1[1] = { 0x90 };
 static unsigned char nop1_nop5_combo[6] = { 0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
@@ -335,6 +336,49 @@ static void subtest_optimized_attach(void)
 cleanup:
 	test_usdt__destroy(skel);
 }
+
+/*
+ * Test that USDT arguments survive nop5 optimization in a function where
+ * the compiler places operands in the red zone.
+ *
+ * Signal handlers are prone to having the compiler place USDT argument
+ * operands in the red zone (below rsp). When nop5 is optimized to a
+ * call instruction, the call pushes the return address to [rsp-8],
+ * potentially clobbering the argument.
+ */
+static void subtest_optimized_red_zone(void)
+{
+	struct test_usdt *skel;
+	int i;
+
+	skel = test_usdt__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	skel->bss->expected_arg[0] = 0xDEADBEEF;
+	skel->bss->expected_arg[1] = 0xCAFEBABE;
+	skel->bss->expected_arg[2] = 0xFEEDFACE;
+	skel->bss->expected_pid = getpid();
+
+	skel->links.usdt_check_arg = bpf_program__attach_usdt(
+		skel->progs.usdt_check_arg, 0, "/proc/self/exe",
+		"optimized_attach", "usdt_red_zone", NULL);
+	if (!ASSERT_OK_PTR(skel->links.usdt_check_arg, "attach_usdt_red_zone"))
+		goto cleanup;
+
+	for (i = 0; i < 10; i++)
+		usdt_red_zone_trigger();
+
+	ASSERT_EQ(skel->bss->arg_total, 10, "arg_total");
+	ASSERT_EQ(skel->bss->arg_bad, 0, "arg_bad");
+	ASSERT_EQ(skel->bss->arg_last[0], 0xDEADBEEF, "arg_last_1");
+	ASSERT_EQ(skel->bss->arg_last[1], 0xCAFEBABE, "arg_last_2");
+	ASSERT_EQ(skel->bss->arg_last[2], 0xFEEDFACE, "arg_last_3");
+
+cleanup:
+	test_usdt__destroy(skel);
+}
+
 #endif
 
 unsigned short test_usdt_100_semaphore SEC(".probes");
@@ -608,6 +652,8 @@ void test_usdt(void)
 		subtest_basic_usdt(true);
 	if (test__start_subtest("optimized_attach"))
 		subtest_optimized_attach();
+	if (test__start_subtest("optimized_red_zone"))
+		subtest_optimized_red_zone();
 #endif
 	if (test__start_subtest("multispec"))
 		subtest_multispec_usdt();
diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
index f00cb52874e0..0ee78fb050a1 100644
--- a/tools/testing/selftests/bpf/progs/test_usdt.c
+++ b/tools/testing/selftests/bpf/progs/test_usdt.c
@@ -149,5 +149,30 @@ int usdt_executed(struct pt_regs *ctx)
 		executed++;
 	return 0;
 }
+
+int arg_total;
+int arg_bad;
+long arg_last[3];
+long expected_arg[3];
+int expected_pid;
+
+SEC("usdt")
+int BPF_USDT(usdt_check_arg, long arg1, long arg2, long arg3)
+{
+	if (expected_pid != (bpf_get_current_pid_tgid() >> 32))
+		return 0;
+
+	__sync_fetch_and_add(&arg_total, 1);
+	arg_last[0] = arg1;
+	arg_last[1] = arg2;
+	arg_last[2] = arg3;
+
+	if (arg1 != expected_arg[0] ||
+	    arg2 != expected_arg[1] ||
+	    arg3 != expected_arg[2])
+		__sync_fetch_and_add(&arg_bad, 1);
+
+	return 0;
+}
 #endif
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
index 789883aaca4c..fc7e6d220a38 100644
--- a/tools/testing/selftests/bpf/usdt_2.c
+++ b/tools/testing/selftests/bpf/usdt_2.c
@@ -13,4 +13,17 @@ void usdt_2(void)
 	USDT(optimized_attach, usdt_2);
 }
 
+static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
+static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
+static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
+
+void __attribute__((noinline)) usdt_red_zone_trigger(void)
+{
+	unsigned long a1 = usdt_red_zone_arg1;
+	unsigned long a2 = usdt_red_zone_arg2;
+	unsigned long a3 = usdt_red_zone_arg3;
+
+	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
+}
+
 #endif
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-09  0:30 [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization Andrii Nakryiko
  2026-05-09  0:30 ` [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering Andrii Nakryiko
@ 2026-05-09  2:02 ` sashiko-bot
  2026-05-11 16:38   ` Andrii Nakryiko
  2026-05-10 21:25 ` Jiri Olsa
  2026-05-11 14:45 ` Oleg Nesterov
  3 siblings, 1 reply; 13+ messages in thread
From: sashiko-bot @ 2026-05-09  2:02 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] `resolve_uprobe_addr()` unconditionally uses `current->mm`, causing `is_optimized()` to fail when called from tracer context, which leads to target instruction stream corruption and potential kernel panics.
- [High] Exhausting a process's trampoline page slots permanently disables nop5 optimization for that uprobe across the entire system.
--

commit 2c63c19710be8b050bc0e126d84e52c24209e062
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sat May 9 00:30:56 2026 +0000

uprobes/x86: Fix red zone clobbering in nop5 optimization

The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
probe site with a CALL into a uprobe trampoline. This commit changes the
optimization to use JMP instead, introducing a trampoline page with slots
to handle register state saving and jumping to the syscall.

> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d0..7e1f14200bbbf 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c

[ ... ]

> @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
[ ... ]
> +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
>  {
> -	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +	struct uprobes_state *state = &current->mm->uprobes_state;

Can this cause a NULL pointer dereference or instruction corruption when
executed outside the target task's context?

resolve_uprobe_addr() unconditionally uses current->mm. However, it is
called by __is_optimized() via is_optimized() during uprobe registration
and unregistration.

If a tracer closes a BPF link file descriptor, triggering uprobe_unregister()
in the tracer's context, current->mm will point to the tracer rather than the
target. This causes resolve_uprobe_addr() to fail, making is_optimized()
return false. As a result, set_orig_insn() might write only 1 byte over
the 5-byte JMP instruction, leaving 4 garbage bytes in the target's
instruction stream.

Furthermore, if unregistration runs from a kernel workqueue, current->mm
is NULL, and dereferencing current->mm->uprobes_state will cause a kernel
panic.

Could we pass the correct mm struct down from is_optimized() to
resolve_uprobe_addr() instead of relying on current->mm?

[ ... ]

> @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
>  {
>  	struct uprobe_trampoline *tramp;
>  	struct vm_area_struct *vma;
> +	unsigned long slot_vaddr;
>  	bool new = false;
> -	int err = 0;
> +	int slot, err;
>  
>  	vma = find_vma(mm, vaddr);
>  	if (!vma)
> @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
>  	tramp = get_uprobe_trampoline(vaddr, &new);
>  	if (!tramp)
>  		return -EINVAL;
> -	err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> -	if (WARN_ON_ONCE(err) && new)
> +
> +	slot = tramp_alloc_slot(tramp, vaddr);
> +	if (slot < 0) {
> +		if (new)
> +			destroy_uprobe_trampoline(tramp);
> +		return slot;
> +	}

If a process exhausts its trampoline page slots, will this permanently
disable the nop5 optimization for this uprobe across the entire system?

If get_uprobe_trampoline() returns a reachable trampoline that is completely
full (all 256 slots used), tramp_alloc_slot() returns -ENOSPC. This error
propagates up to arch_uprobe_optimize(), which handles failures by setting
the ARCH_UPROBE_FLAG_OPTIMIZE_FAIL bit on auprobe->flags.

Since struct arch_uprobe is shared across all processes tracing the same
inode and offset, hitting the 256-slot limit in any single process seems
to permanently blacklist the probe from being optimized system-wide for
all other processes, even if those processes have completely empty
trampolines.

Should get_uprobe_trampoline() also check if the trampoline has available
slots before returning it, or can the failure be handled without setting
the global failure flag?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii@kernel.org?part=1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering
  2026-05-09  0:30 ` [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering Andrii Nakryiko
@ 2026-05-09  2:12   ` sashiko-bot
  2026-05-11 16:58     ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: sashiko-bot @ 2026-05-09  2:12 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The `optimized_red_zone` USDT subtest is brittle and may produce false positive passes because it relies on the compiler arbitrarily placing operands in the stack's red zone.
--

commit 1fbbd69b145cd2d80be83d6482e65179431782ac
Author: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Add tests for uprobe nop5 red zone clobbering

This commit adds tests to verify that the uprobe nop5 CALL optimization does
not clobber the x86-64 stack red zone. It introduces an assembly-level test
and a USDT subtest which intends to trigger a probe in a function where the
compiler places operands in the red zone.

> diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> index 789883aaca4c3..fc7e6d220a38b 100644
> --- a/tools/testing/selftests/bpf/usdt_2.c
> +++ b/tools/testing/selftests/bpf/usdt_2.c

[ ... ]

> @@ -13,4 +13,17 @@ void usdt_2(void)
>  	USDT(optimized_attach, usdt_2);
>  }
>  
> +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> +
> +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> +{
> +	unsigned long a1 = usdt_red_zone_arg1;
> +	unsigned long a2 = usdt_red_zone_arg2;
> +	unsigned long a3 = usdt_red_zone_arg3;
> +
> +	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);

Is this test reliable across different compiler versions and optimizations?

The x86-64 USDT macro uses the "nor" inline assembly constraint, which allows
the compiler to allocate these operands to CPU registers. Because there is no
register pressure in usdt_red_zone_trigger(), modern compilers might allocate
the local variables a1, a2, and a3 directly to registers rather than the
stack's red zone.

If the operands are placed in registers, the uprobe CALL optimization
clobbering [rsp-8] will not corrupt them. This could cause the test to pass
even on buggy kernels, creating a false positive.

Would it be safer to explicitly force these operands into the red zone
using inline assembly constraints, rather than depending on the compiler's
register allocator?

> +}
> +
>  #endif

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii@kernel.org?part=2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-09  0:30 [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization Andrii Nakryiko
  2026-05-09  0:30 ` [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering Andrii Nakryiko
  2026-05-09  2:02 ` [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization sashiko-bot
@ 2026-05-10 21:25 ` Jiri Olsa
  2026-05-11 16:41   ` Andrii Nakryiko
  2026-05-12  5:14   ` Masami Hiramatsu
  2026-05-11 14:45 ` Oleg Nesterov
  3 siblings, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2026-05-10 21:25 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, linux-trace-kernel, oleg, peterz, mingo, mhiramat

On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> probe site with a CALL into a uprobe trampoline. CALL pushes a return
> address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> user code may keep temporary data without adjusting rsp.
> 
> Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> also does not provide a return address. Replace the single trampoline
> entry with a page of 16-byte slots. Each optimized probe jumps to its
> assigned slot, the slot moves rsp below the red zone, saves the registers
> clobbered by syscall, and invokes the uprobe syscall:
> 
>   Probe site:   jmp slot_N              (5B, replaces nop5)
> 
>   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
>                 push %rcx               (1B)  save (syscall clobbers)
>                 push %r11               (2B)  save (syscall clobbers)
>                 push %rax               (1B)  save (syscall uses for nr)
>                 mov  $336, %eax         (5B)  uprobe syscall number
>                 syscall                 (2B)
> 
> All slots contain identical code at different offsets, so the trampoline
> page is generated once at boot and mapped read-execute into each process.
> The syscall handler identifies the slot from regs->ip, which points just
> after the syscall instruction, and uses a per-mm slot table to recover the
> original probe address.
> 
> The uprobe syscall does not return to the trampoline slot. The handler
> restores the probe-site register state, runs the uprobe consumers, sets
> pt_regs to continue at probe_addr + 5 unless a consumer redirected
> execution, and returns directly through the IRET path. This preserves
> general purpose registers, including rcx and r11, without requiring any
> post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> shadow stack concerns.
> 
> Protect the per-mm trampoline list with RCU and free trampoline metadata
> with kfree_rcu(). This lets the syscall path resolve trampoline slots
> without taking mmap_lock. The optimized-instruction detection path also
> walks the trampoline list under an RCU read-side lock. Since that path
> starts from the JMP target, it translates the slot start to the post-syscall
> IP expected by the shared resolver before checking the trampoline mapping.
> 
> Each trampoline page provides 256 slots. Slots stay permanently assigned
> to their first probe address and are reused only when the same address is
> probed again. Reassigning detached slots is deliberately avoided because a
> thread can remain in a trampoline for an unbounded time due to ptrace,
> interrupts, or scheduling delays. If a reachable trampoline page runs out
> of slots, probes that cannot allocate a slot fall back to the slower INT3
> path.
> 
> Require the entire trampoline page to be reachable by a rel32 JMP before
> reusing it for a probe. This keeps every slot in the page within the range
> that can be encoded at the probe site.
> 
> Change the error code returned when the uprobe syscall is invoked outside
> a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> similar libraries distinguish fixed kernels from kernels with the
> red-zone-clobbering implementation and enable nop5 optimization only on
> fixed kernels.
> 
> Performance (usdt single-thread, M/s):
> 
>                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
>   Skylake          3.149        6.422          4.865         -24.3%     39.1%
>   Milan            2.910        3.443          3.820         +11.0%     24.3%
>   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
>   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> 
> The fixed nop5 path remains faster than the non-optimized INT3 path on all
> measured systems. The regression relative to the old CALL-based trampoline
> comes from IRET being more expensive than SYSRET, most noticeably on older
> Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> and AMD Milan improves because removing mmap_lock from the hot path more
> than offsets the IRET cost.
> 
> Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> it used to, thanks to lockless RCU-protected uprobe trampoline lookup.

hi,
thanks a lot for the fix

FWIW we discussed also an option to have 10-bytes nop and do:
  [rsp+0x80, call trampoline]

we would not need the slots re-use logic, but not sure what other
surprises there are with 10-bytes nop

I tried that change [1], it seems to work, but it has other
difficulties, like I think the unoptimized path needs to do:
  [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
instead of patching back the 10-byte nop, because some thread
could be inside the nop area already.

jirka


[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574

> 
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  arch/x86/include/asm/uprobes.h                |  18 ++
>  arch/x86/kernel/uprobes.c                     | 262 ++++++++++--------
>  tools/lib/bpf/features.c                      |   8 +-
>  .../selftests/bpf/prog_tests/uprobe_syscall.c |   5 +-
>  tools/testing/selftests/bpf/prog_tests/usdt.c |   2 +-
>  5 files changed, 181 insertions(+), 114 deletions(-)
> 
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 362210c79998..a7cf5c92d95a 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -25,6 +25,24 @@ enum {
>  	ARCH_UPROBE_FLAG_OPTIMIZE_FAIL  = 1,
>  };
>  
> +/*
> + * Trampoline page layout: identical 16-byte slots, each containing:
> + *   lea  -128(%rsp), %rsp (5B)  skip red zone
> + *   push %rcx             (1B)  save (syscall clobbers)
> + *   push %r11             (2B)  save (syscall clobbers)
> + *   push %rax             (1B)  save (syscall uses for nr)
> + *   mov  $336, %eax       (5B)  uprobe syscall number
> + *   syscall               (2B)
> + *                        = 16B, no padding needed
> + *
> + * The handler identifies which probe fired from regs->ip (each
> + * slot is at a unique offset), looks up the probe address from a
> + * per-process table, and returns directly to probe_addr+5 via iret
> + * with all registers restored.
> + */
> +#define UPROBE_TRAMP_SLOT_SIZE	16
> +#define UPROBE_TRAMP_MAX_SLOTS	(PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
> +
>  struct uprobe_xol_ops;
>  
>  struct arch_uprobe {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index ebb1baf1eb1d..7e1f14200bbb 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
>  
>  struct uprobe_trampoline {
>  	struct hlist_node	node;
> +	struct rcu_head		rcu;
>  	unsigned long		vaddr;
> +	unsigned long		probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
>  };
>  
> -static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> +
> +static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
>  {
> -	long delta = (long)(vaddr + 5 - vtramp);
> +	long delta = (long)(dst - (src + JMP32_INSN_SIZE));
>  
>  	return delta >= INT_MIN && delta <= INT_MAX;
>  }
>  
> +static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
> +{
> +	return is_reachable_by_jmp(vtramp, vaddr) &&
> +	       is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
> +}
> +
>  static unsigned long find_nearest_trampoline(unsigned long vaddr)
>  {
>  	struct vm_unmapped_area_info info = {
> @@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
>  	return tramp;
>  }
>  
> +static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
> +{
> +	int i;
> +
> +	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
> +		if (tramp->probe_addrs[i] == probe_addr)
> +			return i;
> +		if (tramp->probe_addrs[i] == 0) {
> +			tramp->probe_addrs[i] = probe_addr;
> +			return i;
> +		}
> +	}
> +	return -ENOSPC;
> +}
> +
>  static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
>  {
>  	struct uprobes_state *state = &current->mm->uprobes_state;
> @@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
>  		return NULL;
>  
>  	hlist_for_each_entry(tramp, &state->head_tramps, node) {
> -		if (is_reachable_by_call(tramp->vaddr, vaddr)) {
> +		if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
>  			*new = false;
>  			return tramp;
>  		}
> @@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
>  		return NULL;
>  
>  	*new = true;
> -	hlist_add_head(&tramp->node, &state->head_tramps);
> +	hlist_add_head_rcu(&tramp->node, &state->head_tramps);
>  	return tramp;
>  }
>  
> @@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
>  	 * because there's no easy way to make sure none of the threads
>  	 * is still inside the trampoline.
>  	 */
> -	hlist_del(&tramp->node);
> -	kfree(tramp);
> +	hlist_del_rcu(&tramp->node);
> +	kfree_rcu(tramp, rcu);
>  }
>  
>  void arch_uprobe_init_state(struct mm_struct *mm)
> @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
>  		destroy_uprobe_trampoline(tramp);
>  }
>  
> -static bool __in_uprobe_trampoline(unsigned long ip)
> +/*
> + * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
> + * resolve the slot index from @ip and return the probe address.
> + *
> + * @ip is expected to point right after the syscall instruction, i.e.,
> + * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
> + */
> +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
>  {
> -	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +	struct uprobes_state *state = &current->mm->uprobes_state;
> +	struct uprobe_trampoline *tramp;
>  
> -	return vma && vma_is_special_mapping(vma, &tramp_mapping);
> -}
> +	hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
> +		/*
> +		 * ip points to after syscall, so it's on 16 byte boundary,
> +		 * which means that valid ip can point right after the page
> +		 * and should never be at zero offset within the page
> +		 */
> +		if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
> +			continue;
>  
> -static bool in_uprobe_trampoline(unsigned long ip)
> -{
> -	struct mm_struct *mm = current->mm;
> -	bool found, retry = true;
> -	unsigned int seq;
> +		if (probe_addr) {
> +			/* we already validated ip is within expected range */
> +			unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
> +			unsigned long addr = tramp->probe_addrs[slot];
>  
> -	rcu_read_lock();
> -	if (mmap_lock_speculate_try_begin(mm, &seq)) {
> -		found = __in_uprobe_trampoline(ip);
> -		retry = mmap_lock_speculate_retry(mm, seq);
> -	}
> -	rcu_read_unlock();
> +			*probe_addr = addr;
> +			if (addr == 0)
> +				return false;
> +		}
>  
> -	if (retry) {
> -		mmap_read_lock(mm);
> -		found = __in_uprobe_trampoline(ip);
> -		mmap_read_unlock(mm);
> +		return true;
>  	}
> -	return found;
> +	return false;
> +}
> +
> +static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
> +{
> +	guard(rcu)();
> +	return resolve_uprobe_addr(ip, probe_addr);
>  }
>  
>  /*
> - * See uprobe syscall trampoline; the call to the trampoline will push
> - * the return address on the stack, the trampoline itself then pushes
> - * cx, r11 and ax.
> + * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
> + * before doing the uprobe syscall. No return address is pushed — the
> + * probe site uses jmp, not call.
>   */
>  struct uprobe_syscall_args {
>  	unsigned long ax;
>  	unsigned long r11;
>  	unsigned long cx;
> -	unsigned long retaddr;
>  };
>  
> +#define UPROBE_TRAMP_REDZONE 128
> +
>  SYSCALL_DEFINE0(uprobe)
>  {
>  	struct pt_regs *regs = task_pt_regs(current);
>  	struct uprobe_syscall_args args;
> -	unsigned long ip, sp, sret;
> +	unsigned long probe_addr;
>  	int err;
>  
>  	/* Allow execution only from uprobe trampolines. */
> -	if (!in_uprobe_trampoline(regs->ip))
> -		return -ENXIO;
> +	if (!in_uprobe_trampoline(regs->ip, &probe_addr))
> +		return -EPROTO;
>  
>  	err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
>  	if (err)
>  		goto sigill;
>  
> -	ip = regs->ip;
> -
>  	/*
> -	 * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> -	 * - adjust ip to the probe address, call saved next instruction address
> -	 * - adjust sp to the probe's stack frame (check trampoline code)
> +	 * Restore the register state as it was at the probe site:
> +	 * - ax/r11/cx from the trampoline-saved copies on user stack
> +	 * - adjust ip to the probe address based on matching slot
> +	 * - adjust sp to skip red zone and pushed args
>  	 */
>  	regs->ax  = args.ax;
>  	regs->r11 = args.r11;
>  	regs->cx  = args.cx;
> -	regs->ip  = args.retaddr - 5;
> -	regs->sp += sizeof(args);
> +	regs->ip  = probe_addr;
> +	regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
>  	regs->orig_ax = -1;
>  
> -	sp = regs->sp;
> -
> -	err = shstk_pop((u64 *)&sret);
> -	if (err == -EFAULT || (!err && sret != args.retaddr))
> -		goto sigill;
> -
> -	handle_syscall_uprobe(regs, regs->ip);
> +	handle_syscall_uprobe(regs, probe_addr);
>  
>  	/*
> -	 * Some of the uprobe consumers has changed sp, we can do nothing,
> -	 * just return via iret.
> +	 * Skip the jmp instruction at the probe site (5 bytes) unless
> +	 * a consumer redirected execution elsewhere.
>  	 */
> -	if (regs->sp != sp) {
> -		/* skip the trampoline call */
> -		if (args.retaddr - 5 == regs->ip)
> -			regs->ip += 5;
> -		return regs->ax;
> -	}
> +	if (regs->ip == probe_addr)
> +		regs->ip = probe_addr + 5;
>  
> -	regs->sp -= sizeof(args);
> -
> -	/* for the case uprobe_consumer has changed ax/r11/cx */
> -	args.ax  = regs->ax;
> -	args.r11 = regs->r11;
> -	args.cx  = regs->cx;
> -
> -	/* keep return address unless we are instructed otherwise */
> -	if (args.retaddr - 5 != regs->ip)
> -		args.retaddr = regs->ip;
> -
> -	if (shstk_push(args.retaddr) == -EFAULT)
> -		goto sigill;
> -
> -	regs->ip = ip;
> -
> -	err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> -	if (err)
> -		goto sigill;
> -
> -	/* ensure sysret, see do_syscall_64() */
> -	regs->r11 = regs->flags;
> -	regs->cx  = regs->ip;
> -	return 0;
> +	/*
> +	 * Return via iret by returning regs->ax. This preserves all
> +	 * GP registers (including cx and r11) without needing any
> +	 * user-space cleanup code. The iret path is used because we
> +	 * don't set up cx/r11 for sysret.
> +	 */
> +	return regs->ax;
>  
>  sigill:
>  	force_sig(SIGILL);
>  	return -1;
>  }
>  
> +/*
> + * All uprobe trampoline slots are identical: skip the red zone,
> + * save the three registers that syscall clobbers, then invoke
> + * the uprobe syscall. The handler returns directly to the probe
> + * caller via iret. Execution never returns to the trampoline.
> + */
>  asm (
>  	".pushsection .rodata\n"
> -	".balign " __stringify(PAGE_SIZE) "\n"
> -	"uprobe_trampoline_entry:\n"
> +	".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
> +	"uprobe_trampoline_slot:\n"
> +	"lea -128(%rsp), %rsp\n"
>  	"push %rcx\n"
>  	"push %r11\n"
>  	"push %rax\n"
> -	"mov $" __stringify(__NR_uprobe) ", %rax\n"
> +	"mov $" __stringify(__NR_uprobe) ", %eax\n"
>  	"syscall\n"
> -	"pop %rax\n"
> -	"pop %r11\n"
> -	"pop %rcx\n"
> -	"ret\n"
> -	"int3\n"
> -	".balign " __stringify(PAGE_SIZE) "\n"
> +	"uprobe_trampoline_slot_end:\n"
>  	".popsection\n"
>  );
>  
> -extern u8 uprobe_trampoline_entry[];
> +extern u8 uprobe_trampoline_slot[];
> +extern u8 uprobe_trampoline_slot_end[];
>  
>  static int __init arch_uprobes_init(void)
>  {
> -	tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> +	unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
> +	struct page *page;
> +	u8 *page_addr;
> +	int i;
> +
> +	BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
> +	WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	page_addr = page_address(page);
> +	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
> +		memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
> +
> +	tramp_mapping_pages[0] = page;
>  	return 0;
>  }
>  
> @@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
>  
>  enum {
>  	EXPECT_SWBP,
> -	EXPECT_CALL,
> +	EXPECT_JMP,
>  };
>  
>  struct write_opcode_ctx {
> @@ -917,14 +947,14 @@ struct write_opcode_ctx {
>  	int expect;
>  };
>  
> -static int is_call_insn(uprobe_opcode_t *insn)
> +static int is_jmp_insn(uprobe_opcode_t *insn)
>  {
> -	return *insn == CALL_INSN_OPCODE;
> +	return *insn == JMP32_INSN_OPCODE;
>  }
>  
>  /*
>   * Verification callback used by int3_update uprobe_write calls to make sure
> - * the underlying instruction is as expected - either int3 or call.
> + * the underlying instruction is as expected - either int3 or jmp.
>   */
>  static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
>  		       int nbytes, void *data)
> @@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
>  		if (is_swbp_insn(&old_opcode[0]))
>  			return 1;
>  		break;
> -	case EXPECT_CALL:
> -		if (is_call_insn(&old_opcode[0]))
> +	case EXPECT_JMP:
> +		if (is_jmp_insn(&old_opcode[0]))
>  			return 1;
>  		break;
>  	}
> @@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  	 * so we can skip this step for optimize == true.
>  	 */
>  	if (!optimize) {
> -		ctx.expect = EXPECT_CALL;
> +		ctx.expect = EXPECT_JMP;
>  		err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
>  				   true /* is_register */, false /* do_update_ref_ctr */,
>  				   &ctx);
> @@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>  }
>  
>  static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> -			 unsigned long vaddr, unsigned long tramp)
> +			 unsigned long vaddr, unsigned long slot_vaddr)
>  {
> -	u8 call[5];
> +	u8 jmp[5];
>  
> -	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> -			(const void *) tramp, CALL_INSN_SIZE);
> -	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> +	__text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
> +			(const void *) slot_vaddr, JMP32_INSN_SIZE);
> +	return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
>  }
>  
>  static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> @@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
>  	struct __packed __arch_relative_insn {
>  		u8 op;
>  		s32 raddr;
> -	} *call = (struct __arch_relative_insn *) insn;
> +	} *jmp = (struct __arch_relative_insn *) insn;
>  
> -	if (!is_call_insn(insn))
> +	if (!is_jmp_insn(&jmp->op))
>  		return false;
> -	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> +
> +	guard(rcu)();
> +	/*
> +	 * resolve_uprobe_addr() expects IP pointing after syscall instruction
> +	 * (after the slot, basically), so adjust jump target address accordingly
> +	 */
> +	return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
>  }
>  
>  static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
> @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
>  {
>  	struct uprobe_trampoline *tramp;
>  	struct vm_area_struct *vma;
> +	unsigned long slot_vaddr;
>  	bool new = false;
> -	int err = 0;
> +	int slot, err;
>  
>  	vma = find_vma(mm, vaddr);
>  	if (!vma)
> @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
>  	tramp = get_uprobe_trampoline(vaddr, &new);
>  	if (!tramp)
>  		return -EINVAL;
> -	err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> -	if (WARN_ON_ONCE(err) && new)
> +
> +	slot = tramp_alloc_slot(tramp, vaddr);
> +	if (slot < 0) {
> +		if (new)
> +			destroy_uprobe_trampoline(tramp);
> +		return slot;
> +	}
> +
> +	slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
> +	err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
> +	if (err && new)
>  		destroy_uprobe_trampoline(tramp);
>  	return err;
>  }
> diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> index 4f19a0d79b0c..1b6c113357b2 100644
> --- a/tools/lib/bpf/features.c
> +++ b/tools/lib/bpf/features.c
> @@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
>  static int probe_uprobe_syscall(int token_fd)
>  {
>  	/*
> -	 * If kernel supports uprobe() syscall, it will return -ENXIO when called
> -	 * from the outside of a kernel-generated uprobe trampoline.
> +	 * If kernel supports uprobe() syscall, it will return -EPROTO when
> +	 * called from outside a kernel-generated uprobe trampoline.
> +	 * Older kernels with the red-zone-clobbering bug return -ENXIO;
> +	 * we only enable the nop5 optimization on fixed kernels.
>  	 */
> -	return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> +	return syscall(__NR_uprobe) < 0 && errno == EPROTO;
>  }
>  #else
>  static int probe_uprobe_syscall(int token_fd)
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 955a37751b52..0d5eb4cd1ddf 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
>  	/* .. and check the trampoline is as expected. */
>  	call = (struct __arch_relative_insn *) addr;
>  	tramp = (void *) (call + 1) + call->raddr;
> -	ASSERT_EQ(call->op, 0xe8, "call");
> +	tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
> +	ASSERT_EQ(call->op, 0xe9, "jmp");
>  	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
>  
>  	return tramp;
> @@ -762,7 +763,7 @@ static void test_uprobe_error(void)
>  	long err = syscall(__NR_uprobe);
>  
>  	ASSERT_EQ(err, -1, "error");
> -	ASSERT_EQ(errno, ENXIO, "errno");
> +	ASSERT_EQ(errno, EPROTO, "errno");
>  }
>  
>  static void __test_uprobe_syscall(void)
> diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> index 69759b27794d..9d3744d4e936 100644
> --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> @@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
>  	ASSERT_EQ(*addr_2, 0x90, "nop");
>  
>  	/* call is on addr_2 + 1 address */
> -	ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
> +	ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
>  	ASSERT_EQ(skel->bss->executed, 4, "executed");
>  
>  cleanup:
> -- 
> 2.53.0-Meta
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-09  0:30 [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2026-05-10 21:25 ` Jiri Olsa
@ 2026-05-11 14:45 ` Oleg Nesterov
  2026-05-11 16:56   ` Andrii Nakryiko
  3 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-11 14:45 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, linux-trace-kernel, jolsa, peterz, mingo, mhiramat

On 05/08, Andrii Nakryiko wrote:
>
> +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
>  {
> -	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> +	struct uprobes_state *state = &current->mm->uprobes_state;

it seems that there is a problem, with or without this change...

Lets forget about this patch for the moment, I am still trying to understand it.

What if register_for_each_vma() calls install_breakpoint(vma) and
vma->mm != current->mm ?

In this case install_breakpoint() path will call __is_optimized() and then
__in_uprobe_trampoline() which does vma_lookup(current->mm). This looks
obviously wrong ?

And unless I am totally confused, this patch "inherits" the problem...

No?

Oleg.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-09  2:02 ` [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization sashiko-bot
@ 2026-05-11 16:38   ` Andrii Nakryiko
  2026-05-11 16:53     ` Andrii Nakryiko
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-11 16:38 UTC (permalink / raw)
  To: sashiko; +Cc: Andrii Nakryiko, bpf

On Fri, May 8, 2026 at 7:02 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [Critical] `resolve_uprobe_addr()` unconditionally uses `current->mm`, causing `is_optimized()` to fail when called from tracer context, which leads to target instruction stream corruption and potential kernel panics.
> - [High] Exhausting a process's trampoline page slots permanently disables nop5 optimization for that uprobe across the entire system.
> --
>
> commit 2c63c19710be8b050bc0e126d84e52c24209e062
> Author: Andrii Nakryiko <andrii@kernel.org>
> Date:   Sat May 9 00:30:56 2026 +0000
>
> uprobes/x86: Fix red zone clobbering in nop5 optimization
>
> The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> probe site with a CALL into a uprobe trampoline. This commit changes the
> optimization to use JMP instead, introducing a trampoline page with slots
> to handle register state saving and jumping to the syscall.
>
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d0..7e1f14200bbbf 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
>
> [ ... ]
>
> > @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> [ ... ]
> > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> >  {
> > -     struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > +     struct uprobes_state *state = &current->mm->uprobes_state;
>
> Can this cause a NULL pointer dereference or instruction corruption when
> executed outside the target task's context?
>

...

>
> Could we pass the correct mm struct down from is_optimized() to
> resolve_uprobe_addr() instead of relying on current->mm?

yes, good observation, I'll change resolve_uprobe_addr() to accept
mm_struct directly

>
> [ ... ]
>
> > @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> >  {
> >       struct uprobe_trampoline *tramp;
> >       struct vm_area_struct *vma;
> > +     unsigned long slot_vaddr;
> >       bool new = false;
> > -     int err = 0;
> > +     int slot, err;
> >
> >       vma = find_vma(mm, vaddr);
> >       if (!vma)
> > @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> >       tramp = get_uprobe_trampoline(vaddr, &new);
> >       if (!tramp)
> >               return -EINVAL;
> > -     err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> > -     if (WARN_ON_ONCE(err) && new)
> > +
> > +     slot = tramp_alloc_slot(tramp, vaddr);
> > +     if (slot < 0) {
> > +             if (new)
> > +                     destroy_uprobe_trampoline(tramp);
> > +             return slot;
> > +     }
>
> If a process exhausts its trampoline page slots, will this permanently
> disable the nop5 optimization for this uprobe across the entire system?
>
> If get_uprobe_trampoline() returns a reachable trampoline that is completely
> full (all 256 slots used), tramp_alloc_slot() returns -ENOSPC. This error
> propagates up to arch_uprobe_optimize(), which handles failures by setting
> the ARCH_UPROBE_FLAG_OPTIMIZE_FAIL bit on auprobe->flags.
>
> Since struct arch_uprobe is shared across all processes tracing the same
> inode and offset, hitting the 256-slot limit in any single process seems
> to permanently blacklist the probe from being optimized system-wide for
> all other processes, even if those processes have completely empty
> trampolines.
>
> Should get_uprobe_trampoline() also check if the trampoline has available
> slots before returning it, or can the failure be handled without setting
> the global failure flag?

It's a trade off. If we don't set ARCH_UPROBE_FLAG_OPTIMIZE_FAIL,
we'll keep wasting time trying to find and use uprobe trampoline for
each uprobe execution in that process, which is wasteful. So, given
this is a performance optimization, I think we'll just live with this.
The alternative would be a per-mm+per-uprobe extra information, which
seems like an overkill, so probably a no-go.

>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii@kernel.org?part=1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-10 21:25 ` Jiri Olsa
@ 2026-05-11 16:41   ` Andrii Nakryiko
  2026-05-12  5:14   ` Masami Hiramatsu
  1 sibling, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-11 16:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz, mingo,
	mhiramat

On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > user code may keep temporary data without adjusting rsp.
> >
> > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > also does not provide a return address. Replace the single trampoline
> > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > assigned slot, the slot moves rsp below the red zone, saves the registers
> > clobbered by syscall, and invokes the uprobe syscall:
> >
> >   Probe site:   jmp slot_N              (5B, replaces nop5)
> >
> >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> >                 push %rcx               (1B)  save (syscall clobbers)
> >                 push %r11               (2B)  save (syscall clobbers)
> >                 push %rax               (1B)  save (syscall uses for nr)
> >                 mov  $336, %eax         (5B)  uprobe syscall number
> >                 syscall                 (2B)
> >
> > All slots contain identical code at different offsets, so the trampoline
> > page is generated once at boot and mapped read-execute into each process.
> > The syscall handler identifies the slot from regs->ip, which points just
> > after the syscall instruction, and uses a per-mm slot table to recover the
> > original probe address.
> >
> > The uprobe syscall does not return to the trampoline slot. The handler
> > restores the probe-site register state, runs the uprobe consumers, sets
> > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > execution, and returns directly through the IRET path. This preserves
> > general purpose registers, including rcx and r11, without requiring any
> > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > shadow stack concerns.
> >
> > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > without taking mmap_lock. The optimized-instruction detection path also
> > walks the trampoline list under an RCU read-side lock. Since that path
> > starts from the JMP target, it translates the slot start to the post-syscall
> > IP expected by the shared resolver before checking the trampoline mapping.
> >
> > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > to their first probe address and are reused only when the same address is
> > probed again. Reassigning detached slots is deliberately avoided because a
> > thread can remain in a trampoline for an unbounded time due to ptrace,
> > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > path.
> >
> > Require the entire trampoline page to be reachable by a rel32 JMP before
> > reusing it for a probe. This keeps every slot in the page within the range
> > that can be encoded at the probe site.
> >
> > Change the error code returned when the uprobe syscall is invoked outside
> > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > similar libraries distinguish fixed kernels from kernels with the
> > red-zone-clobbering implementation and enable nop5 optimization only on
> > fixed kernels.
> >
> > Performance (usdt single-thread, M/s):
> >
> >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> >
> > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > measured systems. The regression relative to the old CALL-based trampoline
> > comes from IRET being more expensive than SYSRET, most noticeably on older
> > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > and AMD Milan improves because removing mmap_lock from the hot path more
> > than offsets the IRET cost.
> >
> > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
>
> hi,
> thanks a lot for the fix
>
> FWIW we discussed also an option to have 10-bytes nop and do:
>   [rsp+0x80, call trampoline]
>
> we would not need the slots re-use logic, but not sure what other
> surprises there are with 10-bytes nop
>
> I tried that change [1], it seems to work, but it has other
> difficulties, like I think the unoptimized path needs to do:
>   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> instead of patching back the 10-byte nop, because some thread
> could be inside the nop area already.
>

Yeah, nop10 and this jump-over-nop10 approach is an alternative. I
don't have strong feelings apart from the ridiculousness of a 10-byte
nop :)

did you get a chance to benchmark your nop10 approach, curious how do
the number look like

> jirka
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
>
> >
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  arch/x86/include/asm/uprobes.h                |  18 ++
> >  arch/x86/kernel/uprobes.c                     | 262 ++++++++++--------
> >  tools/lib/bpf/features.c                      |   8 +-
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c |   5 +-
> >  tools/testing/selftests/bpf/prog_tests/usdt.c |   2 +-
> >  5 files changed, 181 insertions(+), 114 deletions(-)
> >

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-11 16:38   ` Andrii Nakryiko
@ 2026-05-11 16:53     ` Andrii Nakryiko
  0 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-11 16:53 UTC (permalink / raw)
  To: sashiko; +Cc: Andrii Nakryiko, bpf

On Mon, May 11, 2026 at 9:38 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, May 8, 2026 at 7:02 PM <sashiko-bot@kernel.org> wrote:
> >
> > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> > - [Critical] `resolve_uprobe_addr()` unconditionally uses `current->mm`, causing `is_optimized()` to fail when called from tracer context, which leads to target instruction stream corruption and potential kernel panics.
> > - [High] Exhausting a process's trampoline page slots permanently disables nop5 optimization for that uprobe across the entire system.
> > --
> >
> > commit 2c63c19710be8b050bc0e126d84e52c24209e062
> > Author: Andrii Nakryiko <andrii@kernel.org>
> > Date:   Sat May 9 00:30:56 2026 +0000
> >
> > uprobes/x86: Fix red zone clobbering in nop5 optimization
> >
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. This commit changes the
> > optimization to use JMP instead, introducing a trampoline page with slots
> > to handle register state saving and jumping to the syscall.
> >
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index ebb1baf1eb1d0..7e1f14200bbbf 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> >
> > [ ... ]
> >
> > > @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> > [ ... ]
> > > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> > >  {
> > > -     struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > > +     struct uprobes_state *state = &current->mm->uprobes_state;
> >
> > Can this cause a NULL pointer dereference or instruction corruption when
> > executed outside the target task's context?
> >
>
> ...
>
> >
> > Could we pass the correct mm struct down from is_optimized() to
> > resolve_uprobe_addr() instead of relying on current->mm?
>
> yes, good observation, I'll change resolve_uprobe_addr() to accept
> mm_struct directly
>
> >
> > [ ... ]
> >
> > > @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > >  {
> > >       struct uprobe_trampoline *tramp;
> > >       struct vm_area_struct *vma;
> > > +     unsigned long slot_vaddr;
> > >       bool new = false;
> > > -     int err = 0;
> > > +     int slot, err;
> > >
> > >       vma = find_vma(mm, vaddr);
> > >       if (!vma)
> > > @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> > >       tramp = get_uprobe_trampoline(vaddr, &new);
> > >       if (!tramp)
> > >               return -EINVAL;
> > > -     err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> > > -     if (WARN_ON_ONCE(err) && new)
> > > +
> > > +     slot = tramp_alloc_slot(tramp, vaddr);
> > > +     if (slot < 0) {
> > > +             if (new)
> > > +                     destroy_uprobe_trampoline(tramp);
> > > +             return slot;
> > > +     }
> >
> > If a process exhausts its trampoline page slots, will this permanently
> > disable the nop5 optimization for this uprobe across the entire system?
> >
> > If get_uprobe_trampoline() returns a reachable trampoline that is completely
> > full (all 256 slots used), tramp_alloc_slot() returns -ENOSPC. This error
> > propagates up to arch_uprobe_optimize(), which handles failures by setting
> > the ARCH_UPROBE_FLAG_OPTIMIZE_FAIL bit on auprobe->flags.
> >
> > Since struct arch_uprobe is shared across all processes tracing the same
> > inode and offset, hitting the 256-slot limit in any single process seems
> > to permanently blacklist the probe from being optimized system-wide for
> > all other processes, even if those processes have completely empty
> > trampolines.
> >
> > Should get_uprobe_trampoline() also check if the trampoline has available
> > slots before returning it, or can the failure be handled without setting
> > the global failure flag?
>
> It's a trade off. If we don't set ARCH_UPROBE_FLAG_OPTIMIZE_FAIL,
> we'll keep wasting time trying to find and use uprobe trampoline for
> each uprobe execution in that process, which is wasteful. So, given
> this is a performance optimization, I think we'll just live with this.
> The alternative would be a per-mm+per-uprobe extra information, which
> seems like an overkill, so probably a no-go.

We can have per-mm flag inside uprobes_state common to all uprobes,
but that will make reattachment to previously attached uprobe with
allocated slot not work, if I understand correctly.

Jiri, any thoughts or ideas?

>
> >
> > --
> > Sashiko AI review · https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii@kernel.org?part=1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-11 14:45 ` Oleg Nesterov
@ 2026-05-11 16:56   ` Andrii Nakryiko
  2026-05-11 17:24     ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-11 16:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, bpf, linux-trace-kernel, jolsa, peterz, mingo,
	mhiramat

On Mon, May 11, 2026 at 7:45 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 05/08, Andrii Nakryiko wrote:
> >
> > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> >  {
> > -     struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > +     struct uprobes_state *state = &current->mm->uprobes_state;
>
> it seems that there is a problem, with or without this change...
>
> Lets forget about this patch for the moment, I am still trying to understand it.
>
> What if register_for_each_vma() calls install_breakpoint(vma) and
> vma->mm != current->mm ?
>
> In this case install_breakpoint() path will call __is_optimized() and then
> __in_uprobe_trampoline() which does vma_lookup(current->mm). This looks
> obviously wrong ?
>
> And unless I am totally confused, this patch "inherits" the problem...
>
> No?
>

Yes, sashiko called this out as well. I'm going to pass mm explicitly
into is_optimized and all the way to resolve_uprobe_addr(), easy fix.

But we should have a stand-alone easily backportable fix for this even
without my changes, yes. I can add it as pre-patch or send separately.

> Oleg.
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering
  2026-05-09  2:12   ` sashiko-bot
@ 2026-05-11 16:58     ` Andrii Nakryiko
  0 siblings, 0 replies; 13+ messages in thread
From: Andrii Nakryiko @ 2026-05-11 16:58 UTC (permalink / raw)
  To: sashiko; +Cc: Andrii Nakryiko, bpf

On Fri, May 8, 2026 at 7:12 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] The `optimized_red_zone` USDT subtest is brittle and may produce false positive passes because it relies on the compiler arbitrarily placing operands in the stack's red zone.
> --
>
> commit 1fbbd69b145cd2d80be83d6482e65179431782ac
> Author: Andrii Nakryiko <andrii@kernel.org>
>
> selftests/bpf: Add tests for uprobe nop5 red zone clobbering
>
> This commit adds tests to verify that the uprobe nop5 CALL optimization does
> not clobber the x86-64 stack red zone. It introduces an assembly-level test
> and a USDT subtest which intends to trigger a probe in a function where the
> compiler places operands in the red zone.
>
> > diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> > index 789883aaca4c3..fc7e6d220a38b 100644
> > --- a/tools/testing/selftests/bpf/usdt_2.c
> > +++ b/tools/testing/selftests/bpf/usdt_2.c
>
> [ ... ]
>
> > @@ -13,4 +13,17 @@ void usdt_2(void)
> >       USDT(optimized_attach, usdt_2);
> >  }
> >
> > +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> > +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> > +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> > +
> > +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> > +{
> > +     unsigned long a1 = usdt_red_zone_arg1;
> > +     unsigned long a2 = usdt_red_zone_arg2;
> > +     unsigned long a3 = usdt_red_zone_arg3;
> > +
> > +     USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
>
> Is this test reliable across different compiler versions and optimizations?
>

No it's not, which is why I added an assembly-based test as well. It
doesn't hurt to have a C-level example which does trigger a red zone
issue in practice to show the issue is not hypothetical. But I can
also drop it, if people think it's unnecessary.

> The x86-64 USDT macro uses the "nor" inline assembly constraint, which allows
> the compiler to allocate these operands to CPU registers. Because there is no
> register pressure in usdt_red_zone_trigger(), modern compilers might allocate
> the local variables a1, a2, and a3 directly to registers rather than the
> stack's red zone.

we compile selftests in -O0 mode, so unlikely, but in release mode it
will definitely be optimized like that

>
> If the operands are placed in registers, the uprobe CALL optimization
> clobbering [rsp-8] will not corrupt them. This could cause the test to pass
> even on buggy kernels, creating a false positive.
>
> Would it be safer to explicitly force these operands into the red zone
> using inline assembly constraints, rather than depending on the compiler's
> register allocator?
>
> > +}
> > +
> >  #endif
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii@kernel.org?part=2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-11 16:56   ` Andrii Nakryiko
@ 2026-05-11 17:24     ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2026-05-11 17:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-trace-kernel, jolsa, peterz, mingo,
	mhiramat

On 05/11, Andrii Nakryiko wrote:
>
> On Mon, May 11, 2026 at 7:45 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 05/08, Andrii Nakryiko wrote:
> > >
> > > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> > >  {
> > > -     struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > > +     struct uprobes_state *state = &current->mm->uprobes_state;
> >
> > it seems that there is a problem, with or without this change...
> >
> > Lets forget about this patch for the moment, I am still trying to understand it.
> >
> > What if register_for_each_vma() calls install_breakpoint(vma) and
> > vma->mm != current->mm ?
> >
> > In this case install_breakpoint() path will call __is_optimized() and then
> > __in_uprobe_trampoline() which does vma_lookup(current->mm). This looks
> > obviously wrong ?
> >
> > And unless I am totally confused, this patch "inherits" the problem...
> >
> > No?
> >
>
> Yes, sashiko called this out as well. I'm going to pass mm explicitly
> into is_optimized and all the way to resolve_uprobe_addr(), easy fix.

Aah... found it: https://sashiko.dev/#/patchset/20260509003146.976844-1-andrii%40kernel.org

And note that sashiko is more clever than me; I didn't realize that _unregister()
has more problems with current->mm :/

> But we should have a stand-alone easily backportable fix for this even
> without my changes, yes. I can add it as pre-patch or send separately.

Agreed. This deserves another/simple fix. If nothing else, for documentation.

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
  2026-05-10 21:25 ` Jiri Olsa
  2026-05-11 16:41   ` Andrii Nakryiko
@ 2026-05-12  5:14   ` Masami Hiramatsu
  1 sibling, 0 replies; 13+ messages in thread
From: Masami Hiramatsu @ 2026-05-12  5:14 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, bpf, linux-trace-kernel, oleg, peterz, mingo,
	mhiramat

On Sun, 10 May 2026 23:25:26 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:

> On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > user code may keep temporary data without adjusting rsp.
> > 
> > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > also does not provide a return address. Replace the single trampoline
> > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > assigned slot, the slot moves rsp below the red zone, saves the registers
> > clobbered by syscall, and invokes the uprobe syscall:
> > 
> >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > 
> >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> >                 push %rcx               (1B)  save (syscall clobbers)
> >                 push %r11               (2B)  save (syscall clobbers)
> >                 push %rax               (1B)  save (syscall uses for nr)
> >                 mov  $336, %eax         (5B)  uprobe syscall number
> >                 syscall                 (2B)
> > 
> > All slots contain identical code at different offsets, so the trampoline
> > page is generated once at boot and mapped read-execute into each process.
> > The syscall handler identifies the slot from regs->ip, which points just
> > after the syscall instruction, and uses a per-mm slot table to recover the
> > original probe address.
> > 
> > The uprobe syscall does not return to the trampoline slot. The handler
> > restores the probe-site register state, runs the uprobe consumers, sets
> > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > execution, and returns directly through the IRET path. This preserves
> > general purpose registers, including rcx and r11, without requiring any
> > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > shadow stack concerns.
> > 
> > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > without taking mmap_lock. The optimized-instruction detection path also
> > walks the trampoline list under an RCU read-side lock. Since that path
> > starts from the JMP target, it translates the slot start to the post-syscall
> > IP expected by the shared resolver before checking the trampoline mapping.
> > 
> > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > to their first probe address and are reused only when the same address is
> > probed again. Reassigning detached slots is deliberately avoided because a
> > thread can remain in a trampoline for an unbounded time due to ptrace,
> > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > path.
> > 
> > Require the entire trampoline page to be reachable by a rel32 JMP before
> > reusing it for a probe. This keeps every slot in the page within the range
> > that can be encoded at the probe site.
> > 
> > Change the error code returned when the uprobe syscall is invoked outside
> > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > similar libraries distinguish fixed kernels from kernels with the
> > red-zone-clobbering implementation and enable nop5 optimization only on
> > fixed kernels.
> > 
> > Performance (usdt single-thread, M/s):
> > 
> >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > 
> > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > measured systems. The regression relative to the old CALL-based trampoline
> > comes from IRET being more expensive than SYSRET, most noticeably on older
> > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > and AMD Milan improves because removing mmap_lock from the hot path more
> > than offsets the IRET cost.
> > 
> > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> 
> hi,
> thanks a lot for the fix
> 
> FWIW we discussed also an option to have 10-bytes nop and do:
>   [rsp+0x80, call trampoline]
> 
> we would not need the slots re-use logic, but not sure what other
> surprises there are with 10-bytes nop

Does this mean we have to update UDST implementation?

> 
> I tried that change [1], it seems to work, but it has other
> difficulties, like I think the unoptimized path needs to do:
>   [rsp+0x80, call trampoline] -> [jmp end of 10-bytes nop]
> instead of patching back the 10-byte nop, because some thread
> could be inside the nop area already.

Yeah, but at that moment, we know where the modified code is.
Maybe memory dump shows different code, but that is also true
if uprobe is active. So I think it is OK.

Thanks,

> 
> jirka
> 
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=redzone_fix&id=74b09240289dba8368c2783b771e678b2cc31574
> 
> > 
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  arch/x86/include/asm/uprobes.h                |  18 ++
> >  arch/x86/kernel/uprobes.c                     | 262 ++++++++++--------
> >  tools/lib/bpf/features.c                      |   8 +-
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c |   5 +-
> >  tools/testing/selftests/bpf/prog_tests/usdt.c |   2 +-
> >  5 files changed, 181 insertions(+), 114 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index 362210c79998..a7cf5c92d95a 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -25,6 +25,24 @@ enum {
> >  	ARCH_UPROBE_FLAG_OPTIMIZE_FAIL  = 1,
> >  };
> >  
> > +/*
> > + * Trampoline page layout: identical 16-byte slots, each containing:
> > + *   lea  -128(%rsp), %rsp (5B)  skip red zone
> > + *   push %rcx             (1B)  save (syscall clobbers)
> > + *   push %r11             (2B)  save (syscall clobbers)
> > + *   push %rax             (1B)  save (syscall uses for nr)
> > + *   mov  $336, %eax       (5B)  uprobe syscall number
> > + *   syscall               (2B)
> > + *                        = 16B, no padding needed
> > + *
> > + * The handler identifies which probe fired from regs->ip (each
> > + * slot is at a unique offset), looks up the probe address from a
> > + * per-process table, and returns directly to probe_addr+5 via iret
> > + * with all registers restored.
> > + */
> > +#define UPROBE_TRAMP_SLOT_SIZE	16
> > +#define UPROBE_TRAMP_MAX_SLOTS	(PAGE_SIZE / UPROBE_TRAMP_SLOT_SIZE)
> > +
> >  struct uprobe_xol_ops;
> >  
> >  struct arch_uprobe {
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..7e1f14200bbb 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -633,16 +633,25 @@ static struct vm_special_mapping tramp_mapping = {
> >  
> >  struct uprobe_trampoline {
> >  	struct hlist_node	node;
> > +	struct rcu_head		rcu;
> >  	unsigned long		vaddr;
> > +	unsigned long		probe_addrs[UPROBE_TRAMP_MAX_SLOTS];
> >  };
> >  
> > -static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> > +
> > +static bool is_reachable_by_jmp(unsigned long dst, unsigned long src)
> >  {
> > -	long delta = (long)(vaddr + 5 - vtramp);
> > +	long delta = (long)(dst - (src + JMP32_INSN_SIZE));
> >  
> >  	return delta >= INT_MIN && delta <= INT_MAX;
> >  }
> >  
> > +static bool is_reachable_by_trampoline(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +	return is_reachable_by_jmp(vtramp, vaddr) &&
> > +	       is_reachable_by_jmp(vtramp + PAGE_SIZE - 1, vaddr);
> > +}
> > +
> >  static unsigned long find_nearest_trampoline(unsigned long vaddr)
> >  {
> >  	struct vm_unmapped_area_info info = {
> > @@ -711,6 +720,21 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> >  	return tramp;
> >  }
> >  
> > +static int tramp_alloc_slot(struct uprobe_trampoline *tramp, unsigned long probe_addr)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++) {
> > +		if (tramp->probe_addrs[i] == probe_addr)
> > +			return i;
> > +		if (tramp->probe_addrs[i] == 0) {
> > +			tramp->probe_addrs[i] = probe_addr;
> > +			return i;
> > +		}
> > +	}
> > +	return -ENOSPC;
> > +}
> > +
> >  static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> >  {
> >  	struct uprobes_state *state = &current->mm->uprobes_state;
> > @@ -720,7 +744,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> >  		return NULL;
> >  
> >  	hlist_for_each_entry(tramp, &state->head_tramps, node) {
> > -		if (is_reachable_by_call(tramp->vaddr, vaddr)) {
> > +		if (is_reachable_by_trampoline(tramp->vaddr, vaddr)) {
> >  			*new = false;
> >  			return tramp;
> >  		}
> > @@ -731,7 +755,7 @@ static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool
> >  		return NULL;
> >  
> >  	*new = true;
> > -	hlist_add_head(&tramp->node, &state->head_tramps);
> > +	hlist_add_head_rcu(&tramp->node, &state->head_tramps);
> >  	return tramp;
> >  }
> >  
> > @@ -742,8 +766,8 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> >  	 * because there's no easy way to make sure none of the threads
> >  	 * is still inside the trampoline.
> >  	 */
> > -	hlist_del(&tramp->node);
> > -	kfree(tramp);
> > +	hlist_del_rcu(&tramp->node);
> > +	kfree_rcu(tramp, rcu);
> >  }
> >  
> >  void arch_uprobe_init_state(struct mm_struct *mm)
> > @@ -761,147 +785,153 @@ void arch_uprobe_clear_state(struct mm_struct *mm)
> >  		destroy_uprobe_trampoline(tramp);
> >  }
> >  
> > -static bool __in_uprobe_trampoline(unsigned long ip)
> > +/*
> > + * Find the trampoline containing @ip. If @probe_addr is non-NULL, also
> > + * resolve the slot index from @ip and return the probe address.
> > + *
> > + * @ip is expected to point right after the syscall instruction, i.e.,
> > + * at the end of the slot (slot_start + UPROBE_TRAMP_SLOT_SIZE).
> > + */
> > +static bool resolve_uprobe_addr(unsigned long ip, unsigned long *probe_addr)
> >  {
> > -	struct vm_area_struct *vma = vma_lookup(current->mm, ip);
> > +	struct uprobes_state *state = &current->mm->uprobes_state;
> > +	struct uprobe_trampoline *tramp;
> >  
> > -	return vma && vma_is_special_mapping(vma, &tramp_mapping);
> > -}
> > +	hlist_for_each_entry_rcu(tramp, &state->head_tramps, node) {
> > +		/*
> > +		 * ip points to after syscall, so it's on 16 byte boundary,
> > +		 * which means that valid ip can point right after the page
> > +		 * and should never be at zero offset within the page
> > +		 */
> > +		if (ip <= tramp->vaddr || ip > tramp->vaddr + PAGE_SIZE)
> > +			continue;
> >  
> > -static bool in_uprobe_trampoline(unsigned long ip)
> > -{
> > -	struct mm_struct *mm = current->mm;
> > -	bool found, retry = true;
> > -	unsigned int seq;
> > +		if (probe_addr) {
> > +			/* we already validated ip is within expected range */
> > +			unsigned int slot = (ip - tramp->vaddr - 1) / UPROBE_TRAMP_SLOT_SIZE;
> > +			unsigned long addr = tramp->probe_addrs[slot];
> >  
> > -	rcu_read_lock();
> > -	if (mmap_lock_speculate_try_begin(mm, &seq)) {
> > -		found = __in_uprobe_trampoline(ip);
> > -		retry = mmap_lock_speculate_retry(mm, seq);
> > -	}
> > -	rcu_read_unlock();
> > +			*probe_addr = addr;
> > +			if (addr == 0)
> > +				return false;
> > +		}
> >  
> > -	if (retry) {
> > -		mmap_read_lock(mm);
> > -		found = __in_uprobe_trampoline(ip);
> > -		mmap_read_unlock(mm);
> > +		return true;
> >  	}
> > -	return found;
> > +	return false;
> > +}
> > +
> > +static bool in_uprobe_trampoline(unsigned long ip, unsigned long *probe_addr)
> > +{
> > +	guard(rcu)();
> > +	return resolve_uprobe_addr(ip, probe_addr);
> >  }
> >  
> >  /*
> > - * See uprobe syscall trampoline; the call to the trampoline will push
> > - * the return address on the stack, the trampoline itself then pushes
> > - * cx, r11 and ax.
> > + * The trampoline slot pushes cx, r11, ax (the registers syscall clobbers)
> > + * before doing the uprobe syscall. No return address is pushed — the
> > + * probe site uses jmp, not call.
> >   */
> >  struct uprobe_syscall_args {
> >  	unsigned long ax;
> >  	unsigned long r11;
> >  	unsigned long cx;
> > -	unsigned long retaddr;
> >  };
> >  
> > +#define UPROBE_TRAMP_REDZONE 128
> > +
> >  SYSCALL_DEFINE0(uprobe)
> >  {
> >  	struct pt_regs *regs = task_pt_regs(current);
> >  	struct uprobe_syscall_args args;
> > -	unsigned long ip, sp, sret;
> > +	unsigned long probe_addr;
> >  	int err;
> >  
> >  	/* Allow execution only from uprobe trampolines. */
> > -	if (!in_uprobe_trampoline(regs->ip))
> > -		return -ENXIO;
> > +	if (!in_uprobe_trampoline(regs->ip, &probe_addr))
> > +		return -EPROTO;
> >  
> >  	err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
> >  	if (err)
> >  		goto sigill;
> >  
> > -	ip = regs->ip;
> > -
> >  	/*
> > -	 * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
> > -	 * - adjust ip to the probe address, call saved next instruction address
> > -	 * - adjust sp to the probe's stack frame (check trampoline code)
> > +	 * Restore the register state as it was at the probe site:
> > +	 * - ax/r11/cx from the trampoline-saved copies on user stack
> > +	 * - adjust ip to the probe address based on matching slot
> > +	 * - adjust sp to skip red zone and pushed args
> >  	 */
> >  	regs->ax  = args.ax;
> >  	regs->r11 = args.r11;
> >  	regs->cx  = args.cx;
> > -	regs->ip  = args.retaddr - 5;
> > -	regs->sp += sizeof(args);
> > +	regs->ip  = probe_addr;
> > +	regs->sp += sizeof(args) + UPROBE_TRAMP_REDZONE;
> >  	regs->orig_ax = -1;
> >  
> > -	sp = regs->sp;
> > -
> > -	err = shstk_pop((u64 *)&sret);
> > -	if (err == -EFAULT || (!err && sret != args.retaddr))
> > -		goto sigill;
> > -
> > -	handle_syscall_uprobe(regs, regs->ip);
> > +	handle_syscall_uprobe(regs, probe_addr);
> >  
> >  	/*
> > -	 * Some of the uprobe consumers has changed sp, we can do nothing,
> > -	 * just return via iret.
> > +	 * Skip the jmp instruction at the probe site (5 bytes) unless
> > +	 * a consumer redirected execution elsewhere.
> >  	 */
> > -	if (regs->sp != sp) {
> > -		/* skip the trampoline call */
> > -		if (args.retaddr - 5 == regs->ip)
> > -			regs->ip += 5;
> > -		return regs->ax;
> > -	}
> > +	if (regs->ip == probe_addr)
> > +		regs->ip = probe_addr + 5;
> >  
> > -	regs->sp -= sizeof(args);
> > -
> > -	/* for the case uprobe_consumer has changed ax/r11/cx */
> > -	args.ax  = regs->ax;
> > -	args.r11 = regs->r11;
> > -	args.cx  = regs->cx;
> > -
> > -	/* keep return address unless we are instructed otherwise */
> > -	if (args.retaddr - 5 != regs->ip)
> > -		args.retaddr = regs->ip;
> > -
> > -	if (shstk_push(args.retaddr) == -EFAULT)
> > -		goto sigill;
> > -
> > -	regs->ip = ip;
> > -
> > -	err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
> > -	if (err)
> > -		goto sigill;
> > -
> > -	/* ensure sysret, see do_syscall_64() */
> > -	regs->r11 = regs->flags;
> > -	regs->cx  = regs->ip;
> > -	return 0;
> > +	/*
> > +	 * Return via iret by returning regs->ax. This preserves all
> > +	 * GP registers (including cx and r11) without needing any
> > +	 * user-space cleanup code. The iret path is used because we
> > +	 * don't set up cx/r11 for sysret.
> > +	 */
> > +	return regs->ax;
> >  
> >  sigill:
> >  	force_sig(SIGILL);
> >  	return -1;
> >  }
> >  
> > +/*
> > + * All uprobe trampoline slots are identical: skip the red zone,
> > + * save the three registers that syscall clobbers, then invoke
> > + * the uprobe syscall. The handler returns directly to the probe
> > + * caller via iret. Execution never returns to the trampoline.
> > + */
> >  asm (
> >  	".pushsection .rodata\n"
> > -	".balign " __stringify(PAGE_SIZE) "\n"
> > -	"uprobe_trampoline_entry:\n"
> > +	".balign " __stringify(UPROBE_TRAMP_SLOT_SIZE) "\n"
> > +	"uprobe_trampoline_slot:\n"
> > +	"lea -128(%rsp), %rsp\n"
> >  	"push %rcx\n"
> >  	"push %r11\n"
> >  	"push %rax\n"
> > -	"mov $" __stringify(__NR_uprobe) ", %rax\n"
> > +	"mov $" __stringify(__NR_uprobe) ", %eax\n"
> >  	"syscall\n"
> > -	"pop %rax\n"
> > -	"pop %r11\n"
> > -	"pop %rcx\n"
> > -	"ret\n"
> > -	"int3\n"
> > -	".balign " __stringify(PAGE_SIZE) "\n"
> > +	"uprobe_trampoline_slot_end:\n"
> >  	".popsection\n"
> >  );
> >  
> > -extern u8 uprobe_trampoline_entry[];
> > +extern u8 uprobe_trampoline_slot[];
> > +extern u8 uprobe_trampoline_slot_end[];
> >  
> >  static int __init arch_uprobes_init(void)
> >  {
> > -	tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
> > +	unsigned int slot_size = uprobe_trampoline_slot_end - uprobe_trampoline_slot;
> > +	struct page *page;
> > +	u8 *page_addr;
> > +	int i;
> > +
> > +	BUILD_BUG_ON(UPROBE_TRAMP_SLOT_SIZE != 16);
> > +	WARN_ON_ONCE(slot_size != UPROBE_TRAMP_SLOT_SIZE);
> > +
> > +	page = alloc_page(GFP_KERNEL);
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	page_addr = page_address(page);
> > +	for (i = 0; i < UPROBE_TRAMP_MAX_SLOTS; i++)
> > +		memcpy(page_addr + i * UPROBE_TRAMP_SLOT_SIZE, uprobe_trampoline_slot, slot_size);
> > +
> > +	tramp_mapping_pages[0] = page;
> >  	return 0;
> >  }
> >  
> > @@ -909,7 +939,7 @@ late_initcall(arch_uprobes_init);
> >  
> >  enum {
> >  	EXPECT_SWBP,
> > -	EXPECT_CALL,
> > +	EXPECT_JMP,
> >  };
> >  
> >  struct write_opcode_ctx {
> > @@ -917,14 +947,14 @@ struct write_opcode_ctx {
> >  	int expect;
> >  };
> >  
> > -static int is_call_insn(uprobe_opcode_t *insn)
> > +static int is_jmp_insn(uprobe_opcode_t *insn)
> >  {
> > -	return *insn == CALL_INSN_OPCODE;
> > +	return *insn == JMP32_INSN_OPCODE;
> >  }
> >  
> >  /*
> >   * Verification callback used by int3_update uprobe_write calls to make sure
> > - * the underlying instruction is as expected - either int3 or call.
> > + * the underlying instruction is as expected - either int3 or jmp.
> >   */
> >  static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> >  		       int nbytes, void *data)
> > @@ -939,8 +969,8 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
> >  		if (is_swbp_insn(&old_opcode[0]))
> >  			return 1;
> >  		break;
> > -	case EXPECT_CALL:
> > -		if (is_call_insn(&old_opcode[0]))
> > +	case EXPECT_JMP:
> > +		if (is_jmp_insn(&old_opcode[0]))
> >  			return 1;
> >  		break;
> >  	}
> > @@ -978,7 +1008,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  	 * so we can skip this step for optimize == true.
> >  	 */
> >  	if (!optimize) {
> > -		ctx.expect = EXPECT_CALL;
> > +		ctx.expect = EXPECT_JMP;
> >  		err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
> >  				   true /* is_register */, false /* do_update_ref_ctr */,
> >  				   &ctx);
> > @@ -1015,13 +1045,13 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> >  }
> >  
> >  static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > -			 unsigned long vaddr, unsigned long tramp)
> > +			 unsigned long vaddr, unsigned long slot_vaddr)
> >  {
> > -	u8 call[5];
> > +	u8 jmp[5];
> >  
> > -	__text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
> > -			(const void *) tramp, CALL_INSN_SIZE);
> > -	return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
> > +	__text_gen_insn(jmp, JMP32_INSN_OPCODE, (const void *) vaddr,
> > +			(const void *) slot_vaddr, JMP32_INSN_SIZE);
> > +	return int3_update(auprobe, vma, vaddr, jmp, true /* optimize */);
> >  }
> >  
> >  static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > @@ -1049,11 +1079,17 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> >  	struct __packed __arch_relative_insn {
> >  		u8 op;
> >  		s32 raddr;
> > -	} *call = (struct __arch_relative_insn *) insn;
> > +	} *jmp = (struct __arch_relative_insn *) insn;
> >  
> > -	if (!is_call_insn(insn))
> > +	if (!is_jmp_insn(&jmp->op))
> >  		return false;
> > -	return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> > +
> > +	guard(rcu)();
> > +	/*
> > +	 * resolve_uprobe_addr() expects IP pointing after syscall instruction
> > +	 * (after the slot, basically), so adjust jump target address accordingly
> > +	 */
> > +	return resolve_uprobe_addr(vaddr + 5 + jmp->raddr + UPROBE_TRAMP_SLOT_SIZE, NULL);
> >  }
> >  
> >  static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
> > @@ -1113,8 +1149,9 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> >  {
> >  	struct uprobe_trampoline *tramp;
> >  	struct vm_area_struct *vma;
> > +	unsigned long slot_vaddr;
> >  	bool new = false;
> > -	int err = 0;
> > +	int slot, err;
> >  
> >  	vma = find_vma(mm, vaddr);
> >  	if (!vma)
> > @@ -1122,8 +1159,17 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
> >  	tramp = get_uprobe_trampoline(vaddr, &new);
> >  	if (!tramp)
> >  		return -EINVAL;
> > -	err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> > -	if (WARN_ON_ONCE(err) && new)
> > +
> > +	slot = tramp_alloc_slot(tramp, vaddr);
> > +	if (slot < 0) {
> > +		if (new)
> > +			destroy_uprobe_trampoline(tramp);
> > +		return slot;
> > +	}
> > +
> > +	slot_vaddr = tramp->vaddr + slot * UPROBE_TRAMP_SLOT_SIZE;
> > +	err = swbp_optimize(auprobe, vma, vaddr, slot_vaddr);
> > +	if (err && new)
> >  		destroy_uprobe_trampoline(tramp);
> >  	return err;
> >  }
> > diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> > index 4f19a0d79b0c..1b6c113357b2 100644
> > --- a/tools/lib/bpf/features.c
> > +++ b/tools/lib/bpf/features.c
> > @@ -577,10 +577,12 @@ static int probe_ldimm64_full_range_off(int token_fd)
> >  static int probe_uprobe_syscall(int token_fd)
> >  {
> >  	/*
> > -	 * If kernel supports uprobe() syscall, it will return -ENXIO when called
> > -	 * from the outside of a kernel-generated uprobe trampoline.
> > +	 * If kernel supports uprobe() syscall, it will return -EPROTO when
> > +	 * called from outside a kernel-generated uprobe trampoline.
> > +	 * Older kernels with the red-zone-clobbering bug return -ENXIO;
> > +	 * we only enable the nop5 optimization on fixed kernels.
> >  	 */
> > -	return syscall(__NR_uprobe) < 0 && errno == ENXIO;
> > +	return syscall(__NR_uprobe) < 0 && errno == EPROTO;
> >  }
> >  #else
> >  static int probe_uprobe_syscall(int token_fd)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 955a37751b52..0d5eb4cd1ddf 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -422,7 +422,8 @@ static void *check_attach(struct uprobe_syscall_executed *skel, trigger_t trigge
> >  	/* .. and check the trampoline is as expected. */
> >  	call = (struct __arch_relative_insn *) addr;
> >  	tramp = (void *) (call + 1) + call->raddr;
> > -	ASSERT_EQ(call->op, 0xe8, "call");
> > +	tramp = (void *)((unsigned long)tramp & ~(getpagesize() - 1UL));
> > +	ASSERT_EQ(call->op, 0xe9, "jmp");
> >  	ASSERT_OK(find_uprobes_trampoline(tramp), "uprobes_trampoline");
> >  
> >  	return tramp;
> > @@ -762,7 +763,7 @@ static void test_uprobe_error(void)
> >  	long err = syscall(__NR_uprobe);
> >  
> >  	ASSERT_EQ(err, -1, "error");
> > -	ASSERT_EQ(errno, ENXIO, "errno");
> > +	ASSERT_EQ(errno, EPROTO, "errno");
> >  }
> >  
> >  static void __test_uprobe_syscall(void)
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index 69759b27794d..9d3744d4e936 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > @@ -329,7 +329,7 @@ static void subtest_optimized_attach(void)
> >  	ASSERT_EQ(*addr_2, 0x90, "nop");
> >  
> >  	/* call is on addr_2 + 1 address */
> > -	ASSERT_EQ(*(addr_2 + 1), 0xe8, "call");
> > +	ASSERT_EQ(*(addr_2 + 1), 0xe9, "jmp");
> >  	ASSERT_EQ(skel->bss->executed, 4, "executed");
> >  
> >  cleanup:
> > -- 
> > 2.53.0-Meta
> > 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-05-12  5:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-09  0:30 [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization Andrii Nakryiko
2026-05-09  0:30 ` [PATCH bpf 2/2] selftests/bpf: Add tests for uprobe nop5 red zone clobbering Andrii Nakryiko
2026-05-09  2:12   ` sashiko-bot
2026-05-11 16:58     ` Andrii Nakryiko
2026-05-09  2:02 ` [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization sashiko-bot
2026-05-11 16:38   ` Andrii Nakryiko
2026-05-11 16:53     ` Andrii Nakryiko
2026-05-10 21:25 ` Jiri Olsa
2026-05-11 16:41   ` Andrii Nakryiko
2026-05-12  5:14   ` Masami Hiramatsu
2026-05-11 14:45 ` Oleg Nesterov
2026-05-11 16:56   ` Andrii Nakryiko
2026-05-11 17:24     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox