From: Jiri Olsa <olsajiri@gmail.com>
To: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <olsajiri@gmail.com>,
Andrii Nakryiko <andrii@kernel.org>,
bpf@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
oleg@redhat.com, peterz@infradead.org, mingo@kernel.org
Subject: Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
Date: Tue, 12 May 2026 19:06:27 +0200
Message-ID: <agNeEzjiThzmJHiP@krava>
In-Reply-To: <20260512141431.a70375744fdae263bda5b722@kernel.org>
On Tue, May 12, 2026 at 02:14:31PM +0900, Masami Hiramatsu wrote:
> On Sun, 10 May 2026 23:25:26 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
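
a leaf function shows the hazard nicely; a minimal sketch of typical
SysV ABI compiler output, not taken from the patch:

	leaf:
		movq %rdi, -8(%rsp)	# spill into the red zone,
		...			# rsp deliberately not adjusted
		ret

a CALL patched over a nop5 inside leaf pushes its return address right
over that spill.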
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot; the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > > Probe site: jmp slot_N (5B, replaces nop5)
> > >
> > > Slot N: lea -128(%rsp), %rsp (5B) skip red zone
> > > push %rcx (1B) save (syscall clobbers)
> > > push %r11 (2B) save (syscall clobbers)
> > > push %rax (1B) save (syscall uses for nr)
> > > mov $336, %eax (5B) uprobe syscall number
> > > syscall (2B)
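
(the listed slot instructions total 5+1+2+1+5+2 = 16 bytes, which is what
lets 256 slots fill one 4096-byte trampoline page, matching the slot count
given below)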
> > >
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
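
for illustration, recovering the probe address from regs->ip could look
roughly like this kernel-style sketch; slot_probe_addr[] is an assumed
name for the per-mm slot table, only tramp->vaddr comes from the patch:

	#define TRAMP_SLOT_SIZE 16

	static unsigned long resolve_probe_addr(struct uprobe_trampoline *tramp,
						unsigned long ip)
	{
		/* ip points just past the syscall insn, i.e. at the end of
		 * slot N, so backing up one slot gives the slot index */
		unsigned long slot = (ip - tramp->vaddr) / TRAMP_SLOT_SIZE - 1;

		return tramp->slot_probe_addr[slot];	/* assumed table */
	}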
> > >
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
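
as a rough sketch of that fixup (pt_regs is real, the helper name and
exact ordering are assumptions made for the example):

	/* regs->ip was set to the recovered probe address above */
	unsigned long probe_addr = regs->ip;

	run_uprobe_consumers(regs);		/* may redirect regs->ip */

	if (regs->ip == probe_addr)		/* no consumer redirect */
		regs->ip = probe_addr + 5;	/* continue past the JMP */
	/* returning 0 exits through the IRET path, which reloads all
	 * GPRs, including rcx/r11, from pt_regs */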
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
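
a minimal sketch of the lockless lookup, assuming a hypothetical per-mm
list head (mm->uprobes_state.tramps) and a list node in struct
uprobe_trampoline; only the vaddr member is from the patch:

	static struct uprobe_trampoline *find_tramp(struct mm_struct *mm,
						    unsigned long ip)
	{
		struct uprobe_trampoline *t, *found = NULL;

		rcu_read_lock();
		list_for_each_entry_rcu(t, &mm->uprobes_state.tramps, list) {
			if (ip - t->vaddr < PAGE_SIZE) {  /* ip inside page */
				found = t;
				break;
			}
		}
		rcu_read_unlock();
		return found;
	}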
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
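
the reuse rule could be sketched like this (256 slots per page as above,
slot_probe_addr[] is again the assumed per-mm table):

	static int tramp_get_slot(struct uprobe_trampoline *tramp,
				  unsigned long probe_addr)
	{
		int i, free = -1;

		for (i = 0; i < 256; i++) {
			if (tramp->slot_probe_addr[i] == probe_addr)
				return i;	/* same address probed again */
			if (free < 0 && !tramp->slot_probe_addr[i])
				free = i;
		}
		if (free >= 0)			/* bind the slot permanently */
			tramp->slot_probe_addr[free] = probe_addr;
		return free;			/* -1 -> fall back to INT3 */
	}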
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
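
concretely that means both ends of the page must be in rel32 range of the
end of the 5-byte JMP, roughly:

	static bool tramp_page_reachable(unsigned long page, unsigned long vaddr)
	{
		long lo = (long)(page - (vaddr + 5));
		long hi = (long)(page + PAGE_SIZE - 1 - (vaddr + 5));

		return lo >= INT_MIN && hi <= INT_MAX;
	}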
> > >
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
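
from user space the probe for this is cheap; a sketch, where the syscall
number 336 is taken from the trampoline listing above and should be
treated as an assumption on other kernels:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <unistd.h>

	static int uprobe_redzone_fixed(void)
	{
		/* must fail outside a kernel trampoline; 336 is assumed
		 * from the listing above */
		syscall(336);
		return errno == EPROTO;		/* ENXIO -> unfixed kernel */
	}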
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > >                  usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > > Skylake             3.149           6.422          4.865       -24.3%  39.1%
> > > Milan               2.910           3.443          3.820       +11.0%  24.3%
> > > Sapphire Rapids     1.896           4.023          3.693        -8.2%  24.9%
> > > Bergamo             3.393           3.895          3.849        -1.2%  24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput still scales nearly linearly with the number of
> > > CPUs, thanks to the lockless RCU-protected uprobe trampoline lookup.
> >
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we also discussed an option to use a 10-byte nop and do:
> > [rsp -= 0x80, call trampoline]
> >
> > we would not need the slot re-use logic, but not sure what other
> > surprises there are with a 10-byte nop
>
> Does this mean we have to update the USDT implementation?
it's the optimized uprobe code path, used for usdt probes that emit nop5
instead of a single nop
>
> >
> > I tried that change [1], it seems to work, but it has other
> > difficulties, like I think the unoptimized path needs to do:
> > [rsp -= 0x80, call trampoline] -> [jmp to end of 10-byte nop]
> > instead of patching back the 10-byte nop, because some thread
> > could be inside the nop area already.
>
> Yeah, but at that moment we know where the modified code is.
> Maybe a memory dump shows different code, but that is also true
> if a uprobe is active. So I think it is OK.
hmm, I'm not sure what you mean.. I attached the kernel change from my
branch, if you want to comment on top of that
the whole change including user space changes is in here:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=redzone_fix
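
for reference, with this change the 10-byte patch site ends up in one of
these states (lea/jmp bytes match the diff below, xx is the call
displacement, which depends on the trampoline address):

	optimized:    48 8d 64 24 80   lea  -0x80(%rsp),%rsp
	              e8 xx xx xx xx   call tramp
	unoptimized:  eb 08            jmp to the end of the 10-byte slot
	              (bytes 2-9 are left untouched, so a thread that already
	               passed the jmp still executes the intact call at +5)
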
jirka
---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index ebb1baf1eb1d..a6db7b76cb49 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -636,9 +636,20 @@ struct uprobe_trampoline {
unsigned long vaddr;
};
+static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
+
+#define LEA_INSN_SIZE 5
+#define UPROBE_OPT_INSN_SIZE (LEA_INSN_SIZE + CALL_INSN_SIZE)
+#define REDZONE_SIZE 0x80
+
+static bool is_lea_insn(const uprobe_opcode_t *insn)
+{
+ return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
+}
+
static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
{
- long delta = (long)(vaddr + 5 - vtramp);
+ long delta = (long)(vaddr + UPROBE_OPT_INSN_SIZE - vtramp);
return delta >= INT_MIN && delta <= INT_MAX;
}
@@ -651,7 +662,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
};
unsigned long low_limit, high_limit;
unsigned long low_tramp, high_tramp;
- unsigned long call_end = vaddr + 5;
+ unsigned long call_end = vaddr + UPROBE_OPT_INSN_SIZE;
if (check_add_overflow(call_end, INT_MIN, &low_limit))
low_limit = PAGE_SIZE;
@@ -826,8 +837,8 @@ SYSCALL_DEFINE0(uprobe)
regs->ax = args.ax;
regs->r11 = args.r11;
regs->cx = args.cx;
- regs->ip = args.retaddr - 5;
- regs->sp += sizeof(args);
+ regs->ip = args.retaddr - UPROBE_OPT_INSN_SIZE;
+ regs->sp += sizeof(args) + REDZONE_SIZE;
regs->orig_ax = -1;
sp = regs->sp;
@@ -844,12 +855,12 @@ SYSCALL_DEFINE0(uprobe)
*/
if (regs->sp != sp) {
/* skip the trampoline call */
- if (args.retaddr - 5 == regs->ip)
- regs->ip += 5;
+ if (args.retaddr - UPROBE_OPT_INSN_SIZE == regs->ip)
+ regs->ip += UPROBE_OPT_INSN_SIZE;
return regs->ax;
}
- regs->sp -= sizeof(args);
+ regs->sp -= sizeof(args) + REDZONE_SIZE;
/* for the case uprobe_consumer has changed ax/r11/cx */
args.ax = regs->ax;
@@ -857,7 +868,7 @@ SYSCALL_DEFINE0(uprobe)
args.cx = regs->cx;
/* keep return address unless we are instructed otherwise */
- if (args.retaddr - 5 != regs->ip)
+ if (args.retaddr - UPROBE_OPT_INSN_SIZE != regs->ip)
args.retaddr = regs->ip;
if (shstk_push(args.retaddr) == -EFAULT)
@@ -891,7 +902,7 @@ asm (
"pop %rax\n"
"pop %r11\n"
"pop %rcx\n"
- "ret\n"
+ "ret $0x80\n"
"int3\n"
".balign " __stringify(PAGE_SIZE) "\n"
".popsection\n"
@@ -930,9 +941,9 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
int nbytes, void *data)
{
struct write_opcode_ctx *ctx = data;
- uprobe_opcode_t old_opcode[5];
+ uprobe_opcode_t old_opcode[UPROBE_OPT_INSN_SIZE];
- uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+ uprobe_copy_from_page(page, ctx->base, old_opcode, UPROBE_OPT_INSN_SIZE);
switch (ctx->expect) {
case EXPECT_SWBP:
@@ -940,7 +951,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
return 1;
break;
case EXPECT_CALL:
- if (is_call_insn(&old_opcode[0]))
+ if (is_lea_insn(old_opcode))
return 1;
break;
}
@@ -963,7 +974,7 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
* - SMP sync all CPUs
*/
static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
- unsigned long vaddr, char *insn, bool optimize)
+ unsigned long vaddr, char *insn, int size, bool optimize)
{
uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
struct write_opcode_ctx ctx = {
@@ -990,7 +1001,7 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
/* Write all but the first byte of the patched range. */
ctx.expect = EXPECT_SWBP;
- err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+ err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, size - 1, verify_insn,
true /* is_register */, false /* do_update_ref_ctr */,
&ctx);
if (err)
@@ -1017,17 +1028,35 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr, unsigned long tramp)
{
- u8 call[5];
+ u8 insn[UPROBE_OPT_INSN_SIZE];
- __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
- (const void *) tramp, CALL_INSN_SIZE);
- return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+ /*
+ * We have nop10 (with first byte overwritten to int3),
+ * change it to:
+ * lea -0x80(%rsp), %rsp
+ * call tramp
+ *
+ * The first lea instruction skips the stack redzone so the call
+ * instruction can safely push return address on stack.
+ */
+ memcpy(insn, lea_rsp, LEA_INSN_SIZE);
+ __text_gen_insn(insn + LEA_INSN_SIZE, CALL_INSN_OPCODE,
+ (const void *)(vaddr + LEA_INSN_SIZE),
+ (const void *)tramp, CALL_INSN_SIZE);
+ return int3_update(auprobe, vma, vaddr, insn, UPROBE_OPT_INSN_SIZE, true /* optimize */);
}
static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
unsigned long vaddr)
{
- return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+ /*
+ * Write JMP rel8 to end of the 10-byte slot instead of restoring the
+ * original nop10, because a thread could already be executing inside
+ * the patched lea/call sequence.
+ */
+ u8 jmp[UPROBE_OPT_INSN_SIZE] = { 0xeb, UPROBE_OPT_INSN_SIZE - 2 };
+
+ return int3_update(auprobe, vma, vaddr, jmp, 2, false /* optimize */);
}
static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
@@ -1049,19 +1078,21 @@ static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
struct __packed __arch_relative_insn {
u8 op;
s32 raddr;
- } *call = (struct __arch_relative_insn *) insn;
+ } *call = (struct __arch_relative_insn *)(insn + LEA_INSN_SIZE);
- if (!is_call_insn(insn))
+ if (!is_lea_insn(insn))
return false;
- return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+ if (!is_call_insn((uprobe_opcode_t *) call))
+ return false;
+ return __in_uprobe_trampoline(vaddr + UPROBE_OPT_INSN_SIZE + call->raddr);
}
static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
{
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
int err;
- err = copy_from_vaddr(mm, vaddr, &insn, 5);
+ err = copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE);
if (err)
return err;
return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
@@ -1131,7 +1162,7 @@ static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct
void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
{
struct mm_struct *mm = current->mm;
- uprobe_opcode_t insn[5];
+ uprobe_opcode_t insn[UPROBE_OPT_INSN_SIZE];
if (!should_optimize(auprobe))
return;
@@ -1142,7 +1173,7 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
* Check if some other thread already optimized the uprobe for us,
* if it's the case just go away silently.
*/
- if (copy_from_vaddr(mm, vaddr, &insn, 5))
+ if (copy_from_vaddr(mm, vaddr, &insn, UPROBE_OPT_INSN_SIZE))
goto unlock;
if (!is_swbp_insn((uprobe_opcode_t*) &insn))
goto unlock;
@@ -1160,14 +1191,23 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
static bool can_optimize(struct insn *insn, unsigned long vaddr)
{
- if (!insn->x86_64 || insn->length != 5)
+ if (!insn->x86_64)
return false;
- if (!insn_is_nop(insn))
+ /* We can't do cross page atomic writes yet. */
+ if (PAGE_SIZE - (vaddr & ~PAGE_MASK) < UPROBE_OPT_INSN_SIZE)
return false;
- /* We can't do cross page atomic writes yet. */
- return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+ if (insn->length == UPROBE_OPT_INSN_SIZE && insn_is_nop(insn))
+ return true;
+
+ /* JMP rel8 to the end of the slot, written by swbp_unoptimize. */
+ if (insn->length == 2 &&
+ insn->opcode.bytes[0] == 0xEB &&
+ insn->immediate.value == UPROBE_OPT_INSN_SIZE - 2)
+ return true;
+
+ return false;
}
#else /* 32-bit: */
/*