Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization

From: Jiri Olsa <olsajiri@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Jiri Olsa <olsajiri@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	bpf@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	oleg@redhat.com, peterz@infradead.org, mingo@kernel.org,
	mhiramat@kernel.org
Subject: Re: [PATCH bpf 1/2] uprobes/x86: Fix red zone clobbering in nop5 optimization
Date: Tue, 12 May 2026 18:47:35 +0200
Message-ID: <agNZp62qZLMM9hsa@krava>
In-Reply-To: <CAEf4Bza9PjbaVjFxYDmWPXXGV+Z-_Hn2Kz_KB2TOa5s-_UJ1xA@mail.gmail.com>

On Mon, May 11, 2026 at 06:41:06PM +0200, Andrii Nakryiko wrote:
> On Sun, May 10, 2026 at 2:25 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, May 08, 2026 at 05:30:56PM -0700, Andrii Nakryiko wrote:
> > > The x86 uprobe nop5 optimization currently replaces a 5-byte NOP at the
> > > probe site with a CALL into a uprobe trampoline. CALL pushes a return
> > > address to [rsp-8]. On x86-64 this is inside the 128-byte red zone, where
> > > user code may keep temporary data without adjusting rsp.
> > >
> > > Use a 5-byte JMP instead. JMP does not write to the user stack, but it
> > > also does not provide a return address. Replace the single trampoline
> > > entry with a page of 16-byte slots. Each optimized probe jumps to its
> > > assigned slot, the slot moves rsp below the red zone, saves the registers
> > > clobbered by syscall, and invokes the uprobe syscall:
> > >
> > >   Probe site:   jmp slot_N              (5B, replaces nop5)
> > >
> > >   Slot N:       lea  -128(%rsp), %rsp   (5B)  skip red zone
> > >                 push %rcx               (1B)  save (syscall clobbers)
> > >                 push %r11               (2B)  save (syscall clobbers)
> > >                 push %rax               (1B)  save (syscall uses for nr)
> > >                 mov  $336, %eax         (5B)  uprobe syscall number
> > >                 syscall                 (2B)
> > >
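A minimal sketch of the slot layout listed above, not taken from the patch: the byte encodings are my own reading of the listed instructions, and `slot_code`, `fill_trampoline_page` and `TRAMP_PAGE_SIZE` are hypothetical names. It only illustrates that the six instructions add up to 16 bytes and that 256 identical copies fill one 4096-byte page.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE       16
#define SLOT_COUNT      256
#define TRAMP_PAGE_SIZE (SLOT_SIZE * SLOT_COUNT)  /* 4096 bytes */

/* One slot; sizes match the (5B)/(1B)/(2B)/(1B)/(5B)/(2B) column above. */
static const uint8_t slot_code[SLOT_SIZE] = {
    0x48, 0x8d, 0x64, 0x24, 0x80,  /* lea  -128(%rsp), %rsp  skip red zone */
    0x51,                          /* push %rcx                            */
    0x41, 0x53,                    /* push %r11                            */
    0x50,                          /* push %rax                            */
    0xb8, 0x50, 0x01, 0x00, 0x00,  /* mov  $336, %eax                      */
    0x0f, 0x05,                    /* syscall                              */
};

/* Hypothetical helper: fill a page-sized buffer with identical slots. */
static void fill_trampoline_page(uint8_t page[TRAMP_PAGE_SIZE])
{
    for (int i = 0; i < SLOT_COUNT; i++)
        memcpy(page + i * SLOT_SIZE, slot_code, SLOT_SIZE);
}
```

Because every slot holds the same bytes, only the offset within the page distinguishes one probe from another.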
> > > All slots contain identical code at different offsets, so the trampoline
> > > page is generated once at boot and mapped read-execute into each process.
> > > The syscall handler identifies the slot from regs->ip, which points just
> > > after the syscall instruction, and uses a per-mm slot table to recover the
> > > original probe address.
> > >
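The slot lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: `slot_from_ip` is a hypothetical name, and it assumes the syscall is the last instruction of a 16-byte slot, so the post-syscall IP equals the slot start plus 16.

```c
#include <stdint.h>

#define SLOT_SIZE  16u
#define SLOT_COUNT 256u

/*
 * Hypothetical helper: map the post-syscall IP to a slot index.
 * Returns -1 if the IP cannot be the end of a slot in this page.
 */
static int slot_from_ip(uint64_t tramp_base, uint64_t ip)
{
    if (ip <= tramp_base)
        return -1;

    uint64_t off = ip - tramp_base;

    /* The IP must sit exactly on a slot boundary, up to the page end. */
    if (off > (uint64_t)SLOT_SIZE * SLOT_COUNT || off % SLOT_SIZE != 0)
        return -1;

    /* The IP points just past slot N, i.e. at the start of slot N + 1. */
    return (int)(off / SLOT_SIZE) - 1;
}
```

The resulting index would then be used to look up the original probe address in the per-mm slot table.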
> > > The uprobe syscall does not return to the trampoline slot. The handler
> > > restores the probe-site register state, runs the uprobe consumers, sets
> > > pt_regs to continue at probe_addr + 5 unless a consumer redirected
> > > execution, and returns directly through the IRET path. This preserves
> > > general purpose registers, including rcx and r11, without requiring any
> > > post-syscall cleanup code in the trampoline and avoids call/ret, RSB, and
> > > shadow stack concerns.
> > >
> > > Protect the per-mm trampoline list with RCU and free trampoline metadata
> > > with kfree_rcu(). This lets the syscall path resolve trampoline slots
> > > without taking mmap_lock. The optimized-instruction detection path also
> > > walks the trampoline list under an RCU read-side lock. Since that path
> > > starts from the JMP target, it translates the slot start to the post-syscall
> > > IP expected by the shared resolver before checking the trampoline mapping.
> > >
> > > Each trampoline page provides 256 slots. Slots stay permanently assigned
> > > to their first probe address and are reused only when the same address is
> > > probed again. Reassigning detached slots is deliberately avoided because a
> > > thread can remain in a trampoline for an unbounded time due to ptrace,
> > > interrupts, or scheduling delays. If a reachable trampoline page runs out
> > > of slots, probes that cannot allocate a slot fall back to the slower INT3
> > > path.
> > >
> > > Require the entire trampoline page to be reachable by a rel32 JMP before
> > > reusing it for a probe. This keeps every slot in the page within the range
> > > that can be encoded at the probe site.
> > >
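A minimal sketch of the reachability rule above, assuming a 4096-byte trampoline page and the usual x86 convention that a rel32 displacement is relative to the end of the 5-byte JMP. The names `rel32_reachable` and `tramp_page_reachable` are hypothetical and not from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define JMP32_INSN_SIZE 5
#define TRAMP_PAGE_SIZE 4096

/* True if target can be encoded as a signed 32-bit offset from next_ip. */
static bool rel32_reachable(uint64_t next_ip, uint64_t target)
{
    int64_t delta = (int64_t)(target - next_ip);

    return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Require both the first and the last byte of the page to be in range. */
static bool tramp_page_reachable(uint64_t probe_addr, uint64_t page_start)
{
    uint64_t next_ip = probe_addr + JMP32_INSN_SIZE;

    return rel32_reachable(next_ip, page_start) &&
           rel32_reachable(next_ip, page_start + TRAMP_PAGE_SIZE - 1);
}
```

Checking both ends of the page means any slot assigned later is guaranteed to be encodable at the probe site, at the cost of rejecting a page whose start is in range but whose tail is not.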
> > > Change the error code returned when the uprobe syscall is invoked outside
> > > a kernel-generated trampoline from -ENXIO to -EPROTO. This lets libbpf and
> > > similar libraries distinguish fixed kernels from kernels with the
> > > red-zone-clobbering implementation and enable nop5 optimization only on
> > > fixed kernels.
> > >
> > > Performance (usdt single-thread, M/s):
> > >
> > >                   usdt-nop  usdt-nop5-base  usdt-nop5-fix  nop5-change  iret%
> > >   Skylake          3.149        6.422          4.865         -24.3%     39.1%
> > >   Milan            2.910        3.443          3.820         +11.0%     24.3%
> > >   Sapphire Rapids  1.896        4.023          3.693          -8.2%     24.9%
> > >   Bergamo          3.393        3.895          3.849          -1.2%     24.5%
> > >
> > > The fixed nop5 path remains faster than the non-optimized INT3 path on all
> > > measured systems. The regression relative to the old CALL-based trampoline
> > > comes from IRET being more expensive than SYSRET, most noticeably on older
> > > Intel Skylake. Newer Intel CPUs and tested AMD CPUs have lower IRET cost,
> > > and AMD Milan improves because removing mmap_lock from the hot path more
> > > than offsets the IRET cost.
> > >
> > > Multi-threaded throughput scales nearly linearly with the number of CPUs, like
> > > it used to, thanks to lockless RCU-protected uprobe trampoline lookup.
> >
> > hi,
> > thanks a lot for the fix
> >
> > FWIW we also discussed the option of using a 10-byte nop and doing:
> >   [rsp+0x80, call trampoline]
> >
> > we would not need the slot re-use logic, but I'm not sure what other
> > surprises there are with a 10-byte nop
> >
> > I tried that change [1] and it seems to work, but it has other
> > difficulties; for example, I think the unoptimized path needs to do:
> >   [rsp+0x80, call trampoline] -> [jmp end of 10-byte nop]
> > instead of patching back the 10-byte nop, because some thread
> > could already be inside the nop area.
> >
> 
> Yeah, nop10 and this jump-over-nop10 approach is an alternative. I
> don't have strong feelings apart from the ridiculousness of a 10-byte
> nop :)
> 
> did you get a chance to benchmark your nop10 approach? curious what
> the numbers look like

yes, it's the same as with the nop5

  base:
          usermode-count :  152.509 ± 0.044M/s
          syscall-count  :   15.177 ± 0.021M/s
          uprobe-nop     :    3.215 ± 0.002M/s
          uprobe-push    :    3.054 ± 0.003M/s
          uprobe-ret     :    1.100 ± 0.002M/s
          uprobe-nop5    :    7.251 ± 0.034M/s
          uretprobe-nop  :    2.149 ± 0.012M/s
          uretprobe-push :    2.088 ± 0.001M/s
          uretprobe-ret  :    0.960 ± 0.001M/s
          uretprobe-nop5 :    3.402 ± 0.001M/s
          usdt-nop       :    3.185 ± 0.024M/s
          usdt-nop5      :    7.378 ± 0.016M/s

  nop10:
          usermode-count :  152.503 ± 0.024M/s
          syscall-count  :   15.977 ± 0.047M/s
          uprobe-nop     :    3.174 ± 0.011M/s
          uprobe-push    :    3.030 ± 0.006M/s
          uprobe-ret     :    1.124 ± 0.004M/s
          uprobe-nop5    :    7.201 ± 0.012M/s
          uretprobe-nop  :    2.141 ± 0.005M/s
          uretprobe-push :    2.078 ± 0.007M/s
          uretprobe-ret  :    0.947 ± 0.003M/s
          uretprobe-nop5 :    3.384 ± 0.014M/s
          usdt-nop       :    3.247 ± 0.002M/s
          usdt-nop5      :    7.374 ± 0.027M/s
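
To put "the same" in numbers, here is a small sketch that computes the relative change between two throughput figures. `pct_change` is a hypothetical helper; the inputs in the note below are copied from the two tables above.

```c
/* Relative change of value against base, in percent. */
static double pct_change(double base, double value)
{
    return (value - base) / base * 100.0;
}
```

For example, `pct_change(7.251, 7.201)` for uprobe-nop5 gives roughly -0.7%, and `pct_change(7.378, 7.374)` for usdt-nop5 gives roughly -0.05%.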

jirka
