From: Jiri Olsa <olsajiri@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Jiri Olsa <olsajiri@gmail.com>, Oleg Nesterov <oleg@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
bpf@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
Date: Wed, 10 Jun 2026 10:18:51 +0200 [thread overview]
Message-ID: <aikd6-5HYhCnc0Ze@krava> (raw)
In-Reply-To: <CAEf4BzbsB-PRZRqkz37YK5wYNR=ajZv2N7tEznpZM0rdyYC_xA@mail.gmail.com>
On Tue, Jun 09, 2026 at 09:43:15AM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > > redzone area with call instruction storing return address on stack
> > > > where user code may keep temporary data without adjusting rsp.
> > > >
> > > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > > instruction, so we can squeeze another instruction to escape the
> > > > redzone area before doing the call, like:
> > > >
> > > > lea -0x80(%rsp), %rsp
> > > > call tramp
> > > >
> > > > Note the lea instruction is used to adjust the rsp register without
> > > > changing the flags.
> > > >
> > > > We use nop10 and following transformation to optimized instructions
> > > > above and back as suggested by Peterz [2].
> > > >
> > > > Optimize path (int3_update_optimize):
> > > >
> > > > 1) Initial state after set_swbp() installed the uprobe:
> > > > cc 2e 0f 1f 84 00 00 00 00 00
> > > >
> > > > From offset 0 this is INT3 followed by the tail of the original
> > > > 10-byte NOP.
> > > >
> > > > After a previous unoptimization bytes 5..9 may still contain the
> > > > old call instruction, which remains valid for threads already there.
> > > >
> > > > 2) Rewrite the LEA tail and call displacement:
> > > > cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > > >
> > > > From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > > > executable entry points while byte 0 is trapped.
> > > >
> > > > 3) Publish the first LEA byte:
> > > > [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > > >
> > > > From offset 0 this is:
> > > > lea -0x80(%rsp), %rsp
> > > > call <uprobe-trampoline>
> > > >
> > > > Unoptimize path (int3_update_unoptimize):
> > > >
> > > > 1) Initial optimized state:
> > > > 48 8d 64 24 80 e8 d0 d1 d2 d3
> > > > Same as 3) above.
> > > >
> > > > 2) Trap new entries before restoring the NOP bytes:
> > > > [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > > >
> > > > From offset 0 this traps. A thread that had already executed the
> > > > LEA can still reach the intact CALL at offset 5.
> > > >
> > > > 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > and byte 5 as CALL.
> > > > cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > >
> > > > From offset 0 this still traps. Offset 5 is still the CALL for any
> > > > thread that was already past the first LEA byte.
> > > >
> > > > 4) Publish the first byte of the original NOP:
> > > > [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > > >
> > > > From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > > displacement are now only NOP operands. Offset 5 still decodes as
> > > > CALL for a thread that was already there.
> > > >
> > > > Tthere is only a single target uprobe-trampoline for the given nop10
> > > > instruction address, so the CALL instruction will not be changed across
> > > > unoptimization/optimization cycles.
> > > > Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > > to observe that CALL and not anything else.
> > > >
> > > > Note as explained in [2] we need to use following nop10:
> > > > PF1 PF2 ESC NOPL MOD SIB DISP32
> > > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > > >
> > > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > > attribute in is_prefix_bad function.
> > > >
> > > > Also changing the uprobe syscall error when called out of uprobe
> > > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > > >
> > > > The optimized uprobe performance stays the same:
> > > >
> > > > uprobe-nop : 3.129 ± 0.013M/s
> > > > uprobe-push : 3.045 ± 0.006M/s
> > > > uprobe-ret : 1.095 ± 0.004M/s
> > > > --> uprobe-nop10 : 7.170 ± 0.020M/s
> > > > uretprobe-nop : 2.143 ± 0.021M/s
> > > > uretprobe-push : 2.090 ± 0.000M/s
> > > > uretprobe-ret : 0.942 ± 0.000M/s
> > > > --> uretprobe-nop10: 3.381 ± 0.003M/s
> > > > usdt-nop : 3.245 ± 0.004M/s
> > > > --> usdt-nop10 : 7.256 ± 0.023M/s
> > > >
> > > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > > Assisted-by: Codex:GPT-5.5
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > > arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > > 1 file changed, 190 insertions(+), 65 deletions(-)
> > > >
> > >
> > > [...]
> > >
> > > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > > smp_text_poke_sync_each_cpu();
> > > >
> > > > /*
> > > > - * Write first byte.
> > > > + * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > + * and byte 5 as CALL:
> > > > + * cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > + */
> > > > + ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > > + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > > + LEA_INSN_SIZE - 1, verify_insn,
> > > > + true /* is_register */, false /* do_update_ref_ctr */,
> > >
> > > tbh, it's quite subtle and non-obvious why is_register should be set
> > > to true first two times (and especially that is_register and
> > > do_update_ref_ctr are implicitly connected), not sure how to make it
> > > cleaner, but maybe leave a short comment explaining this twice
> > > register, once unregister sequence?
> >
> > ok, I came up with comment below
> >
> > thanks,
> > jirka
> >
> >
> > ---
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index de544516ea70..92449f34c005 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> > int err;
> >
> > /*
> > + * Note the first two uprobe_write calls use is_register=true, because they
> > + * are intermediate patching states while the probe is still active.
>
> this doesn't really explain why is_register=true is the right one. It
> actually doesn't matter as long as do_update_ref_ctr=true, isn't that
> right? So maybe just to avoid a bit of confusion let's pass
> is_register=false and do_update_ref_ctr=false, and in the comment
> explain as you said that it's intermediate update and we don't want to
> update refctr just yet until the very last step?
apart from refctr update there's also different way the concerned
page is managed, IIUC:
with is_register=true we force to get exclusive anonymous page for
the update (or pin the existing one)
with is_register=false we try to zap the private anonymous page and
return the mapping to the original page
there are several comments on this in uprobe_write/__uprobe_write
how about the update below
jirka
---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index de544516ea70..09f5ff71227c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1011,6 +1011,16 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
int err;
/*
+ * Note the first two uprobe_write calls use is_register=true, because they
+ * are intermediate patching states while the probe is still active, so
+ * we force the exclusive anonymous page for the update.
+ * Also we use do_update_ref_ctr=false because refctr was already updated by
+ * the initial int3 install.
+ *
+ * The last uprobe_write to nop10 instruction is called with is_register=false
+ * and do_update_ref_ctr=true to trigger the refctr update and to instruct
+ * uprobe_write to zap the anonymous page if it now matches the file page.
+ *
* 1) Initial optimized state:
* 48 8d 64 24 80 e8 d0 d1 d2 d3
*
next prev parent reply other threads:[~2026-06-10 8:18 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-26 20:58 [PATCHv4 00/13] uprobes/x86: Fix red zone issue for optimized uprobes Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 01/13] uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 02/13] uprobes/x86: Remove struct uprobe_trampoline object Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 9:58 ` Jiri Olsa
2026-06-01 8:31 ` Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 03/13] uprobes/x86: Allow to copy uprobe trampolines on fork Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 9:58 ` Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 04/13] uprobes/x86: Unmap trampoline vma object in case it's unused Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 9:57 ` Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10 Jiri Olsa
2026-06-08 20:46 ` Andrii Nakryiko
2026-06-09 11:44 ` Jiri Olsa
2026-06-09 16:43 ` Andrii Nakryiko
2026-06-10 8:18 ` Jiri Olsa [this message]
2026-05-26 20:58 ` [PATCHv4 06/13] libbpf: Change has_nop_combo to work on top of nop10 Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 9:57 ` Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 07/13] libbpf: Detect uprobe syscall with new error Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-26 20:58 ` [PATCHv4 08/13] selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-26 20:58 ` [PATCHv4 09/13] selftests/bpf: Change uprobe syscall tests to use nop10 Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 9:58 ` Jiri Olsa
2026-05-27 10:30 ` Jakub Sitnicki
2026-05-26 20:58 ` [PATCHv4 10/13] selftests/bpf: Change uprobe/usdt trigger bench code " Jiri Olsa
2026-05-27 10:46 ` Jakub Sitnicki
2026-05-26 20:58 ` [PATCHv4 11/13] selftests/bpf: Add reattach tests for uprobe syscall Jiri Olsa
2026-05-27 11:32 ` Jakub Sitnicki
2026-05-28 11:10 ` Jiri Olsa
2026-05-26 20:58 ` [PATCHv4 12/13] selftests/bpf: Add tests for uprobe nop10 red zone clobbering Jiri Olsa
2026-05-26 21:46 ` bot+bpf-ci
2026-05-27 10:26 ` Jiri Olsa
2026-05-28 12:46 ` Jakub Sitnicki
2026-05-26 20:58 ` [PATCHv4 13/13] selftests/bpf: Add tests for forked/cloned optimized uprobes Jiri Olsa
2026-05-28 13:00 ` Jakub Sitnicki
2026-06-01 8:31 ` Jiri Olsa
2026-06-04 6:59 ` [PATCHv4 00/13] uprobes/x86: Fix red zone issue for " Jiri Olsa
2026-06-08 20:48 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aikd6-5HYhCnc0Ze@krava \
--to=olsajiri@gmail.com \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=mhiramat@kernel.org \
--cc=mingo@kernel.org \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox