Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Menglong Dong <menglong.dong@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>, peterz@infradead.org
Cc: Menglong Dong <menglong8.dong@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	David Ahern <dsahern@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	jiang.biao@linux.dev, bpf <bpf@vger.kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
Date: Wed, 05 Nov 2025 15:13:42 +0800	[thread overview]
Message-ID: <4465519.ejJDZkT8p0@7950hx> (raw)
In-Reply-To: <CAADnVQKQXcUxjJ2uYNu1nvhFYt=KhN8QYAiGXrt_YwUsjMFOuA@mail.gmail.com>

On 2025/11/5 10:12, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > In origin call case, we skip the "rip" directly before we return, which
> > > > break the RSB, as we have twice "call", but only once "ret".
> > >
> > > RSB meaning return stack buffer?
> > >
> > > and by "breaks RSB" you mean it makes the cpu less efficient?
> >
> > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > for the branch predicting, and it will push the "rip" to its hardware
> > stack on "call", and pop it from the stack on "ret". In the origin
> > call case, there are twice "call" but once "ret", will break its
> > balance.
> 
> Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> since RSB has to be updated/invalidated by this store.
> The behavior depends on the microarchitecture, of course.
> I think:
> add rsp, 8
> ret
> will only screw up the return prediction, but won't invalidate RSB.
> 
> > Similar things happen in "return_to_handler" in ftrace_64.S,
> > which has once "call", but twice "ret". And it pretend a "call"
> > to make it balance.
> 
> This makes more sense to me. Let's try that approach instead
> of messing with the return address on stack?

The way here is similar to the "return_to_handler". For the ftrace,
the origin stack before the "ret" of the traced function is:

    POS:
    rip   ---> return_to_handler

And the exit of the traced function will jump to return_to_handler.
In return_to_handler, it will query the real "rip" of the traced function
and the it call a internal function:

    call .Ldo_rop

And the stack now is:

    POS:
    rip   ----> the address after "call .Ldo_rop", which is a "int3"

in the .Ldo_rop, it will modify the rip to the real rip to make
it like this:

    POS:
    rip   ---> real rip

And it return. Take the target function "foo" for example, the logic
of it is:

    call foo -> call ftrace_caller -> return ftrace_caller ->
    return return_to_handler -> call Ldo_rop -> return foo

As you can see, the call and return address for ".Ldo_rop" is
also messed up. So I think it works here too. Compared with
a messed "return address", a missed return maybe have
better influence?

And the whole logic for us is:

    call foo -> call trampoline -> call origin ->
    return origin -> return POS -> return foo

Following is the partial code of return_to_handler:

	/*
	 * Jump back to the old return address. This cannot be JMP_NOSPEC rdi
	 * since IBT would demand that contain ENDBR, which simply isn't so for
	 * return addresses. Use a retpoline here to keep the RSB balanced.
	 */
	ANNOTATE_INTRA_FUNCTION_CALL
	call .Ldo_rop
	int3
.Ldo_rop:
	mov %rdi, (%rsp)
	ALTERNATIVE __stringify(RET), \
		    __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
		    X86_FEATURE_CALL_DEPTH

> 
> > I were wandering why the overhead of fexit is much higher
> > than fentry. I added the percup-ref-get-and-put stuff to the
> > fentry, and the performance of it still can be 130M/s. However,
> > the fexit only has 76M/s. And the only difference is the origin
> > call.
> >
> > The RSB balancing mitigate it, but there are still gap. I
> > suspect it's still the branch predicting things.
> 
> If you have access to intel vtune profiler, check what is

Let me have a study on the "intel vtune profiler" stuff :)

I did a perf stat, and the branch miss increase seriously,
and the IPC(insn per cycle) decrease seriously.

> actually happening. It can show micro arch details.
> I don't think there is an open source alternative.
> 
> > > Or you mean call depth accounting that is done in sw ?
> > >
> > > > Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> > > > modify it to the address of a "ret" insn that we generate.
> > > >
> > > > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > > > optimize, the bench resulting of fexit is:
> > > >
> > > > fexit          :   76.494 ± 0.216M/s
> > > > fexit          :   76.319 ± 0.097M/s
> > > > fexit          :   70.680 ± 0.060M/s
> > > > fexit          :   75.509 ± 0.039M/s
> > > > fexit          :   76.392 ± 0.049M/s
> > > >
> > > > After this optimize:
> > > >
> > > > fexit          :   86.023 ± 0.518M/s
> > > > fexit          :   83.388 ± 0.021M/s
> > > > fexit          :   85.146 ± 0.058M/s
> > > > fexit          :   85.646 ± 0.136M/s
> > > > fexit          :   84.040 ± 0.045M/s
> > >
> > > This is with or without calldepth accounting?
> >
> > The CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
> > I did the testing with "mitigations=off" in the cmdline, so I guess
> > "without"?
> 
> Pls benchmark both. It sounds like call_depth_tracking
> miscounting stuff ?

Sadly, the performance decrease from 28M/s to 26M/s with
mitigation enabled. I think the addition "ret" increase the
overhead with "return-thunks" enabled.

Things is not as simple as I thought, let me do more testing,
and see if ftrace has similar performance decreasing.

Hi, @Peter. Do you have any advice on this ting?

Thanks!
Menglong Dong

> 
> > >
[......]
> >                            const u32 imm32_hi, const u32 imm32_lo)
> > {
> >         u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
> >         u8 *prog = *pprog;
> >
> >         if (is_uimm32(imm64)) {
> >                 /*
> >                  * For emitting plain u32, where sign bit must not be
> >                  * propagated LLVM tends to load imm64 over mov32
> >                  * directly, so save couple of bytes by just doing
> >                  * 'mov %eax, imm32' instead.
> >                  */
> >                 emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
> >         } else if (is_simm32(imm64)) {
> >                 emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
> >         } else {
> >                 /* movabsq rax, imm64 */
> >                 EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
> >                 EMIT(imm32_lo, 4);
> >                 EMIT(imm32_hi, 4);
> 
> This part could be factored out as a separate helper.
> Then sizeof(movabsq) will be constant.
> Note, in the verifier we do:
> #if defined(MODULES_VADDR)
>                         u64 addr = MODULES_VADDR;
> #else
>                         u64 addr = VMALLOC_START;
> #endif
>                         /* jit (e.g. x86_64) may emit fewer instructions
>                          * if it learns a u32 imm is the same as a u64 imm.
>                          * Set close enough to possible prog address.
>                          */
>                         insn[0].imm = (u32)addr;
>                         insn[1].imm = addr >> 32;
> 
> do mitigate this issue.
> So you could have done:
> emit_mov_imm64(&prog, BPF_REG_0, VMALLOC_START >> 32, 0);
> 
> since 'ret_addr' math is incorrect at that point anyway.
> But prog += sizeof is imo cleaner.

Great! This make things much more clear. After I figure out all
the stuff, (which I'm not sure if I'm able), I'll do this way.

Thanks!
Menglong Dong

> The whole thing might not be necessary with extra call approach.
> I suspect it should be faster than this approach.
>

next prev parent reply	other threads:[~2025-11-05  7:14 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-04 10:49 [PATCH bpf-next] bpf,x86: do RSB balance for trampoline Menglong Dong
2025-11-04 18:56 ` Alexei Starovoitov
2025-11-05  1:30   ` Menglong Dong
2025-11-05  2:12     ` Alexei Starovoitov
2025-11-05  7:13       ` Menglong Dong [this message]
2025-11-05  7:46         ` Menglong Dong
2025-11-05 23:31           ` Alexei Starovoitov
2025-11-06  1:40             ` Menglong Dong
2025-11-06  2:49               ` Menglong Dong
2025-11-06  2:56                 ` Alexei Starovoitov
2025-11-06  3:00                   ` Menglong Dong
2025-11-10 11:43                   ` Menglong Dong
2025-11-10 16:32                     ` Alexei Starovoitov
2025-11-11  1:28                       ` Menglong Dong
2025-11-11  2:41                         ` Alexei Starovoitov
2025-11-06 12:03           ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4465519.ejJDZkT8p0@7950hx \
    --to=menglong.dong@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=haoluo@google.com \
    --cc=hpa@zytor.com \
    --cc=jiang.biao@linux.dev \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=menglong8.dong@gmail.com \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.