From: Menglong Dong <menglong.dong@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>, peterz@infradead.org
Cc: Menglong Dong <menglong8.dong@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
David Ahern <dsahern@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
jiang.biao@linux.dev, bpf <bpf@vger.kernel.org>,
Network Development <netdev@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
Date: Wed, 05 Nov 2025 15:13:42 +0800
Message-ID: <4465519.ejJDZkT8p0@7950hx>
In-Reply-To: <CAADnVQKQXcUxjJ2uYNu1nvhFYt=KhN8QYAiGXrt_YwUsjMFOuA@mail.gmail.com>
On 2025/11/5 10:12, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > In the origin call case, we skip the "rip" directly before we return,
> > > > which breaks the RSB, as we have two "call"s but only one "ret".
> > >
> > > RSB meaning return stack buffer?
> > >
> > > and by "breaks RSB" you mean it makes the cpu less efficient?
> >
> > Yeah, I mean it makes the CPU less efficient. The RSB is used
> > for branch prediction: it pushes the "rip" onto its hardware
> > stack on "call" and pops it from the stack on "ret". In the origin
> > call case, there are two "call"s but only one "ret", which breaks
> > its balance.
>
> Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> since RSB has to be updated/invalidated by this store.
> The behavior depends on the microarchitecture, of course.
> I think:
> add rsp, 8
> ret
> will only screw up the return prediction, but won't invalidate RSB.
>
> > A similar thing happens in "return_to_handler" in ftrace_64.S,
> > which has one "call" but two "ret"s, and it fakes a "call"
> > to keep things balanced.
>
> This makes more sense to me. Let's try that approach instead
> of messing with the return address on stack?
The approach here is similar to "return_to_handler". For ftrace, the
stack just before the "ret" of the traced function is:

POS:
rip ---> return_to_handler

so the exit of the traced function jumps to return_to_handler.
return_to_handler looks up the real "rip" of the traced function and
then calls an internal label:

call .Ldo_rop

At that point the stack is:

POS:
rip ----> the address after "call .Ldo_rop", which is an "int3"

Inside .Ldo_rop, that rip slot is overwritten with the real rip, so it
becomes:

POS:
rip ---> real rip

and then it returns. Taking a target function "foo" as an example, the
flow is:

call foo -> call ftrace_caller -> return ftrace_caller ->
return return_to_handler -> call .Ldo_rop -> return foo

As you can see, the call and return addresses for ".Ldo_rop" are also
mismatched, so I think the same trick works here too. Compared with a
messed-up return address, maybe a missed return has a better influence?

And the whole flow for us is:

call foo -> call trampoline -> call origin ->
return origin -> return POS -> return foo
The following is the relevant part of return_to_handler:
/*
* Jump back to the old return address. This cannot be JMP_NOSPEC rdi
* since IBT would demand that contain ENDBR, which simply isn't so for
* return addresses. Use a retpoline here to keep the RSB balanced.
*/
ANNOTATE_INTRA_FUNCTION_CALL
call .Ldo_rop
int3
.Ldo_rop:
mov %rdi, (%rsp)
ALTERNATIVE __stringify(RET), \
__stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
X86_FEATURE_CALL_DEPTH
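To make that concrete, here is a rough sketch (not the actual patch: the
register choice, the [rbp + 8] slot and the ret_addr computation are
just my assumptions for illustration) of what the trampoline epilogue
could emit for the pseudo "ret", in the JIT's C style:

	/*
	 * Rough sketch of the "pseudo ret" epilogue. ret_addr is assumed to
	 * be the precomputed address of the final "ret" below (POS); the
	 * reload of the saved return value into rax is omitted here and
	 * assumed to happen after the store to [rbp + 8].
	 */
	/* movabs rax, ret_addr */
	emit_mov_imm64(&prog, BPF_REG_0, (u32)(ret_addr >> 32), (u32)ret_addr);
	/* mov [rbp + 8], rax: redirect the to-be-skipped return slot to POS */
	EMIT4(0x48, 0x89, 0x45, 0x08);
	/* leave */
	EMIT1(0xC9);
	/* ret: pops the redirected slot and lands on POS below */
	EMIT1(0xC3);
	/* POS: the extra "ret" that finally returns to foo's caller */
	EMIT1(0xC3);

This keeps the call/ret counts balanced, at the cost of one mispredicted
return target on the redirected slot, which is the trade-off discussed
above.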
>
> > I was wondering why the overhead of fexit is so much higher
> > than fentry. I added the percpu-ref get-and-put stuff to
> > fentry, and its performance can still reach 130M/s. However,
> > fexit only reaches 76M/s, and the only difference is the origin
> > call.
> >
> > The RSB balancing mitigates it, but there is still a gap. I
> > suspect it's still the branch prediction thing.
>
> If you have access to intel vtune profiler, check what is
> actually happening. It can show micro arch details.
> I don't think there is an open source alternative.

Let me have a look at the "intel vtune profiler" stuff :)

I did a perf stat, and the branch misses increase significantly,
and the IPC (instructions per cycle) decreases significantly.
>
> > > Or you mean call depth accounting that is done in sw ?
> > >
> > > > Do the RSB balancing with a pseudo "ret". Instead of skipping the "rip",
> > > > we modify it to the address of a "ret" insn that we generate.
> > > >
> > > > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > > > optimization, the bench results for fexit are:
> > > >
> > > > fexit : 76.494 ± 0.216M/s
> > > > fexit : 76.319 ± 0.097M/s
> > > > fexit : 70.680 ± 0.060M/s
> > > > fexit : 75.509 ± 0.039M/s
> > > > fexit : 76.392 ± 0.049M/s
> > > >
> > > > After this optimization:
> > > >
> > > > fexit : 86.023 ± 0.518M/s
> > > > fexit : 83.388 ± 0.021M/s
> > > > fexit : 85.146 ± 0.058M/s
> > > > fexit : 85.646 ± 0.136M/s
> > > > fexit : 84.040 ± 0.045M/s
> > >
> > > This is with or without calldepth accounting?
> >
> > CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
> > I did the testing with "mitigations=off" on the cmdline, so I guess
> > "without"?
>
> Pls benchmark both. It sounds like call_depth_tracking
> miscounting stuff?
Sadly, the performance decreases from 28M/s to 26M/s with
mitigations enabled. I think the additional "ret" increases the
overhead when "return-thunks" is enabled.

Things are not as simple as I thought; let me do more testing
and see if ftrace shows a similar performance decrease.

Hi, @Peter. Do you have any advice on this?
Thanks!
Menglong Dong
>
> > >
[......]
> > const u32 imm32_hi, const u32 imm32_lo)
> > {
> > u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
> > u8 *prog = *pprog;
> >
> > if (is_uimm32(imm64)) {
> > /*
> > * For emitting plain u32, where sign bit must not be
> > * propagated LLVM tends to load imm64 over mov32
> > * directly, so save couple of bytes by just doing
> > * 'mov %eax, imm32' instead.
> > */
> > emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
> > } else if (is_simm32(imm64)) {
> > emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
> > } else {
> > /* movabsq rax, imm64 */
> > EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
> > EMIT(imm32_lo, 4);
> > EMIT(imm32_hi, 4);
>
> This part could be factored out as a separate helper.
> Then sizeof(movabsq) will be constant.
> Note, in the verifier we do:
> #if defined(MODULES_VADDR)
> u64 addr = MODULES_VADDR;
> #else
> u64 addr = VMALLOC_START;
> #endif
> /* jit (e.g. x86_64) may emit fewer instructions
> * if it learns a u32 imm is the same as a u64 imm.
> * Set close enough to possible prog address.
> */
> insn[0].imm = (u32)addr;
> insn[1].imm = addr >> 32;
>
> to mitigate this issue.
> So you could have done:
> emit_mov_imm64(&prog, BPF_REG_0, VMALLOC_START >> 32, 0);
>
> since 'ret_addr' math is incorrect at that point anyway.
> But prog += sizeof is imo cleaner.
Great! This makes things much clearer. After I figure out all
the stuff (which I'm not sure I'm able to), I'll do it this way.
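To make sure I understood, a minimal sketch of the factored-out helper
(the name emit_movabsq is just my placeholder), so that its emitted
size is a known constant:

static void emit_movabsq(u8 **pprog, u32 dst_reg, const u32 imm32_hi,
			 const u32 imm32_lo)
{
	u8 *prog = *pprog;

	/* movabsq dst_reg, imm64: always 10 bytes (REX + opcode + imm64) */
	EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
	EMIT(imm32_lo, 4);
	EMIT(imm32_hi, 4);

	*pprog = prog;
}

emit_mov_imm64() would then call it in its "else" branch, and the
trampoline could call it directly wherever a fixed-size encoding is
needed for the ret_addr math.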
Thanks!
Menglong Dong
> The whole thing might not be necessary with the extra-call approach.
> I suspect it should be faster than this approach.
>