netdev.vger.kernel.org archive mirror
* [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
@ 2025-11-04 10:49 Menglong Dong
  2025-11-04 18:56 ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-04 10:49 UTC (permalink / raw)
  To: ast
  Cc: daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, davem, dsahern, tglx,
	mingo, bp, dave.hansen, x86, hpa, jiang.biao, menglong.dong, bpf,
	netdev, linux-kernel

In the origin call case, we skip the "rip" directly before we return, which
breaks the RSB, as we have two "call"s but only one "ret".

Balance the RSB with a pseudo "ret": instead of skipping the "rip", we
rewrite it to the address of a "ret" insn that we generate.

The performance of "fexit" increases from 76M/s to 84M/s. Before this
optimization, the bench results for fexit are:

fexit          :   76.494 ± 0.216M/s
fexit          :   76.319 ± 0.097M/s
fexit          :   70.680 ± 0.060M/s
fexit          :   75.509 ± 0.039M/s
fexit          :   76.392 ± 0.049M/s

After this optimization:

fexit          :   86.023 ± 0.518M/s
fexit          :   83.388 ± 0.021M/s
fexit          :   85.146 ± 0.058M/s
fexit          :   85.646 ± 0.136M/s
fexit          :   84.040 ± 0.045M/s

Things become a little more complex; I am not sure the benefits are worth it :/

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
 arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d4c93d9e73e4..a9c2142a84d0 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
 	void *orig_call = func_addr;
 	u8 **branches = NULL;
+	u8 *rsb_pos;
 	u8 *prog;
 	bool save_ret;
 
@@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
 	}
 
+	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
+		u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
+
+		rsb_pos = prog;
+		/*
+		 * Reserve room to load the return address into rax:
+		 *   movabs rax, imm64
+		 *
+		 * This is used to keep the RSB balanced. In the SKIP_FRAME
+		 * case, we "call" twice but only have one "ret", which
+		 * breaks the RSB.
+		 *
+		 * Therefore, instead of skipping the "rip", we turn it into
+		 * a pseudo return: modify the "rip" on the stack to point
+		 * to the second "ret" that we build below.
+		 */
+		emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
+		/* mov [rbp + 8], rax */
+		EMIT4(0x48, 0x89, 0x45, 0x08);
+	}
+
 	/* restore return value of orig_call or fentry prog back into RAX */
 	if (save_ret)
 		emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
 
 	emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
 	EMIT1(0xC9); /* leave */
+	emit_return(&prog, image + (prog - (u8 *)rw_image));
 	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
-		/* skip our return address and return to parent */
-		EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
+		u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
+
+		/* patch the placeholder with the second return address */
+		emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
+		/* this is the second (real) return */
+		emit_return(&prog, image + (prog - (u8 *)rw_image));
 	}
-	emit_return(&prog, image + (prog - (u8 *)rw_image));
 	/* Make sure the trampoline generation logic doesn't overflow */
 	if (WARN_ON_ONCE(prog > (u8 *)rw_image_end - BPF_INSN_SAFETY)) {
 		ret = -EFAULT;
-- 
2.51.2



* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-04 10:49 [PATCH bpf-next] bpf,x86: do RSB balance for trampoline Menglong Dong
@ 2025-11-04 18:56 ` Alexei Starovoitov
  2025-11-05  1:30   ` Menglong Dong
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-04 18:56 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, Menglong Dong,
	bpf, Network Development, LKML

On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> In origin call case, we skip the "rip" directly before we return, which
> break the RSB, as we have twice "call", but only once "ret".

RSB meaning return stack buffer?

and by "breaks RSB" you mean it makes the cpu less efficient?
Or you mean call depth accounting that is done in sw ?

> Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> modify it to the address of a "ret" insn that we generate.
>
> The performance of "fexit" increases from 76M/s to 84M/s. Before this
> optimize, the bench resulting of fexit is:
>
> fexit          :   76.494 ± 0.216M/s
> fexit          :   76.319 ± 0.097M/s
> fexit          :   70.680 ± 0.060M/s
> fexit          :   75.509 ± 0.039M/s
> fexit          :   76.392 ± 0.049M/s
>
> After this optimize:
>
> fexit          :   86.023 ± 0.518M/s
> fexit          :   83.388 ± 0.021M/s
> fexit          :   85.146 ± 0.058M/s
> fexit          :   85.646 ± 0.136M/s
> fexit          :   84.040 ± 0.045M/s

This is with or without calldepth accounting?

> Things become a little more complex, not sure if the benefits worth it :/
>
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> ---
>  arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d4c93d9e73e4..a9c2142a84d0 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>         struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>         void *orig_call = func_addr;
>         u8 **branches = NULL;
> +       u8 *rsb_pos;
>         u8 *prog;
>         bool save_ret;
>
> @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>                 LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
>         }
>
> +       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> +
> +               rsb_pos = prog;
> +               /*
> +                * reserve the room to save the return address to rax:
> +                *   movabs rax, imm64
> +                *
> +                * this is used to do the RSB balance. For the SKIP_FRAME
> +                * case, we do the "call" twice, but only have one "ret",
> +                * which can break the RSB.
> +                *
> +                * Therefore, instead of skipping the "rip", we make it as
> +                * a pseudo return: modify the "rip" in the stack to the
> +                * second "ret" address that we build bellow.
> +                */
> +               emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> +               /* mov [rbp + 8], rax */
> +               EMIT4(0x48, 0x89, 0x45, 0x08);
> +       }
> +
>         /* restore return value of orig_call or fentry prog back into RAX */
>         if (save_ret)
>                 emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
>
>         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
>         EMIT1(0xC9); /* leave */
> +       emit_return(&prog, image + (prog - (u8 *)rw_image));
>         if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> -               /* skip our return address and return to parent */
> -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> +
> +               /* fix the return address to second return address */
> +               emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);

So the first "movabs rax, imm64" is not needed ?
Why compute ret_addr there and everything ?
I mean it could have been prog += sizeof(movabs), right?

> +               /* this is the second(real) return */
> +               emit_return(&prog, image + (prog - (u8 *)rw_image));
>         }
> -       emit_return(&prog, image + (prog - (u8 *)rw_image));


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-04 18:56 ` Alexei Starovoitov
@ 2025-11-05  1:30   ` Menglong Dong
  2025-11-05  2:12     ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-05  1:30 UTC (permalink / raw)
  To: Menglong Dong, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/5 02:56, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > In origin call case, we skip the "rip" directly before we return, which
> > break the RSB, as we have twice "call", but only once "ret".
> 
> RSB meaning return stack buffer?
> 
> and by "breaks RSB" you mean it makes the cpu less efficient?

Yeah, I mean it makes the CPU less efficient. The RSB is used
for branch prediction: it pushes the "rip" onto its hardware stack
on "call" and pops it from the stack on "ret". In the origin call
case, there are two "call"s but only one "ret", which breaks its
balance.
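
Concretely, the imbalance comes from the SKIP_FRAME epilogue that
__arch_prepare_bpf_trampoline() currently emits (a trimmed sketch based
on the lines the patch above removes and moves, shown here for reference):

	emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
	EMIT1(0xC9); /* leave */
	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
		/* the patched fentry did "call trampoline" and the
		 * trampoline did "call orig", so two return addresses
		 * were pushed; ours is dropped here instead of being
		 * consumed by a matching "ret"
		 */
		EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
	}
	emit_return(&prog, image + (prog - (u8 *)rw_image)); /* single ret */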

A similar thing happens in "return_to_handler" in ftrace_64.S,
which has one "call" but two "ret"s, so it pretends a "call"
to keep things balanced.

I was wondering why the overhead of fexit is much higher
than that of fentry. I added the percpu-ref get-and-put stuff to
fentry, and its performance can still reach 130M/s. However,
fexit only reaches 76M/s, and the only difference is the origin
call.

The RSB balancing mitigates it, but there is still a gap. I
suspect it's still a branch prediction issue.

> Or you mean call depth accounting that is done in sw ?
> 
> > Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> > modify it to the address of a "ret" insn that we generate.
> >
> > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > optimize, the bench resulting of fexit is:
> >
> > fexit          :   76.494 ± 0.216M/s
> > fexit          :   76.319 ± 0.097M/s
> > fexit          :   70.680 ± 0.060M/s
> > fexit          :   75.509 ± 0.039M/s
> > fexit          :   76.392 ± 0.049M/s
> >
> > After this optimize:
> >
> > fexit          :   86.023 ± 0.518M/s
> > fexit          :   83.388 ± 0.021M/s
> > fexit          :   85.146 ± 0.058M/s
> > fexit          :   85.646 ± 0.136M/s
> > fexit          :   84.040 ± 0.045M/s
> 
> This is with or without calldepth accounting?

CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
I did the testing with "mitigations=off" on the cmdline, so I guess
"without"?

> 
> > Things become a little more complex, not sure if the benefits worth it :/
> >
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
> >  1 file changed, 29 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index d4c93d9e73e4..a9c2142a84d0 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> >         struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> >         void *orig_call = func_addr;
> >         u8 **branches = NULL;
> > +       u8 *rsb_pos;
> >         u8 *prog;
> >         bool save_ret;
> >
> > @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> >                 LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
> >         }
> >
> > +       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > +
> > +               rsb_pos = prog;
> > +               /*
> > +                * reserve the room to save the return address to rax:
> > +                *   movabs rax, imm64
> > +                *
> > +                * this is used to do the RSB balance. For the SKIP_FRAME
> > +                * case, we do the "call" twice, but only have one "ret",
> > +                * which can break the RSB.
> > +                *
> > +                * Therefore, instead of skipping the "rip", we make it as
> > +                * a pseudo return: modify the "rip" in the stack to the
> > +                * second "ret" address that we build bellow.
> > +                */
> > +               emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> > +               /* mov [rbp + 8], rax */
> > +               EMIT4(0x48, 0x89, 0x45, 0x08);
> > +       }
> > +
> >         /* restore return value of orig_call or fentry prog back into RAX */
> >         if (save_ret)
> >                 emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
> >
> >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> >         EMIT1(0xC9); /* leave */
> > +       emit_return(&prog, image + (prog - (u8 *)rw_image));
> >         if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > -               /* skip our return address and return to parent */
> > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > +
> > +               /* fix the return address to second return address */
> > +               emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> 
> So the first "movabs rax, imm64" is not needed ?
> Why compute ret_addr there and everything ?
> I mean it could have been prog += sizeof(movabs), right?

I did that before, but the thing is that the "sizeof(movabs)" is not
fixed, according to the definition of emit_mov_imm64():

static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
			   const u32 imm32_hi, const u32 imm32_lo)
{
	u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
	u8 *prog = *pprog;

	if (is_uimm32(imm64)) {
		/*
		 * For emitting plain u32, where sign bit must not be
		 * propagated LLVM tends to load imm64 over mov32
		 * directly, so save couple of bytes by just doing
		 * 'mov %eax, imm32' instead.
		 */
		emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
	} else if (is_simm32(imm64)) {
		emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
	} else {
		/* movabsq rax, imm64 */
		EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
		EMIT(imm32_lo, 4);
		EMIT(imm32_hi, 4);
	}

	*pprog = prog;
}

I used "emit_mov_imm64(&prog, BPF_REG_0, 0, 0)" as a placeholder,
but that failed, as the insn length is totally different from that of
"emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);".

It's a little confusing here; I have not figured out a better way :/

Thanks!
Menglong Dong

> 
> > +               /* this is the second(real) return */
> > +               emit_return(&prog, image + (prog - (u8 *)rw_image));
> >         }
> > -       emit_return(&prog, image + (prog - (u8 *)rw_image));
> 

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05  1:30   ` Menglong Dong
@ 2025-11-05  2:12     ` Alexei Starovoitov
  2025-11-05  7:13       ` Menglong Dong
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-05  2:12 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Menglong Dong, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	jiang.biao, bpf, Network Development, LKML

On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
>
> On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > In origin call case, we skip the "rip" directly before we return, which
> > > break the RSB, as we have twice "call", but only once "ret".
> >
> > RSB meaning return stack buffer?
> >
> > and by "breaks RSB" you mean it makes the cpu less efficient?
>
> Yeah, I mean it makes the cpu less efficient. The RSB is used
> for the branch predicting, and it will push the "rip" to its hardware
> stack on "call", and pop it from the stack on "ret". In the origin
> call case, there are twice "call" but once "ret", will break its
> balance.

Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
since RSB has to be updated/invalidated by this store.
The behavior depends on the microarchitecture, of course.
I think:
add rsp, 8
ret
will only screw up the return prediction, but won't invalidate RSB.

> Similar things happen in "return_to_handler" in ftrace_64.S,
> which has once "call", but twice "ret". And it pretend a "call"
> to make it balance.

This makes more sense to me. Let's try that approach instead
of messing with the return address on stack?

> I were wandering why the overhead of fexit is much higher
> than fentry. I added the percup-ref-get-and-put stuff to the
> fentry, and the performance of it still can be 130M/s. However,
> the fexit only has 76M/s. And the only difference is the origin
> call.
>
> The RSB balancing mitigate it, but there are still gap. I
> suspect it's still the branch predicting things.

If you have access to the Intel VTune profiler, check what is
actually happening. It can show microarch details.
I don't think there is an open source alternative.

> > Or you mean call depth accounting that is done in sw ?
> >
> > > Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> > > modify it to the address of a "ret" insn that we generate.
> > >
> > > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > > optimize, the bench resulting of fexit is:
> > >
> > > fexit          :   76.494 ± 0.216M/s
> > > fexit          :   76.319 ± 0.097M/s
> > > fexit          :   70.680 ± 0.060M/s
> > > fexit          :   75.509 ± 0.039M/s
> > > fexit          :   76.392 ± 0.049M/s
> > >
> > > After this optimize:
> > >
> > > fexit          :   86.023 ± 0.518M/s
> > > fexit          :   83.388 ± 0.021M/s
> > > fexit          :   85.146 ± 0.058M/s
> > > fexit          :   85.646 ± 0.136M/s
> > > fexit          :   84.040 ± 0.045M/s
> >
> > This is with or without calldepth accounting?
>
> The CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
> I did the testing with "mitigations=off" in the cmdline, so I guess
> "without"?

Pls benchmark both. It sounds like call_depth_tracking
miscounting stuff ?

> >
> > > Things become a little more complex, not sure if the benefits worth it :/
> > >
> > > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 32 +++++++++++++++++++++++++++++---
> > >  1 file changed, 29 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index d4c93d9e73e4..a9c2142a84d0 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -3185,6 +3185,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > >         struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> > >         void *orig_call = func_addr;
> > >         u8 **branches = NULL;
> > > +       u8 *rsb_pos;
> > >         u8 *prog;
> > >         bool save_ret;
> > >
> > > @@ -3431,17 +3432,42 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
> > >                 LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
> > >         }
> > >
> > > +       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > > +
> > > +               rsb_pos = prog;
> > > +               /*
> > > +                * reserve the room to save the return address to rax:
> > > +                *   movabs rax, imm64
> > > +                *
> > > +                * this is used to do the RSB balance. For the SKIP_FRAME
> > > +                * case, we do the "call" twice, but only have one "ret",
> > > +                * which can break the RSB.
> > > +                *
> > > +                * Therefore, instead of skipping the "rip", we make it as
> > > +                * a pseudo return: modify the "rip" in the stack to the
> > > +                * second "ret" address that we build bellow.
> > > +                */
> > > +               emit_mov_imm64(&prog, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> > > +               /* mov [rbp + 8], rax */
> > > +               EMIT4(0x48, 0x89, 0x45, 0x08);
> > > +       }
> > > +
> > >         /* restore return value of orig_call or fentry prog back into RAX */
> > >         if (save_ret)
> > >                 emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
> > >
> > >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > >         EMIT1(0xC9); /* leave */
> > > +       emit_return(&prog, image + (prog - (u8 *)rw_image));
> > >         if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > -               /* skip our return address and return to parent */
> > > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > +               u64 ret_addr = (u64)(image + (prog - (u8 *)rw_image));
> > > +
> > > +               /* fix the return address to second return address */
> > > +               emit_mov_imm64(&rsb_pos, BPF_REG_0, ret_addr >> 32, (u32)ret_addr);
> >
> > So the first "movabs rax, imm64" is not needed ?
> > Why compute ret_addr there and everything ?
> > I mean it could have been prog += sizeof(movabs), right?
>
> I did it before, but the thing is that the "sizeof(movabs)" in not
> fixed according to the definition of emit_mov_imm64():
>
> static void emit_mov_imm64(u8 **pprog, u32 dst_reg,
>                            const u32 imm32_hi, const u32 imm32_lo)
> {
>         u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
>         u8 *prog = *pprog;
>
>         if (is_uimm32(imm64)) {
>                 /*
>                  * For emitting plain u32, where sign bit must not be
>                  * propagated LLVM tends to load imm64 over mov32
>                  * directly, so save couple of bytes by just doing
>                  * 'mov %eax, imm32' instead.
>                  */
>                 emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
>         } else if (is_simm32(imm64)) {
>                 emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
>         } else {
>                 /* movabsq rax, imm64 */
>                 EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
>                 EMIT(imm32_lo, 4);
>                 EMIT(imm32_hi, 4);

This part could be factored out as a separate helper.
Then sizeof(movabsq) will be constant.
Note, in the verifier we do:
#if defined(MODULES_VADDR)
                        u64 addr = MODULES_VADDR;
#else
                        u64 addr = VMALLOC_START;
#endif
                        /* jit (e.g. x86_64) may emit fewer instructions
                         * if it learns a u32 imm is the same as a u64 imm.
                         * Set close enough to possible prog address.
                         */
                        insn[0].imm = (u32)addr;
                        insn[1].imm = addr >> 32;

to mitigate this issue.
So you could have done:
emit_mov_imm64(&prog, BPF_REG_0, VMALLOC_START >> 32, 0);

since 'ret_addr' math is incorrect at that point anyway.
But prog += sizeof is imo cleaner.
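
For what it's worth, a minimal sketch of such a helper (the name
emit_mov_imm64_abs is made up here; the body just lifts the movabsq
branch of emit_mov_imm64() quoted above, so the encoding is always the
10-byte form and a fixed-size slot can be reserved and patched later):

static void emit_mov_imm64_abs(u8 **pprog, u32 dst_reg,
			       const u32 imm32_hi, const u32 imm32_lo)
{
	u8 *prog = *pprog;

	/* always emit "movabsq dst_reg, imm64": REX + opcode, then 8 imm bytes */
	EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
	EMIT(imm32_lo, 4);
	EMIT(imm32_hi, 4);

	*pprog = prog;
}
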
The whole thing might not be necessary with the extra-call approach.
I suspect it should be faster than this approach.


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05  2:12     ` Alexei Starovoitov
@ 2025-11-05  7:13       ` Menglong Dong
  2025-11-05  7:46         ` Menglong Dong
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-05  7:13 UTC (permalink / raw)
  To: Alexei Starovoitov, peterz
  Cc: Menglong Dong, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	jiang.biao, bpf, Network Development, LKML

On 2025/11/5 10:12, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > In origin call case, we skip the "rip" directly before we return, which
> > > > break the RSB, as we have twice "call", but only once "ret".
> > >
> > > RSB meaning return stack buffer?
> > >
> > > and by "breaks RSB" you mean it makes the cpu less efficient?
> >
> > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > for the branch predicting, and it will push the "rip" to its hardware
> > stack on "call", and pop it from the stack on "ret". In the origin
> > call case, there are twice "call" but once "ret", will break its
> > balance.
> 
> Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> since RSB has to be updated/invalidated by this store.
> The behavior depends on the microarchitecture, of course.
> I think:
> add rsp, 8
> ret
> will only screw up the return prediction, but won't invalidate RSB.
> 
> > Similar things happen in "return_to_handler" in ftrace_64.S,
> > which has once "call", but twice "ret". And it pretend a "call"
> > to make it balance.
> 
> This makes more sense to me. Let's try that approach instead
> of messing with the return address on stack?

The approach here is similar to "return_to_handler". For ftrace,
the original stack before the "ret" of the traced function is:

    POS:
    rip   ---> return_to_handler

The exit of the traced function will jump to return_to_handler.
In return_to_handler, it queries the real "rip" of the traced function
and then calls an internal function:

    call .Ldo_rop

And the stack now is:

    POS:
    rip   ----> the address after "call .Ldo_rop", which is an "int3"

In .Ldo_rop, it modifies the rip to the real rip, making it look
like this:

    POS:
    rip   ---> real rip

And then it returns. Taking the target function "foo" as an example,
the logic is:

    call foo -> call ftrace_caller -> return ftrace_caller ->
    return return_to_handler -> call Ldo_rop -> return foo

As you can see, the call and return addresses for ".Ldo_rop" are
also mismatched. So I think it works here too. Compared with a
messed-up return address, maybe a mismatched return has a milder
impact?

And the whole logic for us is:

    call foo -> call trampoline -> call origin ->
    return origin -> return POS -> return foo

The following is the relevant part of return_to_handler:

	/*
	 * Jump back to the old return address. This cannot be JMP_NOSPEC rdi
	 * since IBT would demand that contain ENDBR, which simply isn't so for
	 * return addresses. Use a retpoline here to keep the RSB balanced.
	 */
	ANNOTATE_INTRA_FUNCTION_CALL
	call .Ldo_rop
	int3
.Ldo_rop:
	mov %rdi, (%rsp)
	ALTERNATIVE __stringify(RET), \
		    __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
		    X86_FEATURE_CALL_DEPTH

> 
> > I were wandering why the overhead of fexit is much higher
> > than fentry. I added the percup-ref-get-and-put stuff to the
> > fentry, and the performance of it still can be 130M/s. However,
> > the fexit only has 76M/s. And the only difference is the origin
> > call.
> >
> > The RSB balancing mitigate it, but there are still gap. I
> > suspect it's still the branch predicting things.
> 
> If you have access to intel vtune profiler, check what is

Let me study the Intel VTune profiler stuff :)

I did a perf stat, and the branch misses increase significantly,
while the IPC (insns per cycle) decreases significantly.

> actually happening. It can show micro arch details.
> I don't think there is an open source alternative.
> 
> > > Or you mean call depth accounting that is done in sw ?
> > >
> > > > Do the RSB balance by pseudo a "ret". Instead of skipping the "rip", we
> > > > modify it to the address of a "ret" insn that we generate.
> > > >
> > > > The performance of "fexit" increases from 76M/s to 84M/s. Before this
> > > > optimize, the bench resulting of fexit is:
> > > >
> > > > fexit          :   76.494 ± 0.216M/s
> > > > fexit          :   76.319 ± 0.097M/s
> > > > fexit          :   70.680 ± 0.060M/s
> > > > fexit          :   75.509 ± 0.039M/s
> > > > fexit          :   76.392 ± 0.049M/s
> > > >
> > > > After this optimize:
> > > >
> > > > fexit          :   86.023 ± 0.518M/s
> > > > fexit          :   83.388 ± 0.021M/s
> > > > fexit          :   85.146 ± 0.058M/s
> > > > fexit          :   85.646 ± 0.136M/s
> > > > fexit          :   84.040 ± 0.045M/s
> > >
> > > This is with or without calldepth accounting?
> >
> > The CONFIG_MITIGATION_CALL_DEPTH_TRACKING is enabled, but
> > I did the testing with "mitigations=off" in the cmdline, so I guess
> > "without"?
> 
> Pls benchmark both. It sounds like call_depth_tracking
> miscounting stuff ?

Sadly, the performance decreases from 28M/s to 26M/s with
mitigations enabled. I think the additional "ret" increases the
overhead with "return-thunks" enabled.

Things are not as simple as I thought; let me do more testing
and see whether ftrace has a similar performance decrease.

Hi, @Peter. Do you have any advice on this?

Thanks!
Menglong Dong

> 
> > >
[......]
> >                            const u32 imm32_hi, const u32 imm32_lo)
> > {
> >         u64 imm64 = ((u64)imm32_hi << 32) | (u32)imm32_lo;
> >         u8 *prog = *pprog;
> >
> >         if (is_uimm32(imm64)) {
> >                 /*
> >                  * For emitting plain u32, where sign bit must not be
> >                  * propagated LLVM tends to load imm64 over mov32
> >                  * directly, so save couple of bytes by just doing
> >                  * 'mov %eax, imm32' instead.
> >                  */
> >                 emit_mov_imm32(&prog, false, dst_reg, imm32_lo);
> >         } else if (is_simm32(imm64)) {
> >                 emit_mov_imm32(&prog, true, dst_reg, imm32_lo);
> >         } else {
> >                 /* movabsq rax, imm64 */
> >                 EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
> >                 EMIT(imm32_lo, 4);
> >                 EMIT(imm32_hi, 4);
> 
> This part could be factored out as a separate helper.
> Then sizeof(movabsq) will be constant.
> Note, in the verifier we do:
> #if defined(MODULES_VADDR)
>                         u64 addr = MODULES_VADDR;
> #else
>                         u64 addr = VMALLOC_START;
> #endif
>                         /* jit (e.g. x86_64) may emit fewer instructions
>                          * if it learns a u32 imm is the same as a u64 imm.
>                          * Set close enough to possible prog address.
>                          */
>                         insn[0].imm = (u32)addr;
>                         insn[1].imm = addr >> 32;
> 
> do mitigate this issue.
> So you could have done:
> emit_mov_imm64(&prog, BPF_REG_0, VMALLOC_START >> 32, 0);
> 
> since 'ret_addr' math is incorrect at that point anyway.
> But prog += sizeof is imo cleaner.

Great! This makes things much clearer. After I figure out all
the stuff (which I'm not sure I'm able to), I'll do it this way.

Thanks!
Menglong Dong

> The whole thing might not be necessary with extra call approach.
> I suspect it should be faster than this approach.
> 

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05  7:13       ` Menglong Dong
@ 2025-11-05  7:46         ` Menglong Dong
  2025-11-05 23:31           ` Alexei Starovoitov
  2025-11-06 12:03           ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Menglong Dong @ 2025-11-05  7:46 UTC (permalink / raw)
  To: Alexei Starovoitov, peterz
  Cc: Menglong Dong, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	jiang.biao, bpf, Network Development, LKML

On 2025/11/5 15:13, Menglong Dong wrote:
> On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > >
> > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > >
> > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > break the RSB, as we have twice "call", but only once "ret".
> > > >
> > > > RSB meaning return stack buffer?
> > > >
> > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > >
> > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > for the branch predicting, and it will push the "rip" to its hardware
> > > stack on "call", and pop it from the stack on "ret". In the origin
> > > call case, there are twice "call" but once "ret", will break its
> > > balance.
> > 
> > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > since RSB has to be updated/invalidated by this store.
> > The behavior depends on the microarchitecture, of course.
> > I think:
> > add rsp, 8
> > ret
> > will only screw up the return prediction, but won't invalidate RSB.
> > 
> > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > which has once "call", but twice "ret". And it pretend a "call"
> > > to make it balance.
> > 
> > This makes more sense to me. Let's try that approach instead
> > of messing with the return address on stack?
> 
> The way here is similar to the "return_to_handler". For the ftrace,
> the origin stack before the "ret" of the traced function is:
> 
>     POS:
>     rip   ---> return_to_handler
> 
> And the exit of the traced function will jump to return_to_handler.
> In return_to_handler, it will query the real "rip" of the traced function
> and the it call a internal function:
> 
>     call .Ldo_rop
> 
> And the stack now is:
> 
>     POS:
>     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> 
> in the .Ldo_rop, it will modify the rip to the real rip to make
> it like this:
> 
>     POS:
>     rip   ---> real rip
> 
> And it return. Take the target function "foo" for example, the logic
> of it is:
> 
>     call foo -> call ftrace_caller -> return ftrace_caller ->
>     return return_to_handler -> call Ldo_rop -> return foo
> 
> As you can see, the call and return address for ".Ldo_rop" is
> also messed up. So I think it works here too. Compared with
> a messed "return address", a missed return maybe have
> better influence?
> 
> And the whole logic for us is:
> 
>     call foo -> call trampoline -> call origin ->
>     return origin -> return POS -> return foo

The "return POS" will miss the RSB, but the later return
will hit it.

The original logic is:

     call foo -> call trampoline -> call origin ->
     return origin -> return foo

The "return foo" and all the later return will miss the RBS.

Hmm... not sure if I understand it correctly.

> 

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05  7:46         ` Menglong Dong
@ 2025-11-05 23:31           ` Alexei Starovoitov
  2025-11-06  1:40             ` Menglong Dong
  2025-11-06 12:03           ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-05 23:31 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
>
> On 2025/11/5 15:13, Menglong Dong wrote:
> > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > >
> > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > >
> > > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > > break the RSB, as we have twice "call", but only once "ret".
> > > > >
> > > > > RSB meaning return stack buffer?
> > > > >
> > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > >
> > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > for the branch predicting, and it will push the "rip" to its hardware
> > > > stack on "call", and pop it from the stack on "ret". In the origin
> > > > call case, there are twice "call" but once "ret", will break its
> > > > balance.
> > >
> > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > since RSB has to be updated/invalidated by this store.
> > > The behavior depends on the microarchitecture, of course.
> > > I think:
> > > add rsp, 8
> > > ret
> > > will only screw up the return prediction, but won't invalidate RSB.
> > >
> > > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > > which has once "call", but twice "ret". And it pretend a "call"
> > > > to make it balance.
> > >
> > > This makes more sense to me. Let's try that approach instead
> > > of messing with the return address on stack?
> >
> > The way here is similar to the "return_to_handler". For the ftrace,
> > the origin stack before the "ret" of the traced function is:
> >
> >     POS:
> >     rip   ---> return_to_handler
> >
> > And the exit of the traced function will jump to return_to_handler.
> > In return_to_handler, it will query the real "rip" of the traced function
> > and the it call a internal function:
> >
> >     call .Ldo_rop
> >
> > And the stack now is:
> >
> >     POS:
> >     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> >
> > in the .Ldo_rop, it will modify the rip to the real rip to make
> > it like this:
> >
> >     POS:
> >     rip   ---> real rip
> >
> > And it return. Take the target function "foo" for example, the logic
> > of it is:
> >
> >     call foo -> call ftrace_caller -> return ftrace_caller ->
> >     return return_to_handler -> call Ldo_rop -> return foo
> >
> > As you can see, the call and return address for ".Ldo_rop" is
> > also messed up. So I think it works here too. Compared with
> > a messed "return address", a missed return maybe have
> > better influence?
> >
> > And the whole logic for us is:
> >
> >     call foo -> call trampoline -> call origin ->
> >     return origin -> return POS -> return foo
>
> The "return POS" will miss the RSB, but the later return
> will hit it.
>
> The origin logic is:
>
>      call foo -> call trampoline -> call origin ->
>      return origin -> return foo
>
> The "return foo" and all the later return will miss the RBS.
>
> Hmm......Not sure if I understand it correctly.

Here's another idea...
hack tr->func.ftrace_managed = false temporarily
and use BPF_MOD_JUMP in bpf_arch_text_poke()
when installing a trampoline with fexit progs,
and also do:
@@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im

        emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
        EMIT1(0xC9); /* leave */
-       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
-               /* skip our return address and return to parent */
-               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
-       }
        emit_return(&prog, image + (prog - (u8 *)rw_image));

Then the RSB is perfectly matched without messing up the stack
and/or adding extra calls.
If it works and performance is good, the next step is to
teach ftrace to emit a jmp or call in *_ftrace_direct().
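
For the experiment, something along these lines in kernel/bpf/trampoline.c
is probably enough (a rough sketch, not a mergeable patch: the
ftrace_location()/module handling of the real register_fentry() is elided,
and the unregister/modify paths would need the same hack):

static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
{
	void *ip = tr->func.addr;

	/* temporary hack: bypass ftrace management entirely ... */
	tr->func.ftrace_managed = false;

	/* ... and patch the fentry site to "jmp trampoline" instead of
	 * "call trampoline": the trampoline is entered without pushing
	 * an extra return address, its final "ret" pairs with the
	 * caller's original "call foo", and the trampoline's call into
	 * the origin function pairs with the origin's "ret".
	 */
	return bpf_arch_text_poke(ip, BPF_MOD_JUMP, NULL, new_addr);
}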


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05 23:31           ` Alexei Starovoitov
@ 2025-11-06  1:40             ` Menglong Dong
  2025-11-06  2:49               ` Menglong Dong
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-06  1:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/6 07:31, Alexei Starovoitov wrote:
> On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/5 15:13, Menglong Dong wrote:
> > > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > > >
> > > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > > >
> > > > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > > > break the RSB, as we have twice "call", but only once "ret".
> > > > > >
> > > > > > RSB meaning return stack buffer?
> > > > > >
> > > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > > >
> > > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > > for the branch predicting, and it will push the "rip" to its hardware
> > > > > stack on "call", and pop it from the stack on "ret". In the origin
> > > > > call case, there are twice "call" but once "ret", will break its
> > > > > balance.
> > > >
> > > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > > since RSB has to be updated/invalidated by this store.
> > > > The behavior depends on the microarchitecture, of course.
> > > > I think:
> > > > add rsp, 8
> > > > ret
> > > > will only screw up the return prediction, but won't invalidate RSB.
> > > >
> > > > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > > > which has once "call", but twice "ret". And it pretend a "call"
> > > > > to make it balance.
> > > >
> > > > This makes more sense to me. Let's try that approach instead
> > > > of messing with the return address on stack?
> > >
> > > The way here is similar to the "return_to_handler". For the ftrace,
> > > the origin stack before the "ret" of the traced function is:
> > >
> > >     POS:
> > >     rip   ---> return_to_handler
> > >
> > > And the exit of the traced function will jump to return_to_handler.
> > > In return_to_handler, it will query the real "rip" of the traced function
> > > and the it call a internal function:
> > >
> > >     call .Ldo_rop
> > >
> > > And the stack now is:
> > >
> > >     POS:
> > >     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> > >
> > > in the .Ldo_rop, it will modify the rip to the real rip to make
> > > it like this:
> > >
> > >     POS:
> > >     rip   ---> real rip
> > >
> > > And it return. Take the target function "foo" for example, the logic
> > > of it is:
> > >
> > >     call foo -> call ftrace_caller -> return ftrace_caller ->
> > >     return return_to_handler -> call Ldo_rop -> return foo
> > >
> > > As you can see, the call and return address for ".Ldo_rop" is
> > > also messed up. So I think it works here too. Compared with
> > > a messed "return address", a missed return maybe have
> > > better influence?
> > >
> > > And the whole logic for us is:
> > >
> > >     call foo -> call trampoline -> call origin ->
> > >     return origin -> return POS -> return foo
> >
> > The "return POS" will miss the RSB, but the later return
> > will hit it.
> >
> > The origin logic is:
> >
> >      call foo -> call trampoline -> call origin ->
> >      return origin -> return foo
> >
> > The "return foo" and all the later return will miss the RBS.
> >
> > Hmm......Not sure if I understand it correctly.
> 
> Here another idea...
> hack tr->func.ftrace_managed = false temporarily
> and use BPF_MOD_JUMP in bpf_arch_text_poke()
> when installing trampoline with fexit progs.
> and also do:
> @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> bpf_tramp_image *im, void *rw_im
> 
>         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
>         EMIT1(0xC9); /* leave */
> -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> -               /* skip our return address and return to parent */
> -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> -       }
>         emit_return(&prog, image + (prog - (u8 *)rw_image));
> 
> Then RSB is perfectly matched without messing up the stack
> and/or extra calls.
> If it works and performance is good the next step is to
> teach ftrace to emit jmp or call in *_ftrace_direct()

Good idea. I saw that "return_to_handler" used "JMP_NOSPEC", and
the jmp was converted to a "fake call" to play nice with IBT in this commit:

e52fc2cf3f66 ("x86/ibt,ftrace: Make function-graph play nice")

It's not an indirect branch in our case, but let me do more testing to
see if there are any unexpected effects if we use "jmp" here.

Thanks!
Menglong Dong

> 

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-06  1:40             ` Menglong Dong
@ 2025-11-06  2:49               ` Menglong Dong
  2025-11-06  2:56                 ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-06  2:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/6 09:40, Menglong Dong wrote:
> On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > >
> > > On 2025/11/5 15:13, Menglong Dong wrote:
> > > > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > > > >
> > > > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > > > > break the RSB, as we have twice "call", but only once "ret".
> > > > > > >
> > > > > > > RSB meaning return stack buffer?
> > > > > > >
> > > > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > > > >
> > > > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > > > for the branch predicting, and it will push the "rip" to its hardware
> > > > > > stack on "call", and pop it from the stack on "ret". In the origin
> > > > > > call case, there are twice "call" but once "ret", will break its
> > > > > > balance.
> > > > >
> > > > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > > > since RSB has to be updated/invalidated by this store.
> > > > > The behavior depends on the microarchitecture, of course.
> > > > > I think:
> > > > > add rsp, 8
> > > > > ret
> > > > > will only screw up the return prediction, but won't invalidate RSB.
> > > > >
> > > > > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > > > > which has once "call", but twice "ret". And it pretend a "call"
> > > > > > to make it balance.
> > > > >
> > > > > This makes more sense to me. Let's try that approach instead
> > > > > of messing with the return address on stack?
> > > >
> > > > The way here is similar to the "return_to_handler". For the ftrace,
> > > > the origin stack before the "ret" of the traced function is:
> > > >
> > > >     POS:
> > > >     rip   ---> return_to_handler
> > > >
> > > > And the exit of the traced function will jump to return_to_handler.
> > > > In return_to_handler, it will query the real "rip" of the traced function
> > > > and the it call a internal function:
> > > >
> > > >     call .Ldo_rop
> > > >
> > > > And the stack now is:
> > > >
> > > >     POS:
> > > >     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> > > >
> > > > in the .Ldo_rop, it will modify the rip to the real rip to make
> > > > it like this:
> > > >
> > > >     POS:
> > > >     rip   ---> real rip
> > > >
> > > > And it return. Take the target function "foo" for example, the logic
> > > > of it is:
> > > >
> > > >     call foo -> call ftrace_caller -> return ftrace_caller ->
> > > >     return return_to_handler -> call Ldo_rop -> return foo
> > > >
> > > > As you can see, the call and return address for ".Ldo_rop" is
> > > > also messed up. So I think it works here too. Compared with
> > > > a messed "return address", a missed return maybe have
> > > > better influence?
> > > >
> > > > And the whole logic for us is:
> > > >
> > > >     call foo -> call trampoline -> call origin ->
> > > >     return origin -> return POS -> return foo
> > >
> > > The "return POS" will miss the RSB, but the later return
> > > will hit it.
> > >
> > > The origin logic is:
> > >
> > >      call foo -> call trampoline -> call origin ->
> > >      return origin -> return foo
> > >
> > > The "return foo" and all the later return will miss the RBS.
> > >
> > > Hmm......Not sure if I understand it correctly.
> > 
> > Here another idea...
> > hack tr->func.ftrace_managed = false temporarily
> > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > when installing trampoline with fexit progs.
> > and also do:
> > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > bpf_tramp_image *im, void *rw_im
> > 
> >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> >         EMIT1(0xC9); /* leave */
> > -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > -               /* skip our return address and return to parent */
> > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > -       }
> >         emit_return(&prog, image + (prog - (u8 *)rw_image));
> > 
> > Then RSB is perfectly matched without messing up the stack
> > and/or extra calls.
> > If it works and performance is good the next step is to
> > teach ftrace to emit jmp or call in *_ftrace_direct()

After the modification, the performance of fexit increases from
76M/s to 137M/s, awesome!

> 
> Good idea. I saw the "return_to_handler" used "JMP_NOSPEC", and
> the jmp is converted to the "fake call" to be nice to IBT in this commit:
> 
> e52fc2cf3f66 ("x86/ibt,ftrace: Make function-graph play nice")
> 
> It's not indirect branch in our case, but let me do more testing to
> see if there are any unexpected effect if we use "jmp" here.
> 
> Thanks!
> Menglong Dong
> 
> > 
> 
> 
> 
> 
> 
> 

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-06  2:49               ` Menglong Dong
@ 2025-11-06  2:56                 ` Alexei Starovoitov
  2025-11-06  3:00                   ` Menglong Dong
  2025-11-10 11:43                   ` Menglong Dong
  0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-06  2:56 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On Wed, Nov 5, 2025 at 6:49 PM Menglong Dong <menglong.dong@linux.dev> wrote:
>
> On 2025/11/6 09:40, Menglong Dong wrote:
> > On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > >
> > > > On 2025/11/5 15:13, Menglong Dong wrote:
> > > > > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > > > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > > > > >
> > > > > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > > > > > break the RSB, as we have twice "call", but only once "ret".
> > > > > > > >
> > > > > > > > RSB meaning return stack buffer?
> > > > > > > >
> > > > > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > > > > >
> > > > > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > > > > for the branch predicting, and it will push the "rip" to its hardware
> > > > > > > stack on "call", and pop it from the stack on "ret". In the origin
> > > > > > > call case, there are twice "call" but once "ret", will break its
> > > > > > > balance.
> > > > > >
> > > > > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > > > > since RSB has to be updated/invalidated by this store.
> > > > > > The behavior depends on the microarchitecture, of course.
> > > > > > I think:
> > > > > > add rsp, 8
> > > > > > ret
> > > > > > will only screw up the return prediction, but won't invalidate RSB.
> > > > > >
> > > > > > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > > > > > which has once "call", but twice "ret". And it pretend a "call"
> > > > > > > to make it balance.
> > > > > >
> > > > > > This makes more sense to me. Let's try that approach instead
> > > > > > of messing with the return address on stack?
> > > > >
> > > > > The way here is similar to the "return_to_handler". For the ftrace,
> > > > > the origin stack before the "ret" of the traced function is:
> > > > >
> > > > >     POS:
> > > > >     rip   ---> return_to_handler
> > > > >
> > > > > And the exit of the traced function will jump to return_to_handler.
> > > > > In return_to_handler, it will query the real "rip" of the traced function
> > > > > and the it call a internal function:
> > > > >
> > > > >     call .Ldo_rop
> > > > >
> > > > > And the stack now is:
> > > > >
> > > > >     POS:
> > > > >     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> > > > >
> > > > > in the .Ldo_rop, it will modify the rip to the real rip to make
> > > > > it like this:
> > > > >
> > > > >     POS:
> > > > >     rip   ---> real rip
> > > > >
> > > > > And it return. Take the target function "foo" for example, the logic
> > > > > of it is:
> > > > >
> > > > >     call foo -> call ftrace_caller -> return ftrace_caller ->
> > > > >     return return_to_handler -> call Ldo_rop -> return foo
> > > > >
> > > > > As you can see, the call and return address for ".Ldo_rop" is
> > > > > also messed up. So I think it works here too. Compared with
> > > > > a messed "return address", a missed return maybe have
> > > > > better influence?
> > > > >
> > > > > And the whole logic for us is:
> > > > >
> > > > >     call foo -> call trampoline -> call origin ->
> > > > >     return origin -> return POS -> return foo
> > > >
> > > > The "return POS" will miss the RSB, but the later return
> > > > will hit it.
> > > >
> > > > The origin logic is:
> > > >
> > > >      call foo -> call trampoline -> call origin ->
> > > >      return origin -> return foo
> > > >
> > > > The "return foo" and all the later return will miss the RBS.
> > > >
> > > > Hmm......Not sure if I understand it correctly.
> > >
> > > Here another idea...
> > > hack tr->func.ftrace_managed = false temporarily
> > > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > > when installing trampoline with fexit progs.
> > > and also do:
> > > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > > bpf_tramp_image *im, void *rw_im
> > >
> > >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > >         EMIT1(0xC9); /* leave */
> > > -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > -               /* skip our return address and return to parent */
> > > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > -       }
> > >         emit_return(&prog, image + (prog - (u8 *)rw_image));
> > >
> > > Then RSB is perfectly matched without messing up the stack
> > > and/or extra calls.
> > > If it works and performance is good the next step is to
> > > teach ftrace to emit jmp or call in *_ftrace_direct()
>
> After the modification, the performance of fexit increase from
> 76M/s to 137M/s, awesome!

Nice! much better than double 'ret' :)
_ftrace_direct() next?


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-06  2:56                 ` Alexei Starovoitov
@ 2025-11-06  3:00                   ` Menglong Dong
  2025-11-10 11:43                   ` Menglong Dong
  1 sibling, 0 replies; 16+ messages in thread
From: Menglong Dong @ 2025-11-06  3:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/6 10:56, Alexei Starovoitov wrote:
> On Wed, Nov 5, 2025 at 6:49 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/6 09:40, Menglong Dong wrote:
> > > On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > > > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > > >
> > > > > On 2025/11/5 15:13, Menglong Dong wrote:
> > > > > > On 2025/11/5 10:12, Alexei Starovoitov wrote:
> > > > > > > On Tue, Nov 4, 2025 at 5:30 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On 2025/11/5 02:56, Alexei Starovoitov wrote:
> > > > > > > > > On Tue, Nov 4, 2025 at 2:49 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In origin call case, we skip the "rip" directly before we return, which
> > > > > > > > > > break the RSB, as we have twice "call", but only once "ret".
> > > > > > > > >
> > > > > > > > > RSB meaning return stack buffer?
> > > > > > > > >
> > > > > > > > > and by "breaks RSB" you mean it makes the cpu less efficient?
> > > > > > > >
> > > > > > > > Yeah, I mean it makes the cpu less efficient. The RSB is used
> > > > > > > > for the branch predicting, and it will push the "rip" to its hardware
> > > > > > > > stack on "call", and pop it from the stack on "ret". In the origin
> > > > > > > > call case, there are twice "call" but once "ret", will break its
> > > > > > > > balance.
> > > > > > >
> > > > > > > Yes. I'm aware, but your "mov [rbp + 8], rax" screws it up as well,
> > > > > > > since RSB has to be updated/invalidated by this store.
> > > > > > > The behavior depends on the microarchitecture, of course.
> > > > > > > I think:
> > > > > > > add rsp, 8
> > > > > > > ret
> > > > > > > will only screw up the return prediction, but won't invalidate RSB.
> > > > > > >
> > > > > > > > Similar things happen in "return_to_handler" in ftrace_64.S,
> > > > > > > > which has once "call", but twice "ret". And it pretend a "call"
> > > > > > > > to make it balance.
> > > > > > >
> > > > > > > This makes more sense to me. Let's try that approach instead
> > > > > > > of messing with the return address on stack?
> > > > > >
> > > > > > The way here is similar to the "return_to_handler". For the ftrace,
> > > > > > the origin stack before the "ret" of the traced function is:
> > > > > >
> > > > > >     POS:
> > > > > >     rip   ---> return_to_handler
> > > > > >
> > > > > > And the exit of the traced function will jump to return_to_handler.
> > > > > > In return_to_handler, it will query the real "rip" of the traced function
> > > > > > and the it call a internal function:
> > > > > >
> > > > > >     call .Ldo_rop
> > > > > >
> > > > > > And the stack now is:
> > > > > >
> > > > > >     POS:
> > > > > >     rip   ----> the address after "call .Ldo_rop", which is a "int3"
> > > > > >
> > > > > > in the .Ldo_rop, it will modify the rip to the real rip to make
> > > > > > it like this:
> > > > > >
> > > > > >     POS:
> > > > > >     rip   ---> real rip
> > > > > >
> > > > > > And it return. Take the target function "foo" for example, the logic
> > > > > > of it is:
> > > > > >
> > > > > >     call foo -> call ftrace_caller -> return ftrace_caller ->
> > > > > >     return return_to_handler -> call Ldo_rop -> return foo
> > > > > >
> > > > > > As you can see, the call and return address for ".Ldo_rop" is
> > > > > > also messed up. So I think it works here too. Compared with
> > > > > > a messed "return address", a missed return maybe have
> > > > > > better influence?
> > > > > >
> > > > > > And the whole logic for us is:
> > > > > >
> > > > > >     call foo -> call trampoline -> call origin ->
> > > > > >     return origin -> return POS -> return foo
> > > > >
> > > > > The "return POS" will miss the RSB, but the later return
> > > > > will hit it.
> > > > >
> > > > > The origin logic is:
> > > > >
> > > > >      call foo -> call trampoline -> call origin ->
> > > > >      return origin -> return foo
> > > > >
> > > > > The "return foo" and all the later return will miss the RBS.
> > > > >
> > > > > Hmm......Not sure if I understand it correctly.
> > > >
> > > > Here another idea...
> > > > hack tr->func.ftrace_managed = false temporarily
> > > > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > > > when installing trampoline with fexit progs.
> > > > and also do:
> > > > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > > > bpf_tramp_image *im, void *rw_im
> > > >
> > > >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > > >         EMIT1(0xC9); /* leave */
> > > > -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > > -               /* skip our return address and return to parent */
> > > > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > > -       }
> > > >         emit_return(&prog, image + (prog - (u8 *)rw_image));
> > > >
> > > > Then RSB is perfectly matched without messing up the stack
> > > > and/or extra calls.
> > > > If it works and performance is good the next step is to
> > > > teach ftrace to emit jmp or call in *_ftrace_direct()
> >
> > After the modification, the performance of fexit increase from
> > 76M/s to 137M/s, awesome!
> 
> Nice! much better than double 'ret' :)
> _ftrace_direct() next?

Yeah, I'll work on _ftrace_direct() next.


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-05  7:46         ` Menglong Dong
  2025-11-05 23:31           ` Alexei Starovoitov
@ 2025-11-06 12:03           ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-11-06 12:03 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Alexei Starovoitov, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On Wed, Nov 05, 2025 at 03:46:47PM +0800, Menglong Dong wrote:

> The "return POS" will miss the RSB, but the later return
> will hit it.

Right, this. The moment you have an asymmetry in the RSB, all future RETs
will miss, and that sucks. Missing one but then hitting all the others is
much better.
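
Just to spell that out with a rough sketch (RA0..RA3 are labels I'm adding
for the saved return addresses, not names from the patch), the SKIP_FRAME +
origin-call path looks roughly like:

    call foo                RSB: [.. RA0, RA1]
    call trampoline         RSB: [.. RA0, RA1, RA2]
    call orig               RSB: [.. RA0, RA1, RA2, RA3]
    ret from orig           pops RA3, predicted == actual      -> hit
    add rsp, 8; ret         actual RA1, predicted RA2 (popped) -> miss
    ret from foo's caller   actual RA0, predicted RA1 (popped) -> miss
    ...                     every ret above stays off by one   -> miss

whereas a balanced scheme mispredicts at most once and everything above it
keeps hitting.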

Anyway, I see this thread came to a good conclusion. Just one more note,
people are working on supervisor shadow stack support, which is going to
make all these things even more 'fun' :-)


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-06  2:56                 ` Alexei Starovoitov
  2025-11-06  3:00                   ` Menglong Dong
@ 2025-11-10 11:43                   ` Menglong Dong
  2025-11-10 16:32                     ` Alexei Starovoitov
  1 sibling, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-10 11:43 UTC (permalink / raw)
  To: Alexei Starovoitov, sjenning
  Cc: Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/6 10:56, Alexei Starovoitov wrote:
> On Wed, Nov 5, 2025 at 6:49 PM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> > On 2025/11/6 09:40, Menglong Dong wrote:
> > > On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > > > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@linux.dev> wrote:
[......]
> > > >
> > > > Here another idea...
> > > > hack tr->func.ftrace_managed = false temporarily
> > > > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > > > when installing trampoline with fexit progs.
> > > > and also do:
> > > > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > > > bpf_tramp_image *im, void *rw_im
> > > >
> > > >         emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > > >         EMIT1(0xC9); /* leave */
> > > > -       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > > -               /* skip our return address and return to parent */
> > > > -               EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > > -       }
> > > >         emit_return(&prog, image + (prog - (u8 *)rw_image));
> > > >
> > > > Then RSB is perfectly matched without messing up the stack
> > > > and/or extra calls.
> > > > If it works and performance is good the next step is to
> > > > teach ftrace to emit jmp or call in *_ftrace_direct()
> >
> > After the modification, the performance of fexit increase from
> > 76M/s to 137M/s, awesome!
> 
> Nice! much better than double 'ret' :)
> _ftrace_direct() next?

Hi, all

Do you think it is worth implementing livepatch with the bpf trampoline
by introducing CONFIG_LIVEPATCH_BPF? It's easy to achieve: I have a POC
for it, and the performance of the livepatch increases from 99M/s to
200M/s according to my bench testing.

The results above were tested with the return thunk disabled. With the
return thunk enabled, the performance decreases from 58M/s to 52M/s. The
main performance improvement comes from the RSB, and the return thunk
always breaks the RSB, so there is no improvement in that case. The calls
to the per-cpu ref get and put make the bpf-trampoline-based livepatch
perform worse than the ftrace-based one.

Thanks!
Menglong Dong


* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-10 11:43                   ` Menglong Dong
@ 2025-11-10 16:32                     ` Alexei Starovoitov
  2025-11-11  1:28                       ` Menglong Dong
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-10 16:32 UTC (permalink / raw)
  To: Menglong Dong
  Cc: sjenning, Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On Mon, Nov 10, 2025 at 3:43 AM Menglong Dong <menglong.dong@linux.dev> wrote:
>
>
> Do you think if it is worth to implement the livepatch with
> bpf trampoline by introduce the CONFIG_LIVEPATCH_BPF?
> It's easy to achieve it, I have a POC for it, and the performance
> of the livepatch increase from 99M/s to 200M/s according to
> my bench testing.

what do you mean exactly?
I don't want to add more complexity to bpf trampoline.
Improve the current livepatching logic? jmp vs call isn't special.

> The results above is tested with return-trunk disabled. With the
> return-trunk enabled, the performance decrease from 58M/s to
> 52M/s. The main performance improvement comes from the RSB,
> and the return-trunk will always break the RSB, which makes it has
> no improvement. The calling to per-cpu-ref get and put make
> the bpf trampoline based livepatch has a worse performance
> than ftrace based.
>
> Thanks!
> Menglong Dong
>

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-10 16:32                     ` Alexei Starovoitov
@ 2025-11-11  1:28                       ` Menglong Dong
  2025-11-11  2:41                         ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Menglong Dong @ 2025-11-11  1:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: sjenning, Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On 2025/11/11 00:32, Alexei Starovoitov wrote:
> On Mon, Nov 10, 2025 at 3:43 AM Menglong Dong <menglong.dong@linux.dev> wrote:
> >
> >
> > Do you think if it is worth to implement the livepatch with
> > bpf trampoline by introduce the CONFIG_LIVEPATCH_BPF?
> > It's easy to achieve it, I have a POC for it, and the performance
> > of the livepatch increase from 99M/s to 200M/s according to
> > my bench testing.
> 
> what do you mean exactly?

This is really a separate topic, and we can talk about it later. Let
me give a brief description here.

I mean implementing livepatch with the bpf trampoline. For now,
livepatch is implemented with ftrace, which breaks the RSB and has
more overhead on x86_64.

It can be easily implemented by replacing the original call target
("orig_call") with the address that livepatch provides.

> I don't want to add more complexity to bpf trampoline.

If you mean the arch-specific part, it won't add complexity. If
anything, it can make the x86_64 code a little simpler with the
following patch:

--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -3176,7 +3176,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 					 void *rw_image_end, void *image,
 					 const struct btf_func_model *m, u32 flags,
 					 struct bpf_tramp_links *tlinks,
-					 void *func_addr)
+					 void *func_addr, void *origin_call_param)
 {
 	int i, ret, nr_regs = m->nr_args, stack_size = 0;
 	int regs_off, nregs_off, ip_off, run_ctx_off, arg_stack_off, rbx_off;
@@ -3280,6 +3280,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 			orig_call += ENDBR_INSN_SIZE;
 		orig_call += X86_PATCH_SIZE;
 	}
+	orig_call = origin_call_param ?: orig_call;
 
 	prog = rw_image;
 
@@ -3369,15 +3370,10 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 			LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack_size);
 		}
 
-		if (flags & BPF_TRAMP_F_ORIG_STACK) {
-			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
-			EMIT2(0xff, 0xd3); /* call *rbx */
-		} else {
-			/* call original function */
-			if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) {
-				ret = -EINVAL;
-				goto cleanup;
-			}
+		/* call original function */
+		if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) {
+			ret = -EINVAL;
+			goto cleanup;
 		}
 		/* remember return value in a stack for bpf prog to access */
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);

> Improve current livepatching logic ? jmp vs call isn't special.

Sort of. According to my testing, the performance of the bpf
trampoline is much better than the ftrace trampoline, so if we
can implement livepatch with the bpf trampoline, the performance can
be improved. Of course, the bpf trampoline would need to offer an API
to livepatch for this purpose, something like the sketch below.
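
Purely as an illustration of the shape such an API could take (everything
below, including the names bpf_lp_patch and bpf_trampoline_livepatch_attach,
is hypothetical, not an existing kernel interface):

/* Hypothetical livepatch-facing interface, for illustration only. */
struct bpf_lp_patch {
	void *func_addr;   /* kernel function being patched         */
	void *new_func;    /* replacement provided by the livepatch */
};

/*
 * Would build and attach a trampoline for func_addr whose "origin call"
 * goes to new_func instead of the patched function's own body, i.e. the
 * origin_call_param idea from the diff above.
 */
int bpf_trampoline_livepatch_attach(struct bpf_lp_patch *p);
int bpf_trampoline_livepatch_detach(struct bpf_lp_patch *p);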

Anyway, let me finish the work in this patch first. After that, I can
send an RFC for the proposal.

Thanks!
Menglong Dong

> 
> > The results above is tested with return-trunk disabled. With the
> > return-trunk enabled, the performance decrease from 58M/s to
> > 52M/s. The main performance improvement comes from the RSB,
> > and the return-trunk will always break the RSB, which makes it has
> > no improvement. The calling to per-cpu-ref get and put make
> > the bpf trampoline based livepatch has a worse performance
> > than ftrace based.
> >
> > Thanks!
> > Menglong Dong
> >

* Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline
  2025-11-11  1:28                       ` Menglong Dong
@ 2025-11-11  2:41                         ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2025-11-11  2:41 UTC (permalink / raw)
  To: Menglong Dong
  Cc: sjenning, Peter Zijlstra, Menglong Dong, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Eduard,
	Song Liu, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, X86 ML, H. Peter Anvin, jiang.biao, bpf,
	Network Development, LKML

On Mon, Nov 10, 2025 at 5:28 PM Menglong Dong <menglong.dong@linux.dev> wrote:
>
>
> Some kind. According to my testing, the performance of bpf
> trampoline is much better than ftrace trampoline, so if we
> can implement it with bpf trampoline, the performance can be
> improved. Of course, the bpf trampoline need to offer a API
> to the livepatch for this propose.

Sure, then improve ftrace trampoline by doing the same tricks
as bpf trampoline.

> Any way, let me finish the work in this patch first. After that,
> I can send a RFC of the proposal.

Don't. Livepatching is not a job of the bpf trampoline.
Song recently fixed the interaction between livepatch and
fexit. We will not be adding another dimension
of complexity here where the bpf trampoline is used both for
livepatching and for bpf progs.


end of thread

Thread overview: 16+ messages
2025-11-04 10:49 [PATCH bpf-next] bpf,x86: do RSB balance for trampoline Menglong Dong
2025-11-04 18:56 ` Alexei Starovoitov
2025-11-05  1:30   ` Menglong Dong
2025-11-05  2:12     ` Alexei Starovoitov
2025-11-05  7:13       ` Menglong Dong
2025-11-05  7:46         ` Menglong Dong
2025-11-05 23:31           ` Alexei Starovoitov
2025-11-06  1:40             ` Menglong Dong
2025-11-06  2:49               ` Menglong Dong
2025-11-06  2:56                 ` Alexei Starovoitov
2025-11-06  3:00                   ` Menglong Dong
2025-11-10 11:43                   ` Menglong Dong
2025-11-10 16:32                     ` Alexei Starovoitov
2025-11-11  1:28                       ` Menglong Dong
2025-11-11  2:41                         ` Alexei Starovoitov
2025-11-06 12:03           ` Peter Zijlstra
