BPF List
* [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace
@ 2024-08-25 13:09 Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Leon Hwang @ 2024-08-25 13:09 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, eddyz87, iii, leon.hwang, kernel-patches-bot

Previously, I fixed a tailcall infinite loop issue caused by trampoline[0].

This time, I fix a tailcall infinite loop issue caused by freplace.

Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
a freplace prog has been able to tail call its target prog.

What happens when a freplace prog attaches to its target prog's subprog
and then tail calls its target prog?

The kernel panics, because the TASK stack guard page is hit.

The panic is fixed on both x64 and arm64[1]. Please check the
corresponding patches for details.
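
A minimal userspace sketch of the trigger, mirroring the selftest added
in patch 3 (the skeleton variable names here are illustrative):

/* attach entry_freplace to entry_tc's subprog ... */
int tgt_fd = bpf_program__fd(tc_skel->progs.entry_tc);
int key = 0;

bpf_program__set_attach_target(freplace_skel->progs.entry_freplace,
			       tgt_fd, "subprog_tc");
/* ... load the freplace skeleton, then complete the attachment ... */
bpf_program__attach_freplace(freplace_skel->progs.entry_freplace,
			     tgt_fd, "subprog_tc");
/* ... and point its jmp_table back at entry_tc */
bpf_map_update_elem(bpf_map__fd(freplace_skel->maps.jmp_table),
		    &key, &tgt_fd, BPF_ANY);
/* running entry_tc now loops:
 * entry_tc -> subprog_tc -> entry_freplace -> subprog_tail
 *	--tailcall-> entry_tc
 */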

Links:
[0] https://lore.kernel.org/bpf/20230912150442.2009-1-hffilwlqm@gmail.com/
[1] https://github.com/kernel-patches/bpf/pull/7591

Leon Hwang (4):
  bpf, x64: Fix tailcall infinite loop caused by freplace
  bpf, arm64: Fix tailcall infinite loop caused by freplace
  selftests/bpf: Add testcases for another tailcall infinite loop fix
  selftests/bpf: Fix verifier tailcall jit selftest

 arch/arm64/net/bpf_jit_comp.c                 |  39 +++-
 arch/x86/net/bpf_jit_comp.c                   |  44 ++--
 include/linux/bpf.h                           |   6 +-
 kernel/bpf/trampoline.c                       |   4 +-
 kernel/bpf/verifier.c                         |   4 +-
 .../selftests/bpf/prog_tests/tailcalls.c      | 209 +++++++++++++++++-
 .../tailcall_bpf2bpf_hierarchy_freplace.c     |  30 +++
 .../testing/selftests/bpf/progs/tc_bpf2bpf.c  |  18 +-
 .../bpf/progs/verifier_tailcall_jit.c         |   4 +-
 9 files changed, 322 insertions(+), 36 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy_freplace.c

-- 
2.44.0



* [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
@ 2024-08-25 13:09 ` Leon Hwang
  2024-08-27 10:37   ` Eduard Zingerman
  2024-08-25 13:09 ` [PATCH bpf-next 2/4] bpf, arm64: " Leon Hwang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-25 13:09 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, eddyz87, iii, leon.hwang, kernel-patches-bot

This patch fixes a tailcall infinite loop issue caused by freplace.

Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
a freplace prog has been allowed to tail call its target prog. So,
when a freplace prog attaches to its target prog's subprog and tail
calls its target prog, the kernel panics.

For example:

tc_bpf2bpf.c:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

__noinline
int subprog_tc(struct __sk_buff *skb)
{
	return skb->len * 2;
}

SEC("tc")
int entry_tc(struct __sk_buff *skb)
{
	return subprog_tc(skb);
}

char __license[] SEC("license") = "GPL";

tailcall_bpf2bpf_hierarchy_freplace.c:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("freplace")
int entry_freplace(struct __sk_buff *skb)
{
	count++;
	subprog_tail(skb);
	subprog_tail(skb);
	return count;
}

char __license[] SEC("license") = "GPL";

The attach target of entry_freplace is subprog_tc, and the tail callee
in subprog_tail is entry_tc.

The infinite loop is then entry_tc -> subprog_tc -> entry_freplace
-> subprog_tail --tailcall-> entry_tc, because the tail_call_cnt on
entry_freplace's stack restarts from zero on every execution of
entry_freplace. The kernel panics:

[   15.310490] BUG: TASK stack guard page was hit at (____ptrval____) (stack is (____ptrval____)..(____ptrval____))
[   15.310490] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[   15.310490] CPU: 1 PID: 89 Comm: test_progs Tainted: G           OE 6.10.0-rc6-g026dcdae8d3e-dirty #72
[   15.310490] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
[   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
[   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
[   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX: 0000000000008cb5
[   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI: ffff9c98808b7e00
[   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09: 0000000000000000
[   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12: ffffb500c01af000
[   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15: 0000000000000000
[   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000) knlGS:0000000000000000
[   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4: 00000000000006f0
[   15.310490] Call Trace:
[   15.310490]  <#DF>
[   15.310490]  ? die+0x36/0x90
[   15.310490]  ? handle_stack_overflow+0x4d/0x60
[   15.310490]  ? exc_double_fault+0x117/0x1a0
[   15.310490]  ? asm_exc_double_fault+0x23/0x30
[   15.310490]  ? bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
[   15.310490]  </#DF>
[   15.310490]  <TASK>
[   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
[   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
[   15.310490]  ...
[   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
[   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
[   15.310490]  bpf_test_run+0x210/0x370
[   15.310490]  ? bpf_test_run+0x128/0x370
[   15.310490]  bpf_prog_test_run_skb+0x388/0x7a0
[   15.310490]  __sys_bpf+0xdbf/0x2c40
[   15.310490]  ? clockevents_program_event+0x52/0xf0
[   15.310490]  ? lock_release+0xbf/0x290
[   15.310490]  __x64_sys_bpf+0x1e/0x30
[   15.310490]  do_syscall_64+0x68/0x140
[   15.310490]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   15.310490] RIP: 0033:0x7f133b52725d
[   15.310490] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
[   15.310490] RSP: 002b:00007ffddbc10258 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
[   15.310490] RAX: ffffffffffffffda RBX: 00007ffddbc10828 RCX: 00007f133b52725d
[   15.310490] RDX: 0000000000000050 RSI: 00007ffddbc102a0 RDI: 000000000000000a
[   15.310490] RBP: 00007ffddbc10270 R08: 0000000000000000 R09: 00007ffddbc102a0
[   15.310490] R10: 0000000000000064 R11: 0000000000000206 R12: 0000000000000004
[   15.310490] R13: 0000000000000000 R14: 0000558ec4c24890 R15: 00007f133b6ed000
[   15.310490]  </TASK>
[   15.310490] Modules linked in: bpf_testmod(OE)
[   15.310490] ---[ end trace 0000000000000000 ]---
[   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
[   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
[   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
[   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX: 0000000000008cb5
[   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI: ffff9c98808b7e00
[   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09: 0000000000000000
[   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12: ffffb500c01af000
[   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15: 0000000000000000
[   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000) knlGS:0000000000000000
[   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4: 00000000000006f0
[   15.310490] Kernel panic - not syncing: Fatal exception in interrupt
[   15.310490] Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

This patch fixes the issue by initializing tail_call_cnt in the
prologue of entry_tc.

Next, when entry_tc calls subprog_tc, the tail_call_cnt_ptr is
propagated to subprog_tc in rax.
Next, when execution jumps to entry_freplace, the tail_call_cnt_ptr is
reused to count tail calls in the freplace prog.
Next, when entry_freplace calls subprog_tail, the tail_call_cnt_ptr is
propagated to subprog_tail in rax.
Next, when tail calling back to entry_tc, the tail_call_cnt on the
stack of entry_tc is incremented through the tail_call_cnt_ptr.
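
Conceptually, the propagation can be modeled in C as below (a sketch
only; in the JITed code the pointer travels in rax between frames, not
as a C parameter, and the MAX_TAIL_CALL_CNT check is part of the
emitted tail call sequence):

#include <stdio.h>

#define MAX_TAIL_CALL_CNT 33

static int entry_tc(int len, unsigned int *tcc_ptr);

static int subprog_tail(int len, unsigned int *tcc_ptr)
{
	if (*tcc_ptr >= MAX_TAIL_CALL_CNT)	/* limit check via the pointer */
		return 0;
	(*tcc_ptr)++;			/* increments entry_tc's counter */
	return entry_tc(len, tcc_ptr);	/* models the tail call */
}

static int entry_freplace(int len, unsigned int *tcc_ptr)
{
	subprog_tail(len, tcc_ptr);
	subprog_tail(len, tcc_ptr);
	return 0;
}

static int subprog_tc(int len, unsigned int *tcc_ptr)
{
	return entry_freplace(len, tcc_ptr);	/* models the poked jump */
}

static int entry_tc(int len, unsigned int *tcc_ptr)
{
	return subprog_tc(len, tcc_ptr);
}

int main(void)
{
	unsigned int tail_call_cnt = 0;	/* one counter, on the entry's stack */

	entry_tc(64, &tail_call_cnt);
	printf("tail_call_cnt = %u\n", tail_call_cnt);	/* bounded at 33 */
	return 0;
}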

The whole procedure is shown in the following JITed prog dumps.

bpftool p d j n entry_tc:

int entry_tc(struct __sk_buff * skb):
bpf_prog_1c515f389a9059b4_entry_tc:
; return subprog_tc(skb);
   0:	endbr64
   4:	xorq	%rax, %rax		;; rax = 0 (tail_call_cnt)
   7:	nopl	(%rax,%rax)
   c:	pushq	%rbp
   d:	movq	%rsp, %rbp
  10:	endbr64
  14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
  18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
  1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
  1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
  1e:	jmp	0x21			;; ---------+               |
  20:	pushq	%rax			;; <--------|---------------+
  21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
  22:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
  29:	callq	0x70			;; call subprog_tc()
; return subprog_tc(skb);
  2e:	leave
  2f:	retq

int subprog_tc(struct __sk_buff * skb):
bpf_prog_1e8f76e2374a0607_subprog_tc:
; return skb->len * 2;
   0:	endbr64
   4:	nopl	(%rax)			;; do not touch tail_call_cnt
   7:	jmp	0x108			;; jump to entry_freplace()
   c:	pushq	%rbp
   d:	movq	%rsp, %rbp
  10:	endbr64
  14:	pushq	%rax
  15:	pushq	%rax
  16:	movl	112(%rdi), %eax
; return skb->len * 2;
  19:	shll	%eax
; return skb->len * 2;
  1b:	leave
  1c:	retq

bpftool p d j n entry_freplace:

int entry_freplace(struct __sk_buff * skb):
bpf_prog_85781a698094722f_entry_freplace:
; int entry_freplace(struct __sk_buff *skb)
   0:	endbr64
   4:	nopl	(%rax)			;; do not touch tail_call_cnt
   7:	nopl	(%rax,%rax)
   c:	pushq	%rbp
   d:	movq	%rsp, %rbp
  10:	endbr64
  14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
  18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
  1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
  1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
  1e:	jmp	0x21			;; ---------+               |
  20:	pushq	%rax			;; <--------|---------------+
  21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
  22:	pushq	%rbx			;; callee saved
  23:	pushq	%r13			;; callee saved
  25:	movq	%rdi, %rbx		;; rbx = skb (callee saved)
; count++;
  28:	movabsq	$-123406219759616, %r13
  32:	movl	(%r13), %edi
  36:	addl	$1, %edi
  39:	movl	%edi, (%r13)
; subprog_tail(skb);
  3d:	movq	%rbx, %rdi		;; rdi = skb
  40:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
  47:	callq	0x80			;; call subprog_tail()
; subprog_tail(skb);
  4c:	movq	%rbx, %rdi		;; rdi = skb
  4f:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
  56:	callq	0x80			;; call subprog_tail()
; return count;
  5b:	movl	(%r13), %eax
; return count;
  5f:	popq	%r13
  61:	popq	%rbx
  62:	leave
  63:	retq

int subprog_tail(struct __sk_buff * skb):
bpf_prog_3a140cef239a4b4f_subprog_tail:
; int subprog_tail(struct __sk_buff *skb)
   0:	endbr64
   4:	nopl	(%rax)			;; do not touch tail_call_cnt
   7:	nopl	(%rax,%rax)
   c:	pushq	%rbp
   d:	movq	%rsp, %rbp
  10:	endbr64
  14:	pushq	%rax			;; [rbp - 8]  = rax (tcc_ptr)
  15:	pushq	%rax			;; [rbp - 16] = rax (tcc_ptr)
  16:	pushq	%rbx			;; callee saved
  17:	pushq	%r13			;; callee saved
  19:	movq	%rdi, %rbx		;; rbx = skb
; asm volatile("r1 = %[ctx]\n\t"
  1c:	movabsq	$-128494642337280, %r13	;; r13 = jmp_table
  26:	movq	%rbx, %rdi		;; 1st arg, skb
  29:	movq	%r13, %rsi		;; 2nd arg, jmp_table
  2c:	xorl	%edx, %edx		;; 3rd arg, index = 0
  2e:	movq	-16(%rbp), %rax		;; rax = [rbp - 16] (tcc_ptr)
  35:	cmpq	$33, (%rax)
  39:	jae	0x4e			;; if *tcc_ptr >= 33 goto 0x4e --------+
  3b:	nopl	(%rax,%rax)		;; jmp bypass, toggled by poking       |
  40:	addq	$1, (%rax)		;; (*tcc_ptr)++                        |
  44:	popq	%r13			;; callee saved                        |
  46:	popq	%rbx			;; callee saved                        |
  47:	popq	%rax			;; undo rbp-16 push                    |
  48:	popq	%rax			;; undo rbp-8  push                    |
  49:	jmp	0xfffffffffffffe18	;; tail call target, toggled by poking |
; return 0;				;;                                     |
  4e:	popq	%r13			;; restore callee saved <--------------+
  50:	popq	%rbx			;; restore callee saved
  51:	leave
  52:	retq

As a result, tail_call_cnt is stored on the stack of entry_tc, and the
tail_call_cnt_ptr is propagated between subprog_tc, entry_freplace,
subprog_tail and entry_tc.

Furthermore, the trampoline must always propagate
tail_call_cnt/tail_call_cnt_ptr, no matter whether a tail call actually
happens at run time. So this patch reuses trampoline flag BIT(7),
renamed to BPF_TRAMP_F_TAIL_CALL, to tell the trampoline to propagate
tail_call_cnt/tail_call_cnt_ptr; this is safe because
BPF_TRAMP_F_TAIL_CALL_CTX is not used by any other arch BPF JIT.

The downside is that tail_call_cnt must now always be initialized in
the prologue, whether or not the prog is tail_call_reachable, because a
prog cannot know whether itself or one of its subprogs will be attached
by a freplace prog. Likewise, when calling a subprog, the
tail_call_cnt_ptr must always be propagated to the subprog.
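
A rough sketch of what the x86 trampoline emits when the flag is set
(the stack offset "off" is illustrative; see the diff below for the
real emission):

	pushq	%rax			;; save incoming tail_call_cnt(_ptr)
	...				;; invoke fentry/fexit progs
	movq	-off(%rbp), %rax	;; reload tail_call_cnt_ptr
	callq	orig_call		;; the pointer reaches the target prog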

Fixes: 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 44 +++++++++++++++++++++----------------
 include/linux/bpf.h         |  6 ++---
 kernel/bpf/trampoline.c     |  4 ++--
 kernel/bpf/verifier.c       |  4 ++--
 4 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 074b41fafbe3f..85d9af13f329f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -274,6 +274,8 @@ struct jit_context {
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
 #define X86_TAIL_CALL_OFFSET	(12 + ENDBR_INSN_SIZE)
+/* Number of extra bytes that will be skipped on poke */
+#define X86_POKE_EXTRA		3
 
 static void push_r12(u8 **pprog)
 {
@@ -440,18 +442,14 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
  * while jumping to another program
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
-			  bool tail_call_reachable, bool is_subprog,
+			  bool is_extension, bool is_subprog,
 			  bool is_exception_cb)
 {
 	u8 *prog = *pprog;
 
 	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
-	/* BPF trampoline can be made to work without these nops,
-	 * but let's waste 5 bytes for now and optimize later
-	 */
-	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
-		if (tail_call_reachable && !is_subprog)
+		if (!is_extension && !is_subprog)
 			/* When it's the entry of the whole tailcall context,
 			 * zeroing rax means initialising tail_call_cnt.
 			 */
@@ -460,6 +458,10 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 			/* Keep the same instruction layout. */
 			emit_nops(&prog, 3);     /* nop3 */
 	}
+	/* BPF trampoline can be made to work without these nops,
+	 * but let's waste 5 bytes for now and optimize later
+	 */
+	emit_nops(&prog, X86_PATCH_SIZE);
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
 		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
@@ -483,7 +485,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	if (tail_call_reachable)
+	if (!ebpf_from_cbpf)
 		emit_prologue_tail_call(&prog, is_subprog);
 	*pprog = prog;
 }
@@ -573,10 +575,13 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 
 	/*
 	 * See emit_prologue(), for IBT builds the trampoline hook is preceded
-	 * with an ENDBR instruction.
+	 * with an ENDBR instruction and 3 bytes tail_call_cnt initialization
+	 * instruction.
 	 */
 	if (is_endbr(*(u32 *)ip))
 		ip += ENDBR_INSN_SIZE;
+	if (is_bpf_text_address((long)ip))
+		ip += X86_POKE_EXTRA;
 
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr);
 }
@@ -1365,7 +1370,7 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
 		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
-	bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
+	bool is_extension = bpf_prog->type == BPF_PROG_TYPE_EXT;
 	struct bpf_insn *insn = bpf_prog->insnsi;
 	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
@@ -1383,7 +1388,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
+		      bpf_prog_was_classic(bpf_prog), is_extension,
 		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
@@ -2077,10 +2082,8 @@ st:			if (is_imm8(insn->off))
 			u8 *ip = image + addrs[i - 1];
 
 			func = (u8 *) __bpf_call_base + imm32;
-			if (tail_call_reachable) {
-				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
-				ip += 7;
-			}
+			LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+			ip += 7;
 			if (!imm32)
 				return -EINVAL;
 			ip += x86_call_depth_emit_accounting(&prog, func, ip);
@@ -2877,7 +2880,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt_ptr ] BPF_TRAMP_F_TAIL_CALL_CTX
+	 * RSP                 [ tail_call_cnt_ptr ] BPF_TRAMP_F_TAIL_CALL
 	 */
 
 	/* room for return value of orig_call or fentry prog */
@@ -2923,6 +2926,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		 */
 		if (is_endbr(*(u32 *)orig_call))
 			orig_call += ENDBR_INSN_SIZE;
+		if (is_bpf_text_address((long)orig_call))
+			orig_call += X86_POKE_EXTRA;
 		orig_call += X86_PATCH_SIZE;
 	}
 
@@ -2949,8 +2954,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		/* sub rsp, stack_size */
 		EMIT4(0x48, 0x83, 0xEC, stack_size);
 	}
-	if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-		EMIT1(0x50);		/* push rax */
+	if (flags & BPF_TRAMP_F_TAIL_CALL)
+		EMIT1(0x50);	 /* push rax */
 	/* mov QWORD PTR [rbp - rbx_off], rbx */
 	emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_6, -rbx_off);
 
@@ -3005,7 +3010,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		restore_regs(m, &prog, regs_off);
 		save_args(m, &prog, arg_stack_off, true);
 
-		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
+		if (flags & BPF_TRAMP_F_TAIL_CALL) {
 			/* Before calling the original function, load the
 			 * tail_call_cnt_ptr from stack to rax.
 			 */
@@ -3025,6 +3030,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		/* remember return value in a stack for bpf prog to access */
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
 		im->ip_after_call = image + (prog - (u8 *)rw_image);
+		emit_nops(&prog, X86_POKE_EXTRA);
 		emit_nops(&prog, X86_PATCH_SIZE);
 	}
 
@@ -3067,7 +3073,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 			ret = -EINVAL;
 			goto cleanup;
 		}
-	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
+	} else if (flags & BPF_TRAMP_F_TAIL_CALL) {
 		/* Before running the original function, load the
 		 * tail_call_cnt_ptr from stack to rax.
 		 */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 00dc4dd28cbd9..aa4a2799eaed1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1074,10 +1074,10 @@ struct btf_func_model {
  */
 #define BPF_TRAMP_F_SHARE_IPMODIFY	BIT(6)
 
-/* Indicate that current trampoline is in a tail call context. Then, it has to
- * cache and restore tail_call_cnt to avoid infinite tail call loop.
+/* Indicate that current trampoline is required to cache and restore
+ * tail_call_cnt or tail_call_cnt_ptr to avoid infinite tail call loop.
  */
-#define BPF_TRAMP_F_TAIL_CALL_CTX	BIT(7)
+#define BPF_TRAMP_F_TAIL_CALL		BIT(7)
 
 /*
  * Indicate the trampoline should be suitable to receive indirect calls;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index f8302a5ca400d..257adbc83bae3 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -409,8 +409,8 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut
 		goto out;
 	}
 
-	/* clear all bits except SHARE_IPMODIFY and TAIL_CALL_CTX */
-	tr->flags &= (BPF_TRAMP_F_SHARE_IPMODIFY | BPF_TRAMP_F_TAIL_CALL_CTX);
+	/* clear all bits except SHARE_IPMODIFY and TAIL_CALL */
+	tr->flags &= (BPF_TRAMP_F_SHARE_IPMODIFY | BPF_TRAMP_F_TAIL_CALL);
 
 	if (tlinks[BPF_TRAMP_FEXIT].nr_links ||
 	    tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1485364464576..aecd4d7cdf205 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -22096,8 +22096,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	if (!tr)
 		return -ENOMEM;
 
-	if (tgt_prog && tgt_prog->aux->tail_call_reachable)
-		tr->flags = BPF_TRAMP_F_TAIL_CALL_CTX;
+	if (tgt_prog)
+		tr->flags = BPF_TRAMP_F_TAIL_CALL;
 
 	prog->aux->dst_trampoline = tr;
 	return 0;
-- 
2.44.0



* [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
@ 2024-08-25 13:09 ` Leon Hwang
  2024-08-26 14:32   ` Xu Kuohai
  2024-08-25 13:09 ` [PATCH bpf-next 3/4] selftests/bpf: Add testcases for another tailcall infinite loop fix Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 4/4] selftests/bpf: Fix verifier tailcall jit selftest Leon Hwang
  3 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-25 13:09 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, eddyz87, iii, leon.hwang, kernel-patches-bot

Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
issue happens on arm64, too.

For example:

tc_bpf2bpf.c:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

__noinline
int subprog_tc(struct __sk_buff *skb)
{
	return skb->len * 2;
}

SEC("tc")
int entry_tc(struct __sk_buff *skb)
{
	return subprog_tc(skb);
}

char __license[] SEC("license") = "GPL";

tailcall_bpf2bpf_hierarchy_freplace.c:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("freplace")
int entry_freplace(struct __sk_buff *skb)
{
	count++;
	subprog_tail(skb);
	subprog_tail(skb);
	return count;
}

char __license[] SEC("license") = "GPL";

The attach target of entry_freplace is subprog_tc, and the tail callee
in subprog_tail is entry_tc.

The infinite loop is then entry_tc -> subprog_tc -> entry_freplace ->
subprog_tail --tailcall-> entry_tc, because the tail_call_cnt on
entry_freplace's stack restarts from zero on every execution of
entry_freplace.

This patch fixes the issue by not touching tail_call_cnt in the
prologue when the prog is a subprog or a freplace prog.

However, when a freplace prog attaches to entry_tc, it has to
initialize tail_call_cnt and tail_call_cnt_ptr itself, because its
target is a main prog and the target's prologue has not initialized
them before the attach hook.

So this patch uses the x7 register to tell the freplace prog whether
its target prog is a main prog.

Meanwhile, when tail calling to a freplace prog, x7 must be reset to
prevent re-initializing tail_call_cnt in the freplace prog's prologue.
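
Putting it together, the intended instruction flow is roughly as
follows (a sketch of the intent, not the exact emitted stream; see the
diff below):

// non-freplace prog prologue:
//	mov x7, #1		// #1 for main prog, #0 for subprog
//	nop			// POKE_OFFSET patch site
// freplace prog prologue:
//	nop			// leave x7 as the target's entry set it
//	...
//	mov tmp, #1
//	cmp x7, tmp		// attached to a main prog?
//	b.ne 1f
//	add x26, sp, #16	// x26 (TCCNT_PTR) = &tail_call_cnt
//	mov x7, #0
//	str x7, [x26]		// tail_call_cnt = 0
// 1:
// emit_bpf_tail_call():
//	mov x7, #0		// don't re-init in a freplace tail call target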

Fixes: 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/arm64/net/bpf_jit_comp.c | 39 +++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 59e05a7aea56a..4f8189824973f 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -276,6 +276,7 @@ static bool is_lsi_offset(int offset, int scale)
 /* generated prologue:
  *      bti c // if CONFIG_ARM64_BTI_KERNEL
  *      mov x9, lr
+ *      mov x7, 1 // if not-freplace main prog
  *      nop  // POKE_OFFSET
  *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
  *      stp x29, lr, [sp, #-16]!
@@ -293,13 +294,14 @@ static bool is_lsi_offset(int offset, int scale)
 static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
 {
 	const struct bpf_prog *prog = ctx->prog;
+	const bool is_ext = prog->type == BPF_PROG_TYPE_EXT;
 	const bool is_main_prog = !bpf_is_subprog(prog);
 	const u8 ptr = bpf2a64[TCCNT_PTR];
 	const u8 fp = bpf2a64[BPF_REG_FP];
 	const u8 tcc = ptr;
 
 	emit(A64_PUSH(ptr, fp, A64_SP), ctx);
-	if (is_main_prog) {
+	if (is_main_prog && !is_ext) {
 		/* Initialize tail_call_cnt. */
 		emit(A64_MOVZ(1, tcc, 0, 0), ctx);
 		emit(A64_PUSH(tcc, fp, A64_SP), ctx);
@@ -315,22 +317,26 @@ static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
 #define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
 
 /* Offset of nop instruction in bpf prog entry to be poked */
-#define POKE_OFFSET (BTI_INSNS + 1)
+#define POKE_OFFSET (BTI_INSNS + 2)
 
 /* Tail call offset to jump into */
-#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 10)
+#define PROLOGUE_OFFSET (BTI_INSNS + 3 + PAC_INSNS + 10)
 
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
 			  bool is_exception_cb, u64 arena_vm_start)
 {
 	const struct bpf_prog *prog = ctx->prog;
+	const bool is_ext = prog->type == BPF_PROG_TYPE_EXT;
 	const bool is_main_prog = !bpf_is_subprog(prog);
+	const u8 r0 = bpf2a64[BPF_REG_0];
 	const u8 r6 = bpf2a64[BPF_REG_6];
 	const u8 r7 = bpf2a64[BPF_REG_7];
 	const u8 r8 = bpf2a64[BPF_REG_8];
 	const u8 r9 = bpf2a64[BPF_REG_9];
 	const u8 fp = bpf2a64[BPF_REG_FP];
 	const u8 fpb = bpf2a64[FP_BOTTOM];
+	const u8 ptr = bpf2a64[TCCNT_PTR];
+	const u8 tmp = bpf2a64[TMP_REG_1];
 	const u8 arena_vm_base = bpf2a64[ARENA_VM_START];
 	const int idx0 = ctx->idx;
 	int cur_offset;
@@ -367,6 +373,10 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
 	emit_bti(A64_BTI_JC, ctx);
 
 	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
+	if (!is_ext)
+		emit(A64_MOVZ(1, r0, is_main_prog, 0), ctx);
+	else
+		emit(A64_NOP, ctx);
 	emit(A64_NOP, ctx);
 
 	if (!is_exception_cb) {
@@ -413,6 +423,19 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
 		emit_bti(A64_BTI_J, ctx);
 	}
 
+	/* If freplace's target prog is main prog, it has to make x26 as
+	 * tail_call_cnt_ptr, and then initialize tail_call_cnt via the
+	 * tail_call_cnt_ptr.
+	 */
+	if (is_main_prog && is_ext) {
+		emit(A64_MOVZ(1, tmp, 1, 0), ctx);
+		emit(A64_CMP(1, r0, tmp), ctx);
+		emit(A64_B_(A64_COND_NE, 4), ctx);
+		emit(A64_ADD_I(1, ptr, A64_SP, 16), ctx);
+		emit(A64_MOVZ(1, r0, 0, 0), ctx);
+		emit(A64_STR64I(r0, ptr, 0), ctx);
+	}
+
 	/*
 	 * Program acting as exception boundary should save all ARM64
 	 * Callee-saved registers as the exception callback needs to recover
@@ -444,6 +467,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
 static int out_offset = -1; /* initialized on the first pass of build_body() */
 static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
+	const u8 r0 = bpf2a64[BPF_REG_0];
 	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
 	const u8 r2 = bpf2a64[BPF_REG_2];
 	const u8 r3 = bpf2a64[BPF_REG_3];
@@ -491,6 +515,11 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 
 	/* Update tail_call_cnt if the slot is populated. */
 	emit(A64_STR64I(tcc, ptr, 0), ctx);
+	/* When freplace prog tail calls freplace prog, setting r0 as 0 is to
+	 * prevent re-initializing tail_call_cnt at the prologue of target
+	 * freplace prog.
+	 */
+	emit(A64_MOVZ(1, r0, 0, 0), ctx);
 
 	/* goto *(prog->bpf_func + prologue_offset); */
 	off = offsetof(struct bpf_prog, bpf_func);
@@ -2199,9 +2228,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
 		emit(A64_RET(A64_R(10)), ctx);
 		/* store return value */
 		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
-		/* reserve a nop for bpf_tramp_image_put */
+		/* reserve two nops for bpf_tramp_image_put */
 		im->ip_after_call = ctx->ro_image + ctx->idx;
 		emit(A64_NOP, ctx);
+		emit(A64_NOP, ctx);
 	}
 
 	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
@@ -2484,6 +2514,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
 		/* skip to the nop instruction in bpf prog entry:
 		 * bti c // if BTI enabled
 		 * mov x9, x30
+		 * mov x7, 1 // if not-freplace main prog
 		 * nop
 		 */
 		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
-- 
2.44.0



* [PATCH bpf-next 3/4] selftests/bpf: Add testcases for another tailcall infinite loop fix
  2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 2/4] bpf, arm64: " Leon Hwang
@ 2024-08-25 13:09 ` Leon Hwang
  2024-08-25 13:09 ` [PATCH bpf-next 4/4] selftests/bpf: Fix verifier tailcall jit selftest Leon Hwang
  3 siblings, 0 replies; 24+ messages in thread
From: Leon Hwang @ 2024-08-25 13:09 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, eddyz87, iii, leon.hwang, kernel-patches-bot

The issue has been fixed on both x64 and arm64.

On x64:

cd tools/testing/selftests/bpf; ./test_progs -t tailcalls
333/26  tailcalls/tailcall_bpf2bpf_freplace:OK
333/27  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_1:OK
333/28  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_2:OK
333/29  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_3:OK
333/30  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_4:OK
333/31  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_5:OK
333     tailcalls:OK
Summary: 1/31 PASSED, 0 SKIPPED, 0 FAILED

On arm64:

cd tools/testing/selftests/bpf; ./test_progs -t tailcalls
333/26  tailcalls/tailcall_bpf2bpf_freplace:OK
333/27  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_1:OK
333/28  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_2:OK
333/29  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_3:OK
333/30  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_4:OK
333/31  tailcalls/tailcall_bpf2bpf_hierarchy_freplace_5:OK
333     tailcalls:OK
Summary: 1/31 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 209 +++++++++++++++++-
 .../tailcall_bpf2bpf_hierarchy_freplace.c     |  30 +++
 .../testing/selftests/bpf/progs/tc_bpf2bpf.c  |  18 +-
 3 files changed, 253 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy_freplace.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index 21c5a37846ade..5df6721da3c51 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -5,6 +5,7 @@
 #include "tailcall_poke.skel.h"
 #include "tailcall_bpf2bpf_hierarchy2.skel.h"
 #include "tailcall_bpf2bpf_hierarchy3.skel.h"
+#include "tailcall_bpf2bpf_hierarchy_freplace.skel.h"
 #include "tailcall_freplace.skel.h"
 #include "tc_bpf2bpf.skel.h"
 
@@ -1525,7 +1526,8 @@ static void test_tailcall_freplace(void)
 
 	prog_fd = bpf_program__fd(tc_skel->progs.entry_tc);
 	freplace_prog = freplace_skel->progs.entry_freplace;
-	err = bpf_program__set_attach_target(freplace_prog, prog_fd, "subprog");
+	err = bpf_program__set_attach_target(freplace_prog, prog_fd,
+					     "subprog_tc");
 	if (!ASSERT_OK(err, "set_attach_target"))
 		goto out;
 
@@ -1534,7 +1536,7 @@ static void test_tailcall_freplace(void)
 		goto out;
 
 	freplace_link = bpf_program__attach_freplace(freplace_prog, prog_fd,
-						     "subprog");
+						     "subprog_tc");
 	if (!ASSERT_OK_PTR(freplace_link, "attach_freplace"))
 		goto out;
 
@@ -1556,6 +1558,197 @@ static void test_tailcall_freplace(void)
 	tailcall_freplace__destroy(freplace_skel);
 }
 
+/* test_tailcall_bpf2bpf_freplace checks that the count value of the tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ * entry_tc --> subprog_tc --jump-> entry_freplace --tailcall-> entry_tc
+ */
+static void test_tailcall_bpf2bpf_freplace(void)
+{
+	struct tailcall_freplace *freplace_skel = NULL;
+	struct bpf_link *freplace_link = NULL;
+	struct tc_bpf2bpf *tc_skel = NULL;
+	char buff[128] = {};
+	int prog_fd, map_fd;
+	int err, i;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		    .data_in = buff,
+		    .data_size_in = sizeof(buff),
+		    .repeat = 1,
+	);
+
+	tc_skel = tc_bpf2bpf__open_and_load();
+	if (!ASSERT_OK_PTR(tc_skel, "tc_bpf2bpf__open_and_load"))
+		goto out;
+
+	prog_fd = bpf_program__fd(tc_skel->progs.entry_tc);
+	freplace_skel = tailcall_freplace__open();
+	if (!ASSERT_OK_PTR(freplace_skel, "tailcall_freplace__open"))
+		goto out;
+
+	err = bpf_program__set_attach_target(freplace_skel->progs.entry_freplace,
+					     prog_fd, "subprog_tc");
+	if (!ASSERT_OK(err, "set_attach_target"))
+		goto out;
+
+	err = tailcall_freplace__load(freplace_skel);
+	if (!ASSERT_OK(err, "tailcall_freplace__load"))
+		goto out;
+
+	i = 0;
+	map_fd = bpf_map__fd(freplace_skel->maps.jmp_table);
+	err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	freplace_link = bpf_program__attach_freplace(freplace_skel->progs.entry_freplace,
+						     prog_fd, "subprog_tc");
+	if (!ASSERT_OK_PTR(freplace_link, "attach_freplace"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run_opts");
+	ASSERT_EQ(topts.retval, 34, "test_run_opts retval");
+
+out:
+	bpf_link__destroy(freplace_link);
+	tailcall_freplace__destroy(freplace_skel);
+	tc_bpf2bpf__destroy(tc_skel);
+}
+
+static void test_tailcall_bpf2bpf_hierarchy_freplace(bool freplace_subprog,
+						     bool tailcall_tc,
+						     bool target_entry2,
+						     int tail_call_cnt1,
+						     int tail_call_cnt2)
+{
+	struct tailcall_bpf2bpf_hierarchy_freplace *freplace_skel = NULL;
+	struct bpf_link *freplace_link = NULL;
+	int freplace_prog_fd, prog_fd, map_fd;
+	struct tc_bpf2bpf *tc_skel = NULL;
+	char buff[128] = {};
+	int err, i, val;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		    .data_in = buff,
+		    .data_size_in = sizeof(buff),
+		    .repeat = 1,
+	);
+
+	tc_skel = tc_bpf2bpf__open_and_load();
+	if (!ASSERT_OK_PTR(tc_skel, "tc_bpf2bpf__open_and_load"))
+		goto out;
+
+	prog_fd = bpf_program__fd(target_entry2 ? tc_skel->progs.entry_tc_2 :
+				  tc_skel->progs.entry_tc);
+	freplace_skel = tailcall_bpf2bpf_hierarchy_freplace__open();
+	if (!ASSERT_OK_PTR(freplace_skel, "tailcall_bpf2bpf_hierarchy_freplace__open"))
+		goto out;
+
+	err = bpf_program__set_attach_target(freplace_skel->progs.entry_freplace,
+					     prog_fd, freplace_subprog ?
+					     "subprog_tc" : "entry_tc");
+	if (!ASSERT_OK(err, "set_attach_target"))
+		goto out;
+
+	err = tailcall_bpf2bpf_hierarchy_freplace__load(freplace_skel);
+	if (!ASSERT_OK(err, "tailcall_bpf2bpf_hierarchy_freplace__load"))
+		goto out;
+
+	freplace_prog_fd = bpf_program__fd(freplace_skel->progs.entry_freplace);
+	map_fd = bpf_map__fd(freplace_skel->maps.jmp_table);
+	val = tailcall_tc ? prog_fd : freplace_prog_fd;
+	i = 0;
+	err = bpf_map_update_elem(map_fd, &i, &val, BPF_ANY);
+	if (!ASSERT_OK(err, "update jmp_table"))
+		goto out;
+
+	freplace_link = bpf_program__attach_freplace(freplace_skel->progs.entry_freplace,
+						     prog_fd, freplace_subprog ?
+						     "subprog_tc" : "entry_tc");
+	if (!ASSERT_OK_PTR(freplace_link, "attach_freplace"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run_opts");
+	ASSERT_EQ(topts.retval, tail_call_cnt1, "test_run_opts retval");
+
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (!ASSERT_OK(err, "delete_elem from jmp_table"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run_opts again");
+	ASSERT_EQ(topts.retval, tail_call_cnt2, "test_run_opts retval again");
+
+out:
+	bpf_link__destroy(freplace_link);
+	tailcall_bpf2bpf_hierarchy_freplace__destroy(freplace_skel);
+	tc_bpf2bpf__destroy(tc_skel);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_freplace_1 checks the count value of tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ *                                    subprog_tail --tailcall-> entry_freplace
+ * entry_tc --jump-> entry_freplace <
+ *                                    subprog_tail --tailcall-> entry_freplace
+ */
+static void test_tailcall_bpf2bpf_hierarchy_freplace_1(void)
+{
+	test_tailcall_bpf2bpf_hierarchy_freplace(false, false, false, 34, 35);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_freplace_2 checks the count value of tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ *                                                   subprog_tail --tailcall-> entry_freplace
+ * entry_tc --> subprog_tc --jump-> entry_freplace <
+ *                                                   subprog_tail --tailcall-> entry_freplace
+ */
+static void test_tailcall_bpf2bpf_hierarchy_freplace_2(void)
+{
+	test_tailcall_bpf2bpf_hierarchy_freplace(true, false, false, 34, 35);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_freplace_3 checks the count value of tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ *                                                   subprog_tail --tailcall-> entry_tc
+ * entry_tc --> subprog_tc --jump-> entry_freplace <
+ *                                                   subprog_tail --tailcall-> entry_tc
+ */
+static void test_tailcall_bpf2bpf_hierarchy_freplace_3(void)
+{
+	test_tailcall_bpf2bpf_hierarchy_freplace(true, true, false, 34, 35);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_freplace_4 checks the count value of tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ *                                                                     subprog_tail --tailcall-> entry_freplace
+ * entry_tc_2 --> subprog_tc (call 10 times) --jump-> entry_freplace <
+ *                                                                     subprog_tail --tailcall-> entry_freplace
+ */
+static void test_tailcall_bpf2bpf_hierarchy_freplace_4(void)
+{
+	test_tailcall_bpf2bpf_hierarchy_freplace(true, false, true, 43, 53);
+}
+
+/* test_tailcall_bpf2bpf_hierarchy_freplace_5 checks the count value of tail
+ * call limit enforcement matches with expectation for this case:
+ *
+ *                                                                     subprog_tail --tailcall-> entry_tc_2
+ * entry_tc_2 --> subprog_tc (call 10 times) --jump-> entry_freplace <
+ *                                                                     subprog_tail --tailcall-> entry_tc_2
+ */
+static void test_tailcall_bpf2bpf_hierarchy_freplace_5(void)
+{
+	test_tailcall_bpf2bpf_hierarchy_freplace(true, true, true, 340, 350);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -1606,4 +1799,16 @@ void test_tailcalls(void)
 	test_tailcall_bpf2bpf_hierarchy_3();
 	if (test__start_subtest("tailcall_freplace"))
 		test_tailcall_freplace();
+	if (test__start_subtest("tailcall_bpf2bpf_freplace"))
+		test_tailcall_bpf2bpf_freplace();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_freplace_1"))
+		test_tailcall_bpf2bpf_hierarchy_freplace_1();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_freplace_2"))
+		test_tailcall_bpf2bpf_hierarchy_freplace_2();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_freplace_3"))
+		test_tailcall_bpf2bpf_hierarchy_freplace_3();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_freplace_4"))
+		test_tailcall_bpf2bpf_hierarchy_freplace_4();
+	if (test__start_subtest("tailcall_bpf2bpf_hierarchy_freplace_5"))
+		test_tailcall_bpf2bpf_hierarchy_freplace_5();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy_freplace.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy_freplace.c
new file mode 100644
index 0000000000000..6f7c1fac9ddb7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf_hierarchy_freplace.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+int count = 0;
+
+static __noinline
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	return 0;
+}
+
+SEC("freplace")
+int entry_freplace(struct __sk_buff *skb)
+{
+	count++;
+	subprog_tail(skb);
+	subprog_tail(skb);
+	return count;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/tc_bpf2bpf.c b/tools/testing/selftests/bpf/progs/tc_bpf2bpf.c
index 8a0632c37839a..4f8263f634c9c 100644
--- a/tools/testing/selftests/bpf/progs/tc_bpf2bpf.c
+++ b/tools/testing/selftests/bpf/progs/tc_bpf2bpf.c
@@ -5,10 +5,11 @@
 #include "bpf_misc.h"
 
 __noinline
-int subprog(struct __sk_buff *skb)
+int subprog_tc(struct __sk_buff *skb)
 {
 	int ret = 1;
 
+	__sink(skb);
 	__sink(ret);
 	return ret;
 }
@@ -16,7 +17,20 @@ int subprog(struct __sk_buff *skb)
 SEC("tc")
 int entry_tc(struct __sk_buff *skb)
 {
-	return subprog(skb);
+	return subprog_tc(skb);
+}
+
+SEC("tc")
+int entry_tc_2(struct __sk_buff *skb)
+{
+	int ret, i;
+
+	for (i = 0; i < 10; i++) {
+		ret = subprog_tc(skb);
+		__sink(ret);
+	}
+
+	return ret;
 }
 
 char __license[] SEC("license") = "GPL";
-- 
2.44.0



* [PATCH bpf-next 4/4] selftests/bpf: Fix verifier tailcall jit selftest
  2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
                   ` (2 preceding siblings ...)
  2024-08-25 13:09 ` [PATCH bpf-next 3/4] selftests/bpf: Add testcases for another tailcall infinite loop fix Leon Hwang
@ 2024-08-25 13:09 ` Leon Hwang
  3 siblings, 0 replies; 24+ messages in thread
From: Leon Hwang @ 2024-08-25 13:09 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, eddyz87, iii, leon.hwang, kernel-patches-bot

The verifier tailcall jit selftest is broken because the x86 bpf prog
prologue has been updated to fix the tailcall infinite loop caused by
freplace.
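
For reference, the main-prog prologue now begins as follows (taken from
the JITed dumps in patch 1), so the expected instruction order in the
selftest must be swapped to match:

	endbr64
	xorq	%rax, %rax		;; init tail_call_cnt (nop3 in subprogs)
	nopl	(%rax,%rax)		;; 5-byte trampoline patch site
	pushq	%rbp
	movq	%rsp, %rbp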

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 tools/testing/selftests/bpf/progs/verifier_tailcall_jit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/verifier_tailcall_jit.c b/tools/testing/selftests/bpf/progs/verifier_tailcall_jit.c
index 8d60c634a114f..60a6045e2d9ae 100644
--- a/tools/testing/selftests/bpf/progs/verifier_tailcall_jit.c
+++ b/tools/testing/selftests/bpf/progs/verifier_tailcall_jit.c
@@ -33,8 +33,8 @@ __success
 __arch_x86_64
 /* program entry for main(), regular function prologue */
 __jited("	endbr64")
-__jited("	nopl	(%rax,%rax)")
 __jited("	xorq	%rax, %rax")
+__jited("	nopl	(%rax,%rax)")
 __jited("	pushq	%rbp")
 __jited("	movq	%rsp, %rbp")
 /* tail call prologue for program:
@@ -63,8 +63,8 @@ __jited("	{{(retq|jmp	0x)}}")		/* return or jump to rethunk */
 __jited("...")
 /* subprogram entry for sub(), regular function prologue */
 __jited("	endbr64")
-__jited("	nopl	(%rax,%rax)")
 __jited("	nopl	(%rax)")
+__jited("	nopl	(%rax,%rax)")
 __jited("	pushq	%rbp")
 __jited("	movq	%rsp, %rbp")
 /* tail call prologue for subprogram address of tail call counter
-- 
2.44.0



* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-25 13:09 ` [PATCH bpf-next 2/4] bpf, arm64: " Leon Hwang
@ 2024-08-26 14:32   ` Xu Kuohai
  2024-08-27  2:23     ` Leon Hwang
  0 siblings, 1 reply; 24+ messages in thread
From: Xu Kuohai @ 2024-08-26 14:32 UTC (permalink / raw)
  To: Leon Hwang, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot

On 8/25/2024 9:09 PM, Leon Hwang wrote:
> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
> issue happens on arm64, too.
> 
> For example:
> 
> tc_bpf2bpf.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> 
> __noinline
> int subprog_tc(struct __sk_buff *skb)
> {
> 	return skb->len * 2;
> }
> 
> SEC("tc")
> int entry_tc(struct __sk_buff *skb)
> {
> 	return subprog_tc(skb);
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> tailcall_bpf2bpf_hierarchy_freplace.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> 
> struct {
> 	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> 	__uint(max_entries, 1);
> 	__uint(key_size, sizeof(__u32));
> 	__uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
> 
> int count = 0;
> 
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> 	bpf_tail_call_static(skb, &jmp_table, 0);
> 	return 0;
> }
> 
> SEC("freplace")
> int entry_freplace(struct __sk_buff *skb)
> {
> 	count++;
> 	subprog_tail(skb);
> 	subprog_tail(skb);
> 	return count;
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> The attach target of entry_freplace is subprog_tc, and the tail callee
> in subprog_tail is entry_tc.
> 
> The infinite loop is then entry_tc -> subprog_tc -> entry_freplace ->
> subprog_tail --tailcall-> entry_tc, because the tail_call_cnt on
> entry_freplace's stack restarts from zero on every execution of
> entry_freplace.
> 
> This patch fixes the issue by not touching tail_call_cnt in the
> prologue when the prog is a subprog or a freplace prog.
> 
> However, when a freplace prog attaches to entry_tc, it has to
> initialize tail_call_cnt and tail_call_cnt_ptr itself, because its
> target is a main prog and the target's prologue has not initialized
> them before the attach hook.
> 
> So this patch uses the x7 register to tell the freplace prog whether
> its target prog is a main prog.
> 
> Meanwhile, when tail calling to a freplace prog, x7 must be reset to
> prevent re-initializing tail_call_cnt in the freplace prog's prologue.
> 
> Fixes: 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility")
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
>   arch/arm64/net/bpf_jit_comp.c | 39 +++++++++++++++++++++++++++++++----
>   1 file changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 59e05a7aea56a..4f8189824973f 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -276,6 +276,7 @@ static bool is_lsi_offset(int offset, int scale)
>   /* generated prologue:
>    *      bti c // if CONFIG_ARM64_BTI_KERNEL
>    *      mov x9, lr
> + *      mov x7, 1 // if not-freplace main prog
>    *      nop  // POKE_OFFSET
>    *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
>    *      stp x29, lr, [sp, #-16]!
> @@ -293,13 +294,14 @@ static bool is_lsi_offset(int offset, int scale)
>   static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
>   {
>   	const struct bpf_prog *prog = ctx->prog;
> +	const bool is_ext = prog->type == BPF_PROG_TYPE_EXT;
>   	const bool is_main_prog = !bpf_is_subprog(prog);
>   	const u8 ptr = bpf2a64[TCCNT_PTR];
>   	const u8 fp = bpf2a64[BPF_REG_FP];
>   	const u8 tcc = ptr;
>   
>   	emit(A64_PUSH(ptr, fp, A64_SP), ctx);
> -	if (is_main_prog) {
> +	if (is_main_prog && !is_ext) {
>   		/* Initialize tail_call_cnt. */
>   		emit(A64_MOVZ(1, tcc, 0, 0), ctx);
>   		emit(A64_PUSH(tcc, fp, A64_SP), ctx);
> @@ -315,22 +317,26 @@ static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
>   #define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
>   
>   /* Offset of nop instruction in bpf prog entry to be poked */
> -#define POKE_OFFSET (BTI_INSNS + 1)
> +#define POKE_OFFSET (BTI_INSNS + 2)
>   
>   /* Tail call offset to jump into */
> -#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 10)
> +#define PROLOGUE_OFFSET (BTI_INSNS + 3 + PAC_INSNS + 10)
>   
>   static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
>   			  bool is_exception_cb, u64 arena_vm_start)
>   {
>   	const struct bpf_prog *prog = ctx->prog;
> +	const bool is_ext = prog->type == BPF_PROG_TYPE_EXT;
>   	const bool is_main_prog = !bpf_is_subprog(prog);
> +	const u8 r0 = bpf2a64[BPF_REG_0];
>   	const u8 r6 = bpf2a64[BPF_REG_6];
>   	const u8 r7 = bpf2a64[BPF_REG_7];
>   	const u8 r8 = bpf2a64[BPF_REG_8];
>   	const u8 r9 = bpf2a64[BPF_REG_9];
>   	const u8 fp = bpf2a64[BPF_REG_FP];
>   	const u8 fpb = bpf2a64[FP_BOTTOM];
> +	const u8 ptr = bpf2a64[TCCNT_PTR];
> +	const u8 tmp = bpf2a64[TMP_REG_1];
>   	const u8 arena_vm_base = bpf2a64[ARENA_VM_START];
>   	const int idx0 = ctx->idx;
>   	int cur_offset;
> @@ -367,6 +373,10 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
>   	emit_bti(A64_BTI_JC, ctx);
>   
>   	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
> +	if (!is_ext)
> +		emit(A64_MOVZ(1, r0, is_main_prog, 0), ctx);
> +	else
> +		emit(A64_NOP, ctx);
>   	emit(A64_NOP, ctx);
>   
>   	if (!is_exception_cb) {
> @@ -413,6 +423,19 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
>   		emit_bti(A64_BTI_J, ctx);
>   	}
>   
> +	/* If freplace's target prog is main prog, it has to make x26 as
> +	 * tail_call_cnt_ptr, and then initialize tail_call_cnt via the
> +	 * tail_call_cnt_ptr.
> +	 */
> +	if (is_main_prog && is_ext) {
> +		emit(A64_MOVZ(1, tmp, 1, 0), ctx);
> +		emit(A64_CMP(1, r0, tmp), ctx);
> +		emit(A64_B_(A64_COND_NE, 4), ctx);
> +		emit(A64_ADD_I(1, ptr, A64_SP, 16), ctx);
> +		emit(A64_MOVZ(1, r0, 0, 0), ctx);
> +		emit(A64_STR64I(r0, ptr, 0), ctx);
> +	}
> +
>   	/*
>   	 * Program acting as exception boundary should save all ARM64
>   	 * Callee-saved registers as the exception callback needs to recover
> @@ -444,6 +467,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf,
>   static int out_offset = -1; /* initialized on the first pass of build_body() */
>   static int emit_bpf_tail_call(struct jit_ctx *ctx)
>   {
> +	const u8 r0 = bpf2a64[BPF_REG_0];
>   	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
>   	const u8 r2 = bpf2a64[BPF_REG_2];
>   	const u8 r3 = bpf2a64[BPF_REG_3];
> @@ -491,6 +515,11 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>   
>   	/* Update tail_call_cnt if the slot is populated. */
>   	emit(A64_STR64I(tcc, ptr, 0), ctx);
> +	/* When freplace prog tail calls freplace prog, setting r0 as 0 is to
> +	 * prevent re-initializing tail_call_cnt at the prologue of target
> +	 * freplace prog.
> +	 */
> +	emit(A64_MOVZ(1, r0, 0, 0), ctx);
>   
>   	/* goto *(prog->bpf_func + prologue_offset); */
>   	off = offsetof(struct bpf_prog, bpf_func);
> @@ -2199,9 +2228,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>   		emit(A64_RET(A64_R(10)), ctx);
>   		/* store return value */
>   		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
> -		/* reserve a nop for bpf_tramp_image_put */
> +		/* reserve two nops for bpf_tramp_image_put */
>   		im->ip_after_call = ctx->ro_image + ctx->idx;
>   		emit(A64_NOP, ctx);
> +		emit(A64_NOP, ctx);
>   	}
>   
>   	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
> @@ -2484,6 +2514,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>   		/* skip to the nop instruction in bpf prog entry:
>   		 * bti c // if BTI enabled
>   		 * mov x9, x30
> +		 * mov x7, 1 // if not-freplace main prog
>   		 * nop
>   		 */
>   		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;

This patch makes the arm64 jited prologue even more complex. I've
posted a series [1] to simplify the arm64 jited prologue/epilogue. I
think we can fix this issue based on [1]. I'll give it a try.

[1] https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/



* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-26 14:32   ` Xu Kuohai
@ 2024-08-27  2:23     ` Leon Hwang
  2024-08-30  7:37       ` Xu Kuohai
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-27  2:23 UTC (permalink / raw)
  To: Xu Kuohai, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot



On 26/8/24 22:32, Xu Kuohai wrote:
> On 8/25/2024 9:09 PM, Leon Hwang wrote:
>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
>> issue happens on arm64, too.
>>

[...]

> 
> This patch makes arm64 jited prologue even more complex. I've posted a
> series [1]
> to simplify the arm64 jited prologue/epilogue. I think we can fix this
> issue based
> on [1]. I'll give it a try.
> 
> [1]
> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
> 

Your patch series looks great. We can fix this issue based on it.

Please let me know if your attempt succeeds.

Thanks,
Leon


* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
@ 2024-08-27 10:37   ` Eduard Zingerman
  2024-08-27 12:48     ` Leon Hwang
  0 siblings, 1 reply; 24+ messages in thread
From: Eduard Zingerman @ 2024-08-27 10:37 UTC (permalink / raw)
  To: Leon Hwang, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, iii, kernel-patches-bot

On Sun, 2024-08-25 at 21:09 +0800, Leon Hwang wrote:
> This patch fixes a tailcall infinite loop issue caused by freplace.
> 
> Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
> a freplace prog has been allowed to tail call its target prog. So,
> when a freplace prog attaches to its target prog's subprog and tail
> calls its target prog, the kernel panics.
> 
> For example:
> 
> tc_bpf2bpf.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> 
> __noinline
> int subprog_tc(struct __sk_buff *skb)
> {
> 	return skb->len * 2;
> }
> 
> SEC("tc")
> int entry_tc(struct __sk_buff *skb)
> {
> 	return subprog_tc(skb);
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> tailcall_bpf2bpf_hierarchy_freplace.c:
> 
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
> 
> struct {
> 	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> 	__uint(max_entries, 1);
> 	__uint(key_size, sizeof(__u32));
> 	__uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
> 
> int count = 0;
> 
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> 	bpf_tail_call_static(skb, &jmp_table, 0);
> 	return 0;
> }
> 
> SEC("freplace")
> int entry_freplace(struct __sk_buff *skb)
> {
> 	count++;
> 	subprog_tail(skb);
> 	subprog_tail(skb);
> 	return count;
> }
> 
> char __license[] SEC("license") = "GPL";
> 
> The attach target of entry_freplace is subprog_tc, and the tail callee
> in subprog_tail is entry_tc.
> 
> Then, the infinite loop will be entry_tc -> subprog_tc -> entry_freplace
> -> subprog_tail --tailcall-> entry_tc, because tail_call_cnt in
> entry_freplace starts counting from zero on every execution of
> entry_freplace. The kernel will panic:
> 
> [   15.310490] BUG: TASK stack guard page was hit at (____ptrval____)
> (stack is (____ptrval____)..(____ptrval____))
> [   15.310490] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [   15.310490] CPU: 1 PID: 89 Comm: test_progs Tainted: G           OE
>    6.10.0-rc6-g026dcdae8d3e-dirty #72
> [   15.310490] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX,
> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [   15.310490] Call Trace:
> [   15.310490]  <#DF>
> [   15.310490]  ? die+0x36/0x90
> [   15.310490]  ? handle_stack_overflow+0x4d/0x60
> [   15.310490]  ? exc_double_fault+0x117/0x1a0
> [   15.310490]  ? asm_exc_double_fault+0x23/0x30
> [   15.310490]  ? bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490]  </#DF>
> [   15.310490]  <TASK>
> [   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
> [   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [   15.310490]  ...
> [   15.310490]  bpf_prog_85781a698094722f_entry+0x4c/0x64
> [   15.310490]  bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [   15.310490]  bpf_test_run+0x210/0x370
> [   15.310490]  ? bpf_test_run+0x128/0x370
> [   15.310490]  bpf_prog_test_run_skb+0x388/0x7a0
> [   15.310490]  __sys_bpf+0xdbf/0x2c40
> [   15.310490]  ? clockevents_program_event+0x52/0xf0
> [   15.310490]  ? lock_release+0xbf/0x290
> [   15.310490]  __x64_sys_bpf+0x1e/0x30
> [   15.310490]  do_syscall_64+0x68/0x140
> [   15.310490]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   15.310490] RIP: 0033:0x7f133b52725d
> [   15.310490] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
> [   15.310490] RSP: 002b:00007ffddbc10258 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000141
> [   15.310490] RAX: ffffffffffffffda RBX: 00007ffddbc10828 RCX:
> 00007f133b52725d
> [   15.310490] RDX: 0000000000000050 RSI: 00007ffddbc102a0 RDI:
> 000000000000000a
> [   15.310490] RBP: 00007ffddbc10270 R08: 0000000000000000 R09:
> 00007ffddbc102a0
> [   15.310490] R10: 0000000000000064 R11: 0000000000000206 R12:
> 0000000000000004
> [   15.310490] R13: 0000000000000000 R14: 0000558ec4c24890 R15:
> 00007f133b6ed000
> [   15.310490]  </TASK>
> [   15.310490] Modules linked in: bpf_testmod(OE)
> [   15.310490] ---[ end trace 0000000000000000 ]---
> [   15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [   15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [   15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [   15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [   15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [   15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [   15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [   15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [   15.310490] FS:  00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [   15.310490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [   15.310490] Kernel panic - not syncing: Fatal exception in interrupt
> [   15.310490] Kernel Offset: 0x30000000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> This patch fixes the issue by initializing tail_call_cnt at the prologue
> of entry_tc.
> 
> Next, when calling subprog_tc, the tail_call_cnt_ptr is propagated to
> subprog_tc by rax.
> Next, when jumping to entry_freplace, the tail_call_cnt_ptr is reused to
> count tailcalls in the freplace prog.
> Next, when calling subprog_tail, the tail_call_cnt_ptr is propagated to
> subprog_tail by rax.
> Next, when tail calling entry_tc, the tail_call_cnt on the stack of
> entry_tc is incremented via the tail_call_cnt_ptr.
> 
> The whole procedure shows as the following JITed prog dumping.
> 
> bpftool p d j n entry_tc:
> 
> int entry_tc(struct __sk_buff * skb):
> bpf_prog_1c515f389a9059b4_entry_tc:
> ; return subprog_tc(skb);
>    0:	endbr64
>    4:	xorq	%rax, %rax		;; rax = 0 (tail_call_cnt)
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
>   18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
>   1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
>   1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
>   1e:	jmp	0x21			;; ---------+               |
>   20:	pushq	%rax			;; <--------|---------------+
>   21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
>   22:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   29:	callq	0x70			;; call subprog_tc()
> ; return subprog_tc(skb);
>   2e:	leave
>   2f:	retq
> 
> int subprog_tc(struct __sk_buff * skb):
> bpf_prog_1e8f76e2374a0607_subprog_tc:
> ; return skb->len * 2;
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	jmp	0x108			;; jump to entry_freplace()
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	pushq	%rax
>   15:	pushq	%rax
>   16:	movl	112(%rdi), %eax
> ; return skb->len * 2;
>   19:	shll	%eax
> ; return skb->len * 2;
>   1b:	leave
>   1c:	retq
> 
> bpftool p d j n entry_freplace:
> 
> int entry_freplace(struct __sk_buff * skb):
> bpf_prog_85781a698094722f_entry_freplace:
> ; int entry_freplace(struct __sk_buff *skb)
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
>   18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
>   1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
>   1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
>   1e:	jmp	0x21			;; ---------+               |
>   20:	pushq	%rax			;; <--------|---------------+
>   21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
>   22:	pushq	%rbx			;; callee saved
>   23:	pushq	%r13			;; callee saved
>   25:	movq	%rdi, %rbx		;; rbx = skb (callee saved)
> ; count++;
>   28:	movabsq	$-123406219759616, %r13
>   32:	movl	(%r13), %edi
>   36:	addl	$1, %edi
>   39:	movl	%edi, (%r13)
> ; subprog_tail(skb);
>   3d:	movq	%rbx, %rdi		;; rdi = skb
>   40:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   47:	callq	0x80			;; call subprog_tail()
> ; subprog_tail(skb);
>   4c:	movq	%rbx, %rdi		;; rdi = skb
>   4f:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
>   56:	callq	0x80			;; call subprog_tail()
> ; return count;
>   5b:	movl	(%r13), %eax
> ; return count;
>   5f:	popq	%r13
>   61:	popq	%rbx
>   62:	leave
>   63:	retq
> 
> int subprog_tail(struct __sk_buff * skb):
> bpf_prog_3a140cef239a4b4f_subprog_tail:
> ; int subprog_tail(struct __sk_buff *skb)
>    0:	endbr64
>    4:	nopl	(%rax)			;; do not touch tail_call_cnt
>    7:	nopl	(%rax,%rax)
>    c:	pushq	%rbp
>    d:	movq	%rsp, %rbp
>   10:	endbr64
>   14:	pushq	%rax			;; [rbp - 8]  = rax (tcc_ptr)
>   15:	pushq	%rax			;; [rbp - 16] = rax (tcc_ptr)
>   16:	pushq	%rbx			;; callee saved
>   17:	pushq	%r13			;; callee saved
>   19:	movq	%rdi, %rbx		;; rbx = skb
> ; asm volatile("r1 = %[ctx]\n\t"
>   1c:	movabsq	$-128494642337280, %r13	;; r13 = jmp_table
>   26:	movq	%rbx, %rdi		;; 1st arg, skb
>   29:	movq	%r13, %rsi		;; 2nd arg, jmp_table
>   2c:	xorl	%edx, %edx		;; 3rd arg, index = 0
>   2e:	movq	-16(%rbp), %rax		;; rax = [rbp - 16] (tcc_ptr)
>   35:	cmpq	$33, (%rax)
>   39:	jae	0x4e			;; if *tcc_ptr >= 33 goto 0x4e --------+
>   3b:	nopl	(%rax,%rax)		;; jmp bypass, toggled by poking       |
>   40:	addq	$1, (%rax)		;; (*tcc_ptr)++                        |
>   44:	popq	%r13			;; callee saved                        |
>   46:	popq	%rbx			;; callee saved                        |
>   47:	popq	%rax			;; undo rbp-16 push                    |
>   48:	popq	%rax			;; undo rbp-8  push                    |
>   49:	jmp	0xfffffffffffffe18	;; tail call target, toggled by poking |
> ; return 0;				;;                                     |
>   4e:	popq	%r13			;; restore callee saved <--------------+
>   50:	popq	%rbx			;; restore callee saved
>   51:	leave
>   52:	retq
> 
> As a result, the tail_call_cnt is stored on the stack of entry_tc. And
> the tail_call_cnt_ptr is propagated between subprog_tc, entry_freplace,
> subprog_tail and entry_tc.
> 
> Furthermore, the trampoline is required to always propagate
> tail_call_cnt/tail_call_cnt_ptr, no matter whether a tailcall happens
> at run time.
> So, this patch reuses the trampoline flag "BIT(7)" to tell the trampoline
> to propagate the tail_call_cnt/tail_call_cnt_ptr, as
> BPF_TRAMP_F_TAIL_CALL_CTX is not used by any other arch's BPF JIT.

This change seems to be correct.
Could you please add an explanation somewhere of why nop3/xor and nop5
are swapped in the prologue?

As far as I understand, this is done so that the freplace program
reuses the xor generated for the replaced program, is that right?
E.g. for tailcall_bpf2bpf_freplace test case disasm looks as follows:

--------------- entry_tc --------------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	48 31 c0                            	xorq	%rax, %rax
7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
c:	55                                  	pushq	%rbp
d:	48 89 e5                            	movq	%rsp, %rbp
10:	f3 0f 1e fa                         	endbr64
...

------------ entry_freplace -----------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	0f 1f 00                            	nopl	(%rax)
7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
c:	55                                  	pushq	%rbp
d:	48 89 e5                            	movq	%rsp, %rbp
...

So, if entry_freplace were used to replace entry_tc instead
of subprog_tc, the disasm would change to:

--------------- entry_tc --------------
func #0:
0:	f3 0f 1e fa                         	endbr64
4:	48 31 c0                            	xorq	%rax, %rax
7:	0f 1f 44 00 00                      	jmp <entry_freplace>

Thus reusing %rax initialization from entry_tc.

> However, the bad effect is that tail_call_cnt must always be initialized
> in the prologue, no matter whether the prog is tail_call_reachable, because
> a prog cannot know whether itself or its subprog[s] will be attached to by
> an freplace prog.
> And, when calling a subprog, tail_call_cnt_ptr must always be propagated
> to the subprog.

This seems unfortunate.
I wonder if disallowing freplace when
replacement.tail_call_reachable != replaced.tail_call_reachable
would be a better option?
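
A sketch of that condition, assuming the existing
bpf_prog_aux::tail_call_reachable flag is the right thing to compare
(just to illustrate the idea, untested):

	/* e.g. in the freplace attach path, bpf_check_attach_target() */
	if (prog->aux->tail_call_reachable !=
	    tgt_prog->aux->tail_call_reachable)
		return -EINVAL;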

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-27 10:37   ` Eduard Zingerman
@ 2024-08-27 12:48     ` Leon Hwang
  2024-08-27 20:50       ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-27 12:48 UTC (permalink / raw)
  To: Eduard Zingerman, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	xukuohai, iii, kernel-patches-bot



On 2024/8/27 18:37, Eduard Zingerman wrote:
> On Sun, 2024-08-25 at 21:09 +0800, Leon Hwang wrote:
>> This patch fixes a tailcall infinite loop issue caused by freplace.
>>
>> Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
>> freplace prog is allowed to tail call its target prog. Then, when a
>> freplace prog attaches to its target prog's subprog and tail calls its
>> target prog, kernel will panic.
>>

[...]

>>
>> As a result, the tail_call_cnt is stored on the stack of entry_tc. And
>> the tail_call_cnt_ptr is propagated between subprog_tc, entry_freplace,
>> subprog_tail and entry_tc.
>>
>> Furthermore, the trampoline is required to always propagate
>> tail_call_cnt/tail_call_cnt_ptr, no matter whether a tailcall happens
>> at run time.
>> So, this patch reuses the trampoline flag "BIT(7)" to tell the trampoline
>> to propagate the tail_call_cnt/tail_call_cnt_ptr, as
>> BPF_TRAMP_F_TAIL_CALL_CTX is not used by any other arch's BPF JIT.
> 
> This change seems to be correct.
> Could you please add an explanation somewhere of why nop3/xor and nop5
> are swapped in the prologue?

OK, I'll explain it in the message of patch v2.

> 
> As far as I understand, this is done so that the freplace program
> reuses the xor generated for the replaced program, is that right?
> E.g. for tailcall_bpf2bpf_freplace test case disasm looks as follows:
> 
> --------------- entry_tc --------------
> func #0:
> 0:	f3 0f 1e fa                         	endbr64
> 4:	48 31 c0                            	xorq	%rax, %rax
> 7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
> c:	55                                  	pushq	%rbp
> d:	48 89 e5                            	movq	%rsp, %rbp
> 10:	f3 0f 1e fa                         	endbr64
> ...
> 
> ------------ entry_freplace -----------
> func #0:
> 0:	f3 0f 1e fa                         	endbr64
> 4:	0f 1f 00                            	nopl	(%rax)
> 7:	0f 1f 44 00 00                      	nopl	(%rax,%rax)
> c:	55                                  	pushq	%rbp
> d:	48 89 e5                            	movq	%rsp, %rbp
> ...
> 
> So, if entry_freplace were used to replace entry_tc instead
> of subprog_tc, the disasm would change to:
> 
> --------------- entry_tc --------------
> func #0:
> 0:	f3 0f 1e fa                         	endbr64
> 4:	48 31 c0                            	xorq	%rax, %rax
> 7:	0f 1f 44 00 00                      	jmp <entry_freplace>
> 
> Thus reusing %rax initialization from entry_tc.
> 

Great. You understand it correctly.

>> However, the bad effect is that tail_call_cnt must always be initialized
>> in the prologue, no matter whether the prog is tail_call_reachable, because
>> a prog cannot know whether itself or its subprog[s] will be attached to by
>> an freplace prog.
>> And, when calling a subprog, tail_call_cnt_ptr must always be propagated
>> to the subprog.
> 
> This seems unfortunate.
> I wonder if disallowing freplace when
> replacement.tail_call_reachable != replaced.tail_call_reachable
> would be a better option?
> 

This idea is wonderful.

We can disallow attaching a tail_call_reachable freplace prog to a
not-tail_call_reachable bpf prog. Then, the following 3 cases are allowed.

1. attach a tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
2. attach a not-tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
3. attach a not-tail_call_reachable freplace prog to a not-tail_call_reachable bpf prog.

I would like to add this restriction in patch v2: a tail_call_reachable
freplace prog is disallowed from attaching to a not-tail_call_reachable bpf
prog. With this restriction, tail_call_cnt_ptr only needs to be propagated
to a subprog when the subprog is tail_call_reachable.
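
A rough sketch of the check (where exactly it goes, probably around
bpf_check_attach_target(), is TBD; note it would really need the
reachability of the specific target subprog, not just the target main
prog as written here):

	/* disallow attaching a tail_call_reachable freplace prog to a
	 * not-tail_call_reachable target
	 */
	if (prog->aux->tail_call_reachable &&
	    !tgt_prog->aux->tail_call_reachable) {
		bpf_log(log,
			"Cannot attach tail_call_reachable freplace to non-tail_call_reachable target\n");
		return -EINVAL;
	}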

Thanks,
Leon

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-27 12:48     ` Leon Hwang
@ 2024-08-27 20:50       ` Alexei Starovoitov
  2024-08-28  2:36         ` Leon Hwang
  2024-09-02 10:19         ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2024-08-27 20:50 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Xu Kuohai,
	Ilya Leoshkevich, kernel-patches-bot

On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> > I wonder if disallowing freplace when
> > replacement.tail_call_reachable != replaced.tail_call_reachable
> > would be a better option?
> >
>
> This idea is wonderful.
>
> We can disallow attaching a tail_call_reachable freplace prog to a
> not-tail_call_reachable bpf prog. Then, the following 3 cases are allowed.
>
> 1. attach a tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
> 2. attach a not-tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
> 3. attach a not-tail_call_reachable freplace prog to a not-tail_call_reachable bpf prog.

I think it's fine to disable the freplace and tail_call combination altogether.

And speaking of the patch. The following:
-                       if (tail_call_reachable) {
-                               LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
-                               ip += 7;
-                       }
+                       LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+                       ip += 7;

Is too high of a penalty for every call for freplace+tail_call combo.

So disable it in the verifier.

pw-bot: cr

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-27 20:50       ` Alexei Starovoitov
@ 2024-08-28  2:36         ` Leon Hwang
  2024-08-28 16:01           ` Alexei Starovoitov
  2024-09-02 10:19         ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-28  2:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Xu Kuohai,
	Ilya Leoshkevich, kernel-patches-bot



On 28/8/24 04:50, Alexei Starovoitov wrote:
> On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>> I wonder if disallowing freplace when
>>> replacement.tail_call_reachable != replaced.tail_call_reachable
>>> would be a better option?
>>>
>>
>> This idea is wonderful.
>>
>> We can disallow attaching a tail_call_reachable freplace prog to a
>> not-tail_call_reachable bpf prog. Then, the following 3 cases are allowed.
>>
>> 1. attach a tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
>> 2. attach a not-tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
>> 3. attach a not-tail_call_reachable freplace prog to a not-tail_call_reachable bpf prog.
> 
> I think it's fine to disable the freplace and tail_call combination altogether.

I don't think so.

My XDP project heavily relies on the freplace and tailcall combination.

> 
> And speaking of the patch. The following:
> -                       if (tail_call_reachable) {
> -                               LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
> -                               ip += 7;
> -                       }
> +                       LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
> +                       ip += 7;
> 
> Is too high of a penalty for every call for freplace+tail_call combo.
> 
> So disable it in the verifier.
> 

I think it's enough to disallow attaching a tail_call_reachable freplace
prog to a not-tail_call_reachable prog in the verifier.

As for this code snippet in the x64 JIT:

			func = (u8 *) __bpf_call_base + imm32;
			if (tail_call_reachable) {
				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
				ip += 7;
			}
			if (!imm32)
				return -EINVAL;
			ip += x86_call_depth_emit_accounting(&prog, func, ip);
			if (emit_call(&prog, func, ip))
				return -EINVAL;

when a subprog is tail_call_reachable, its caller has to propagate
tail_call_cnt_ptr via rax. In this case, it's fine to attach a
tail_call_reachable freplace prog to the subprog.

When the subprog is not tail_call_reachable, its caller does not need to
propagate tail_call_cnt_ptr via rax. So it's disallowed to attach a
tail_call_reachable freplace prog to the subprog. However, it's fine to
attach a not-tail_call_reachable freplace prog to the subprog.

In conclusion, if the verifier disallows attaching a tail_call_reachable
freplace prog to a not-tail_call_reachable prog, the above code snippet
won't need to be changed.

Thanks,
Leon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-28  2:36         ` Leon Hwang
@ 2024-08-28 16:01           ` Alexei Starovoitov
  2024-08-29  2:14             ` Leon Hwang
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2024-08-28 16:01 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Xu Kuohai,
	Ilya Leoshkevich, kernel-patches-bot

On Tue, Aug 27, 2024 at 7:36 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 28/8/24 04:50, Alexei Starovoitov wrote:
> > On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >>> I wonder if disallowing freplace when
> >>> replacement.tail_call_reachable != replaced.tail_call_reachable
> >>> would be a better option?
> >>>
> >>
> >> This idea is wonderful.
> >>
> >> We can disallow attaching a tail_call_reachable freplace prog to a
> >> not-tail_call_reachable bpf prog. Then, the following 3 cases are allowed.
> >>
> >> 1. attach a tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
> >> 2. attach a not-tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
> >> 3. attach a not-tail_call_reachable freplace prog to a not-tail_call_reachable bpf prog.
> >
> > I think it's fine to disable the freplace and tail_call combination altogether.
>
> I don't think so.
>
> My XDP project heavily relies on the freplace and tailcall combination.

Pls share the link to the code.

> >
> > And speaking of the patch. The following:
> > -                       if (tail_call_reachable) {
> > -                               LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
> > -                               ip += 7;
> > -                       }
> > +                       LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
> > +                       ip += 7;
> >
> > Is too high of a penalty for every call for freplace+tail_call combo.
> >
> > So disable it in the verifier.
> >
>
> I think it's enough to disallow attaching a tail_call_reachable freplace
> prog to a not-tail_call_reachable prog in the verifier.
>
> As for this code snippet in the x64 JIT:
>
>                         func = (u8 *) __bpf_call_base + imm32;
>                         if (tail_call_reachable) {
>                                 LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
>                                 ip += 7;
>                         }
>                         if (!imm32)
>                                 return -EINVAL;
>                         ip += x86_call_depth_emit_accounting(&prog, func, ip);
>                         if (emit_call(&prog, func, ip))
>                                 return -EINVAL;
>
> when a subprog is tail_call_reachable, its caller has to propagate
> tail_call_cnt_ptr via rax. In this case, it's fine to attach a
> tail_call_reachable freplace prog to the subprog.
>
> When the subprog is not tail_call_reachable, its caller does not need to
> propagate tail_call_cnt_ptr via rax. So it's disallowed to attach a
> tail_call_reachable freplace prog to the subprog. However, it's fine to
> attach a not-tail_call_reachable freplace prog to the subprog.
>
> In conclusion, if the verifier disallows attaching a tail_call_reachable
> freplace prog to a not-tail_call_reachable prog, the above code snippet
> won't need to be changed.

As long as there are no more JIT changes it's ok to go
with this partial verifier restriction,
but if more issues are found we'll have to restrict it further.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-28 16:01           ` Alexei Starovoitov
@ 2024-08-29  2:14             ` Leon Hwang
  0 siblings, 0 replies; 24+ messages in thread
From: Leon Hwang @ 2024-08-29  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Xu Kuohai,
	Ilya Leoshkevich, kernel-patches-bot



On 29/8/24 00:01, Alexei Starovoitov wrote:
> On Tue, Aug 27, 2024 at 7:36 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>
>>
>> On 28/8/24 04:50, Alexei Starovoitov wrote:
>>> On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>>>
>>>>> I wonder if disallowing freplace when
>>>>> replacement.tail_call_reachable != replaced.tail_call_reachable
>>>>> would be a better option?
>>>>>
>>>>
>>>> This idea is wonderful.
>>>>
>>>> We can disallow attaching a tail_call_reachable freplace prog to a
>>>> not-tail_call_reachable bpf prog. Then, the following 3 cases are allowed.
>>>>
>>>> 1. attach a tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
>>>> 2. attach a not-tail_call_reachable freplace prog to a tail_call_reachable bpf prog.
>>>> 3. attach a not-tail_call_reachable freplace prog to a not-tail_call_reachable bpf prog.
>>>
>>> I think it's fine to disable the freplace and tail_call combination altogether.
>>
>> I don't think so.
>>
>> My XDP project heavily relies on the freplace and tailcall combination.
> 
> Pls share the link to the code.
> 

I'm willing to share it with you, but it's an in-house project of my
company. Sorry.

>>>
>>> And speaking of the patch. The following:
>>> -                       if (tail_call_reachable) {
>>> -                               LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
>>> -                               ip += 7;
>>> -                       }
>>> +                       LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
>>> +                       ip += 7;
>>>
>>> Is too high of a penalty for every call for freplace+tail_call combo.
>>>
>>> So disable it in the verifier.
>>>
>>
>> I think it's enough to disallow attaching a tail_call_reachable freplace
>> prog to a not-tail_call_reachable prog in the verifier.
>>
>> As for this code snippet in the x64 JIT:
>>
>>                         func = (u8 *) __bpf_call_base + imm32;
>>                         if (tail_call_reachable) {
>>                                 LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
>>                                 ip += 7;
>>                         }
>>                         if (!imm32)
>>                                 return -EINVAL;
>>                         ip += x86_call_depth_emit_accounting(&prog, func, ip);
>>                         if (emit_call(&prog, func, ip))
>>                                 return -EINVAL;
>>
>> when a subprog is tail_call_reachable, its caller has to propagate
>> tail_call_cnt_ptr via rax. In this case, it's fine to attach a
>> tail_call_reachable freplace prog to the subprog.
>>
>> When the subprog is not tail_call_reachable, its caller does not need to
>> propagate tail_call_cnt_ptr via rax. So it's disallowed to attach a
>> tail_call_reachable freplace prog to the subprog. However, it's fine to
>> attach a not-tail_call_reachable freplace prog to the subprog.
>>
>> In conclusion, if the verifier disallows attaching a tail_call_reachable
>> freplace prog to a not-tail_call_reachable prog, the above code snippet
>> won't need to be changed.
> 
> As long as there are no more JIT changes it's ok to go
> with this partial verifier restriction,
> but if more issues are found we'll have to restrict it further.

OK. I'll add the restriction in the verifier.

Thanks,
Leon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-27  2:23     ` Leon Hwang
@ 2024-08-30  7:37       ` Xu Kuohai
  2024-08-30  9:08         ` Leon Hwang
  2024-09-05  9:13         ` Puranjay Mohan
  0 siblings, 2 replies; 24+ messages in thread
From: Xu Kuohai @ 2024-08-30  7:37 UTC (permalink / raw)
  To: Leon Hwang, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot

On 8/27/2024 10:23 AM, Leon Hwang wrote:
> 
> 
> On 26/8/24 22:32, Xu Kuohai wrote:
>> On 8/25/2024 9:09 PM, Leon Hwang wrote:
>>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
>>> issue happens on arm64, too.
>>>
> 
> [...]
> 
>>
>> This patch makes arm64 jited prologue even more complex. I've posted a
>> series [1]
>> to simplify the arm64 jited prologue/epilogue. I think we can fix this
>> issue based
>> on [1]. I'll give it a try.
>>
>> [1]
>> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
>>
> 
> Your patch series seems great. We can fix this issue based on it.
> 
> Please let me know once you have a working approach.
> 

I think the complexity arises from having to decide whether
to initialize or keep the tail call counter value in the prologue.

To get rid of this complexity, a straightforward idea is to
move the tail call counter initialization to the entry of the
bpf world, and in the bpf world, we only increment and check
the tail call counter, never save/restore or set it. The
"entry of the bpf world" here refers to mechanisms like
bpf_prog_run, the bpf dispatcher, or a bpf trampoline that
allow a bpf prog to be invoked from a C function.

Below is a rough POC diff for arm64 that could pass all
of your tests. The tail call counter is held in the callee-saved
register x26, and is set to 0 by arch_run_bpf.

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 8aa32cb140b9..2c0f7daf1655 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -26,7 +26,7 @@

  #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
  #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
-#define TCCNT_PTR (MAX_BPF_JIT_REG + 2)
+#define TCALL_CNT (MAX_BPF_JIT_REG + 2)
  #define TMP_REG_3 (MAX_BPF_JIT_REG + 3)
  #define ARENA_VM_START (MAX_BPF_JIT_REG + 5)

@@ -63,7 +63,7 @@ static const int bpf2a64[] = {
  	[TMP_REG_2] = A64_R(11),
  	[TMP_REG_3] = A64_R(12),
  	/* tail_call_cnt_ptr */
-	[TCCNT_PTR] = A64_R(26),
+	[TCALL_CNT] = A64_R(26), // x26 is used to hold tail call counter
  	/* temporary register for blinding constants */
  	[BPF_REG_AX] = A64_R(9),
  	/* callee saved register for kern_vm_start address */
@@ -286,19 +286,6 @@ static bool is_lsi_offset(int offset, int scale)
   *      // PROLOGUE_OFFSET
   *	// save callee-saved registers
   */
-static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
-{
-	const bool is_main_prog = !bpf_is_subprog(ctx->prog);
-	const u8 ptr = bpf2a64[TCCNT_PTR];
-
-	if (is_main_prog) {
-		/* Initialize tail_call_cnt. */
-		emit(A64_PUSH(A64_ZR, ptr, A64_SP), ctx);
-		emit(A64_MOV(1, ptr, A64_SP), ctx);
-	} else
-		emit(A64_PUSH(ptr, ptr, A64_SP), ctx);
-}
-
  static void find_used_callee_regs(struct jit_ctx *ctx)
  {
  	int i;
@@ -419,7 +406,7 @@ static void pop_callee_regs(struct jit_ctx *ctx)
  #define POKE_OFFSET (BTI_INSNS + 1)

  /* Tail call offset to jump into */
-#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 4)
+#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 2)

  static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
  {
@@ -473,8 +460,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
  		emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
  		emit(A64_MOV(1, A64_FP, A64_SP), ctx);

-		prepare_bpf_tail_call_cnt(ctx);
-
  		if (!ebpf_from_cbpf && is_main_prog) {
  			cur_offset = ctx->idx - idx0;
  			if (cur_offset != PROLOGUE_OFFSET) {
@@ -499,7 +484,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
  		 *
  		 * 12 registers are on the stack
  		 */
-		emit(A64_SUB_I(1, A64_SP, A64_FP, 96), ctx);
+		emit(A64_SUB_I(1, A64_SP, A64_FP, 80), ctx);
  	}

  	if (ctx->fp_used)
@@ -527,8 +512,7 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)

  	const u8 tmp = bpf2a64[TMP_REG_1];
  	const u8 prg = bpf2a64[TMP_REG_2];
-	const u8 tcc = bpf2a64[TMP_REG_3];
-	const u8 ptr = bpf2a64[TCCNT_PTR];
+	const u8 tcc = bpf2a64[TCALL_CNT];
  	size_t off;
  	__le32 *branch1 = NULL;
  	__le32 *branch2 = NULL;
@@ -546,16 +530,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
  	emit(A64_NOP, ctx);

  	/*
-	 * if ((*tail_call_cnt_ptr) >= MAX_TAIL_CALL_CNT)
+	 * if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
  	 *     goto out;
  	 */
  	emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
-	emit(A64_LDR64I(tcc, ptr, 0), ctx);
  	emit(A64_CMP(1, tcc, tmp), ctx);
  	branch2 = ctx->image + ctx->idx;
  	emit(A64_NOP, ctx);

-	/* (*tail_call_cnt_ptr)++; */
+	/* tail_call_cnt++; */
  	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);

  	/* prog = array->ptrs[index];
@@ -570,9 +553,6 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
  	branch3 = ctx->image + ctx->idx;
  	emit(A64_NOP, ctx);

-	/* Update tail_call_cnt if the slot is populated. */
-	emit(A64_STR64I(tcc, ptr, 0), ctx);
-
  	/* restore SP */
  	if (ctx->stack_size)
  		emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
@@ -793,6 +773,27 @@ asm (
  "	.popsection\n"
  );

+unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
+asm (
+"	.pushsection .text, \"ax\", @progbits\n"
+"	.global arch_run_bpf\n"
+"	.type arch_run_bpf, %function\n"
+"arch_run_bpf:\n"
+#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
+"	bti j\n"
+#endif
+"	stp x29, x30, [sp, #-16]!\n"
+"	stp xzr, x26, [sp, #-16]!\n"
+"	mov x26, #0\n"
+"	blr x2\n"
+"	ldp xzr, x26, [sp], #16\n"
+"	ldp x29, x30, [sp], #16\n"
+"	ret x30\n"
+"	.size arch_run_bpf, . - arch_run_bpf\n"
+"	.popsection\n"
+);
+EXPORT_SYMBOL_GPL(arch_run_bpf);
+
  /* build a plt initialized like this:
   *
   * plt:
@@ -826,7 +827,6 @@ static void build_plt(struct jit_ctx *ctx)
  static void build_epilogue(struct jit_ctx *ctx)
  {
  	const u8 r0 = bpf2a64[BPF_REG_0];
-	const u8 ptr = bpf2a64[TCCNT_PTR];

  	/* We're done with BPF stack */
  	if (ctx->stack_size)
@@ -834,8 +834,6 @@ static void build_epilogue(struct jit_ctx *ctx)

  	pop_callee_regs(ctx);

-	emit(A64_POP(A64_ZR, ptr, A64_SP), ctx);
-
  	/* Restore FP/LR registers */
  	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);

@@ -2066,6 +2064,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
  	bool save_ret;
  	__le32 **branches = NULL;

+	bool target_is_bpf = is_bpf_text_address((unsigned long)func_addr);
+
  	/* trampoline stack layout:
  	 *                  [ parent ip         ]
  	 *                  [ FP                ]
@@ -2133,6 +2133,11 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
  	 */
  	emit_bti(A64_BTI_JC, ctx);

+	if (!target_is_bpf) {
+		emit(A64_PUSH(A64_ZR, A64_R(26), A64_SP), ctx);
+		emit(A64_MOVZ(1, A64_R(26), 0, 0), ctx);
+	}
+
  	/* frame for parent function */
  	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
  	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
@@ -2226,6 +2231,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
  	/* pop frames  */
  	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
  	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
+	if (!target_is_bpf)
+		emit(A64_POP(A64_ZR, A64_R(26), A64_SP), ctx);

  	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
  		/* skip patched function, return to parent */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index dc63083f76b7..8660d15dd50c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1244,12 +1244,14 @@ struct bpf_dispatcher {
  #define __bpfcall __nocfi
  #endif

+unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
+
  static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
  	const void *ctx,
  	const struct bpf_insn *insnsi,
  	bpf_func_t bpf_func)
  {
-	return bpf_func(ctx, insnsi);
+	return arch_run_bpf(ctx, insnsi, bpf_func);
  }

  /* the implementation of the opaque uapi struct bpf_dynptr */
@@ -1317,7 +1319,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
  #else
  #define __BPF_DISPATCHER_SC_INIT(name)
  #define __BPF_DISPATCHER_SC(name)
-#define __BPF_DISPATCHER_CALL(name)		bpf_func(ctx, insnsi)
+#define __BPF_DISPATCHER_CALL(name)		arch_run_bpf(ctx, insnsi, bpf_func);
  #define __BPF_DISPATCHER_UPDATE(_d, _new)
  #endif

> Thanks,
> Leon


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-30  7:37       ` Xu Kuohai
@ 2024-08-30  9:08         ` Leon Hwang
  2024-08-30 10:00           ` Xu Kuohai
  2024-09-05  9:13         ` Puranjay Mohan
  1 sibling, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-30  9:08 UTC (permalink / raw)
  To: Xu Kuohai, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot



On 30/8/24 15:37, Xu Kuohai wrote:
> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>>

[...]

> 
> I think the complexity arises from having to decide whether
> to initialize or keep the tail call counter value in the prologue.
> 
> To get rid of this complexity, a straightforward idea is to
> move the tail call counter initialization to the entry of the
> bpf world, and in the bpf world, we only increment and check
> the tail call counter, never save/restore or set it. The
> "entry of the bpf world" here refers to mechanisms like
> bpf_prog_run, the bpf dispatcher, or a bpf trampoline that
> allow a bpf prog to be invoked from a C function.
> 
> Below is a rough POC diff for arm64 that could pass all
> of your tests. The tail call counter is held in the callee-saved
> register x26, and is set to 0 by arch_run_bpf.
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 8aa32cb140b9..2c0f7daf1655 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -26,7 +26,7 @@
> 
>  #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
>  #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
> -#define TCCNT_PTR (MAX_BPF_JIT_REG + 2)
> +#define TCALL_CNT (MAX_BPF_JIT_REG + 2)
>  #define TMP_REG_3 (MAX_BPF_JIT_REG + 3)
>  #define ARENA_VM_START (MAX_BPF_JIT_REG + 5)
> 
> @@ -63,7 +63,7 @@ static const int bpf2a64[] = {
>      [TMP_REG_2] = A64_R(11),
>      [TMP_REG_3] = A64_R(12),
>      /* tail_call_cnt_ptr */
> -    [TCCNT_PTR] = A64_R(26),
> +    [TCALL_CNT] = A64_R(26), // x26 is used to hold tail call counter
>      /* temporary register for blinding constants */
>      [BPF_REG_AX] = A64_R(9),
>      /* callee saved register for kern_vm_start address */
> @@ -286,19 +286,6 @@ static bool is_lsi_offset(int offset, int scale)
>   *      // PROLOGUE_OFFSET
>   *    // save callee-saved registers
>   */
> -static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
> -{
> -    const bool is_main_prog = !bpf_is_subprog(ctx->prog);
> -    const u8 ptr = bpf2a64[TCCNT_PTR];
> -
> -    if (is_main_prog) {
> -        /* Initialize tail_call_cnt. */
> -        emit(A64_PUSH(A64_ZR, ptr, A64_SP), ctx);
> -        emit(A64_MOV(1, ptr, A64_SP), ctx);
> -    } else
> -        emit(A64_PUSH(ptr, ptr, A64_SP), ctx);
> -}
> -
>  static void find_used_callee_regs(struct jit_ctx *ctx)
>  {
>      int i;
> @@ -419,7 +406,7 @@ static void pop_callee_regs(struct jit_ctx *ctx)
>  #define POKE_OFFSET (BTI_INSNS + 1)
> 
>  /* Tail call offset to jump into */
> -#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 4)
> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 2)
> 
>  static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>  {
> @@ -473,8 +460,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>          emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>          emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> 
> -        prepare_bpf_tail_call_cnt(ctx);
> -
>          if (!ebpf_from_cbpf && is_main_prog) {
>              cur_offset = ctx->idx - idx0;
>              if (cur_offset != PROLOGUE_OFFSET) {
> @@ -499,7 +484,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>           *
>           * 12 registers are on the stack
>           */
> -        emit(A64_SUB_I(1, A64_SP, A64_FP, 96), ctx);
> +        emit(A64_SUB_I(1, A64_SP, A64_FP, 80), ctx);
>      }
> 
>      if (ctx->fp_used)
> @@ -527,8 +512,7 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
> 
>      const u8 tmp = bpf2a64[TMP_REG_1];
>      const u8 prg = bpf2a64[TMP_REG_2];
> -    const u8 tcc = bpf2a64[TMP_REG_3];
> -    const u8 ptr = bpf2a64[TCCNT_PTR];
> +    const u8 tcc = bpf2a64[TCALL_CNT];
>      size_t off;
>      __le32 *branch1 = NULL;
>      __le32 *branch2 = NULL;
> @@ -546,16 +530,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>      emit(A64_NOP, ctx);
> 
>      /*
> -     * if ((*tail_call_cnt_ptr) >= MAX_TAIL_CALL_CNT)
> +     * if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
>       *     goto out;
>       */
>      emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
> -    emit(A64_LDR64I(tcc, ptr, 0), ctx);
>      emit(A64_CMP(1, tcc, tmp), ctx);
>      branch2 = ctx->image + ctx->idx;
>      emit(A64_NOP, ctx);
> 
> -    /* (*tail_call_cnt_ptr)++; */
> +    /* tail_call_cnt++; */
>      emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
> 
>      /* prog = array->ptrs[index];
> @@ -570,9 +553,6 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>      branch3 = ctx->image + ctx->idx;
>      emit(A64_NOP, ctx);
> 
> -    /* Update tail_call_cnt if the slot is populated. */
> -    emit(A64_STR64I(tcc, ptr, 0), ctx);
> -
>      /* restore SP */
>      if (ctx->stack_size)
>          emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
> @@ -793,6 +773,27 @@ asm (
>  "    .popsection\n"
>  );
> 
> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
> +asm (
> +"    .pushsection .text, \"ax\", @progbits\n"
> +"    .global arch_run_bpf\n"
> +"    .type arch_run_bpf, %function\n"
> +"arch_run_bpf:\n"
> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
> +"    bti j\n"
> +#endif
> +"    stp x29, x30, [sp, #-16]!\n"
> +"    stp xzr, x26, [sp, #-16]!\n"
> +"    mov x26, #0\n"
> +"    blr x2\n"
> +"    ldp xzr, x26, [sp], #16\n"
> +"    ldp x29, x30, [sp], #16\n"
> +"    ret x30\n"
> +"    .size arch_run_bpf, . - arch_run_bpf\n"
> +"    .popsection\n"
> +);
> +EXPORT_SYMBOL_GPL(arch_run_bpf);
> +
>  /* build a plt initialized like this:
>   *
>   * plt:
> @@ -826,7 +827,6 @@ static void build_plt(struct jit_ctx *ctx)
>  static void build_epilogue(struct jit_ctx *ctx)
>  {
>      const u8 r0 = bpf2a64[BPF_REG_0];
> -    const u8 ptr = bpf2a64[TCCNT_PTR];
> 
>      /* We're done with BPF stack */
>      if (ctx->stack_size)
> @@ -834,8 +834,6 @@ static void build_epilogue(struct jit_ctx *ctx)
> 
>      pop_callee_regs(ctx);
> 
> -    emit(A64_POP(A64_ZR, ptr, A64_SP), ctx);
> -
>      /* Restore FP/LR registers */
>      emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
> 
> @@ -2066,6 +2064,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>      bool save_ret;
>      __le32 **branches = NULL;
> 
> +    bool target_is_bpf = is_bpf_text_address((unsigned long)func_addr);
> +
>      /* trampoline stack layout:
>       *                  [ parent ip         ]
>       *                  [ FP                ]
> @@ -2133,6 +2133,11 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>       */
>      emit_bti(A64_BTI_JC, ctx);
> 
> +    if (!target_is_bpf) {
> +        emit(A64_PUSH(A64_ZR, A64_R(26), A64_SP), ctx);
> +        emit(A64_MOVZ(1, A64_R(26), 0, 0), ctx);
> +    }
> +
>      /* frame for parent function */
>      emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
>      emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> @@ -2226,6 +2231,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>      /* pop frames  */
>      emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>      emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
> +    if (!target_is_bpf)
> +        emit(A64_POP(A64_ZR, A64_R(26), A64_SP), ctx);
> 
>      if (flags & BPF_TRAMP_F_SKIP_FRAME) {
>          /* skip patched function, return to parent */
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index dc63083f76b7..8660d15dd50c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1244,12 +1244,14 @@ struct bpf_dispatcher {
>  #define __bpfcall __nocfi
>  #endif
> 
> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
> +
>  static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
>      const void *ctx,
>      const struct bpf_insn *insnsi,
>      bpf_func_t bpf_func)
>  {
> -    return bpf_func(ctx, insnsi);
> +    return arch_run_bpf(ctx, insnsi, bpf_func);
>  }
> 
>  /* the implementation of the opaque uapi struct bpf_dynptr */
> @@ -1317,7 +1319,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
>  #else
>  #define __BPF_DISPATCHER_SC_INIT(name)
>  #define __BPF_DISPATCHER_SC(name)
> -#define __BPF_DISPATCHER_CALL(name)        bpf_func(ctx, insnsi)
> +#define __BPF_DISPATCHER_CALL(name)        arch_run_bpf(ctx, insnsi, bpf_func);
>  #define __BPF_DISPATCHER_UPDATE(_d, _new)
>  #endif
> 

This approach is really cool!

I want a similar approach on x86, but I failed, because on x86 it's an
indirect call, "callq *%rdx", aka "bpf_func(ctx, insnsi)".

Let us imagine the arch_run_bpf() on x86:

unsigned int __naked arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi,
				  bpf_func_t bpf_func)
{
	asm (
		"pushq %rbp\n\t"
		"movq %rsp, %rbp\n\t"
		"xor %rax, %rax\n\t"	/* rax = 0 (tail_call_cnt) */
		"pushq %rax\n\t"	/* tail_call_cnt lives on the stack */
		"movq %rsp, %rax\n\t"	/* rax = tcc_ptr */
		"callq *%rdx\n\t"	/* bpf_func(ctx, insnsi) */
		"leave\n\t"
		"ret\n\t"
	);
}

If we could change "callq *%rdx" to a direct call, it would be a really
wonderful way to resolve this tailcall issue on x86.

How do we introduce arch_run_bpf() for all JIT backends?

Thanks,
Leon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-30  9:08         ` Leon Hwang
@ 2024-08-30 10:00           ` Xu Kuohai
  2024-08-30 12:11             ` Leon Hwang
  0 siblings, 1 reply; 24+ messages in thread
From: Xu Kuohai @ 2024-08-30 10:00 UTC (permalink / raw)
  To: Leon Hwang, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot

On 8/30/2024 5:08 PM, Leon Hwang wrote:
> 
> 
> On 30/8/24 15:37, Xu Kuohai wrote:
>> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>>>
> 
> [...]
> 
>>
>> I think the complexity arises from having to decide whether
>> to initialize or keep the tail call counter value in the prologue.
>>
>> To get rid of this complexity, a straightforward idea is to
>> move the tail call counter initialization to the entry of the
>> bpf world, and in the bpf world, we only increment and check
>> the tail call counter, never save/restore or set it. The
>> "entry of the bpf world" here refers to mechanisms like
>> bpf_prog_run, the bpf dispatcher, or a bpf trampoline that
>> allow a bpf prog to be invoked from a C function.
>>
>> Below is a rough POC diff for arm64 that could pass all
>> of your tests. The tail call counter is held in the callee-saved
>> register x26, and is set to 0 by arch_run_bpf.
>>
>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>> index 8aa32cb140b9..2c0f7daf1655 100644
>> --- a/arch/arm64/net/bpf_jit_comp.c
>> +++ b/arch/arm64/net/bpf_jit_comp.c
>> @@ -26,7 +26,7 @@
>>
>>   #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
>>   #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
>> -#define TCCNT_PTR (MAX_BPF_JIT_REG + 2)
>> +#define TCALL_CNT (MAX_BPF_JIT_REG + 2)
>>   #define TMP_REG_3 (MAX_BPF_JIT_REG + 3)
>>   #define ARENA_VM_START (MAX_BPF_JIT_REG + 5)
>>
>> @@ -63,7 +63,7 @@ static const int bpf2a64[] = {
>>       [TMP_REG_2] = A64_R(11),
>>       [TMP_REG_3] = A64_R(12),
>>       /* tail_call_cnt_ptr */
>> -    [TCCNT_PTR] = A64_R(26),
>> +    [TCALL_CNT] = A64_R(26), // x26 is used to hold tail call counter
>>       /* temporary register for blinding constants */
>>       [BPF_REG_AX] = A64_R(9),
>>       /* callee saved register for kern_vm_start address */
>> @@ -286,19 +286,6 @@ static bool is_lsi_offset(int offset, int scale)
>>    *      // PROLOGUE_OFFSET
>>    *    // save callee-saved registers
>>    */
>> -static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
>> -{
>> -    const bool is_main_prog = !bpf_is_subprog(ctx->prog);
>> -    const u8 ptr = bpf2a64[TCCNT_PTR];
>> -
>> -    if (is_main_prog) {
>> -        /* Initialize tail_call_cnt. */
>> -        emit(A64_PUSH(A64_ZR, ptr, A64_SP), ctx);
>> -        emit(A64_MOV(1, ptr, A64_SP), ctx);
>> -    } else
>> -        emit(A64_PUSH(ptr, ptr, A64_SP), ctx);
>> -}
>> -
>>   static void find_used_callee_regs(struct jit_ctx *ctx)
>>   {
>>       int i;
>> @@ -419,7 +406,7 @@ static void pop_callee_regs(struct jit_ctx *ctx)
>>   #define POKE_OFFSET (BTI_INSNS + 1)
>>
>>   /* Tail call offset to jump into */
>> -#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 4)
>> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 2)
>>
>>   static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>   {
>> @@ -473,8 +460,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>           emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>>           emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>>
>> -        prepare_bpf_tail_call_cnt(ctx);
>> -
>>           if (!ebpf_from_cbpf && is_main_prog) {
>>               cur_offset = ctx->idx - idx0;
>>               if (cur_offset != PROLOGUE_OFFSET) {
>> @@ -499,7 +484,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>            *
>>            * 12 registers are on the stack
>>            */
>> -        emit(A64_SUB_I(1, A64_SP, A64_FP, 96), ctx);
>> +        emit(A64_SUB_I(1, A64_SP, A64_FP, 80), ctx);
>>       }
>>
>>       if (ctx->fp_used)
>> @@ -527,8 +512,7 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>>
>>       const u8 tmp = bpf2a64[TMP_REG_1];
>>       const u8 prg = bpf2a64[TMP_REG_2];
>> -    const u8 tcc = bpf2a64[TMP_REG_3];
>> -    const u8 ptr = bpf2a64[TCCNT_PTR];
>> +    const u8 tcc = bpf2a64[TCALL_CNT];
>>       size_t off;
>>       __le32 *branch1 = NULL;
>>       __le32 *branch2 = NULL;
>> @@ -546,16 +530,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>>       emit(A64_NOP, ctx);
>>
>>       /*
>> -     * if ((*tail_call_cnt_ptr) >= MAX_TAIL_CALL_CNT)
>> +     * if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
>>        *     goto out;
>>        */
>>       emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
>> -    emit(A64_LDR64I(tcc, ptr, 0), ctx);
>>       emit(A64_CMP(1, tcc, tmp), ctx);
>>       branch2 = ctx->image + ctx->idx;
>>       emit(A64_NOP, ctx);
>>
>> -    /* (*tail_call_cnt_ptr)++; */
>> +    /* tail_call_cnt++; */
>>       emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
>>
>>       /* prog = array->ptrs[index];
>> @@ -570,9 +553,6 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>>       branch3 = ctx->image + ctx->idx;
>>       emit(A64_NOP, ctx);
>>
>> -    /* Update tail_call_cnt if the slot is populated. */
>> -    emit(A64_STR64I(tcc, ptr, 0), ctx);
>> -
>>       /* restore SP */
>>       if (ctx->stack_size)
>>           emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
>> @@ -793,6 +773,27 @@ asm (
>>   "    .popsection\n"
>>   );
>>
>> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
>> +asm (
>> +"    .pushsection .text, \"ax\", @progbits\n"
>> +"    .global arch_run_bpf\n"
>> +"    .type arch_run_bpf, %function\n"
>> +"arch_run_bpf:\n"
>> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
>> +"    bti j\n"
>> +#endif
>> +"    stp x29, x30, [sp, #-16]!\n"
>> +"    stp xzr, x26, [sp, #-16]!\n"
>> +"    mov x26, #0\n"
>> +"    blr x2\n"
>> +"    ldp xzr, x26, [sp], #16\n"
>> +"    ldp x29, x30, [sp], #16\n"
>> +"    ret x30\n"
>> +"    .size arch_run_bpf, . - arch_run_bpf\n"
>> +"    .popsection\n"
>> +);
>> +EXPORT_SYMBOL_GPL(arch_run_bpf);
>> +
>>   /* build a plt initialized like this:
>>    *
>>    * plt:
>> @@ -826,7 +827,6 @@ static void build_plt(struct jit_ctx *ctx)
>>   static void build_epilogue(struct jit_ctx *ctx)
>>   {
>>       const u8 r0 = bpf2a64[BPF_REG_0];
>> -    const u8 ptr = bpf2a64[TCCNT_PTR];
>>
>>       /* We're done with BPF stack */
>>       if (ctx->stack_size)
>> @@ -834,8 +834,6 @@ static void build_epilogue(struct jit_ctx *ctx)
>>
>>       pop_callee_regs(ctx);
>>
>> -    emit(A64_POP(A64_ZR, ptr, A64_SP), ctx);
>> -
>>       /* Restore FP/LR registers */
>>       emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>>
>> @@ -2066,6 +2064,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>       bool save_ret;
>>       __le32 **branches = NULL;
>>
>> +    bool target_is_bpf = is_bpf_text_address((unsigned long)func_addr);
>> +
>>       /* trampoline stack layout:
>>        *                  [ parent ip         ]
>>        *                  [ FP                ]
>> @@ -2133,6 +2133,11 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>        */
>>       emit_bti(A64_BTI_JC, ctx);
>>
>> +    if (!target_is_bpf) {
>> +        emit(A64_PUSH(A64_ZR, A64_R(26), A64_SP), ctx);
>> +        emit(A64_MOVZ(1, A64_R(26), 0, 0), ctx);
>> +    }
>> +
>>       /* frame for parent function */
>>       emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
>>       emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>> @@ -2226,6 +2231,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>       /* pop frames  */
>>       emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>>       emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
>> +    if (!target_is_bpf)
>> +        emit(A64_POP(A64_ZR, A64_R(26), A64_SP), ctx);
>>
>>       if (flags & BPF_TRAMP_F_SKIP_FRAME) {
>>           /* skip patched function, return to parent */
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index dc63083f76b7..8660d15dd50c 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1244,12 +1244,14 @@ struct bpf_dispatcher {
>>   #define __bpfcall __nocfi
>>   #endif
>>
>> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
>> +
>>   static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
>>       const void *ctx,
>>       const struct bpf_insn *insnsi,
>>       bpf_func_t bpf_func)
>>   {
>> -    return bpf_func(ctx, insnsi);
>> +    return arch_run_bpf(ctx, insnsi, bpf_func);
>>   }
>>
>>   /* the implementation of the opaque uapi struct bpf_dynptr */
>> @@ -1317,7 +1319,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
>>   #else
>>   #define __BPF_DISPATCHER_SC_INIT(name)
>>   #define __BPF_DISPATCHER_SC(name)
>> -#define __BPF_DISPATCHER_CALL(name)        bpf_func(ctx, insnsi)
>> +#define __BPF_DISPATCHER_CALL(name)        arch_run_bpf(ctx, insnsi, bpf_func);
>>   #define __BPF_DISPATCHER_UPDATE(_d, _new)
>>   #endif
>>
> 
> This approach is really cool!
> 
> I want a similar approach on x86, but I failed, because on x86 it's an
> indirect call, "callq *%rdx", aka "bpf_func(ctx, insnsi)".
> 
> Let us imagine the arch_run_bpf() on x86:
> 
> unsigned int __naked arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi,
> 				  bpf_func_t bpf_func)
> {
> 	asm (
> 		"pushq %rbp\n\t"
> 		"movq %rsp, %rbp\n\t"
> 		"xor %rax, %rax\n\t"	/* rax = 0 (tail_call_cnt) */
> 		"pushq %rax\n\t"	/* tail_call_cnt lives on the stack */
> 		"movq %rsp, %rax\n\t"	/* rax = tcc_ptr */
> 		"callq *%rdx\n\t"	/* bpf_func(ctx, insnsi) */
> 		"leave\n\t"
> 		"ret\n\t"
> 	);
> }
> 
> If we can change "callq *%rdx" to a direct call, it'll be really
> wonderful to resolve this tailcall issue on x86.
>

Right, so we need a static call here; perhaps we can create a custom
static call trampoline to set up the tail call counter.
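
For illustration, the shape could be like below. This is a conceptual
sketch, not working code: "bpf_prog_placeholder" is a made-up
direct-call target that the custom trampoline machinery would have to
patch per prog, like a static call site:

unsigned int __naked arch_run_bpf(const void *ctx, const struct bpf_insn
*insnsi, bpf_func_t bpf_func)
{
	asm (
		"pushq %rbp\n\t"
		"movq %rsp, %rbp\n\t"
		"xor %rax, %rax\n\t"             /* tail_call_cnt = 0 */
		"pushq %rax\n\t"                 /* counter lives on the stack */
		"movq %rsp, %rax\n\t"            /* rax = &tail_call_cnt */
		"callq bpf_prog_placeholder\n\t" /* direct call, patched on update */
		"leave\n\t"
		"ret\n\t"
	);
}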

> How do we introduce arch_run_bpf() for all JIT backends?
>

It seems we cannot avoid arch-specific code. One approach could be
to define a default __weak function that calls bpf_func directly,
and let each arch provide its own overriding implementation.
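
A minimal sketch of that default (untested, assuming the arch_run_bpf
prototype from the diff above):

/* Default: just run the prog. Archs that need to set up a tail call
 * counter at the entry of the bpf world override this.
 */
unsigned int __weak arch_run_bpf(const void *ctx,
				 const struct bpf_insn *insnsi,
				 bpf_func_t bpf_func)
{
	return bpf_func(ctx, insnsi);
}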

> Thanks,
> Leon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-30 10:00           ` Xu Kuohai
@ 2024-08-30 12:11             ` Leon Hwang
  2024-08-30 16:03               ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-08-30 12:11 UTC (permalink / raw)
  To: Xu Kuohai, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, puranjay,
	eddyz87, iii, kernel-patches-bot



On 2024/8/30 18:00, Xu Kuohai wrote:
> On 8/30/2024 5:08 PM, Leon Hwang wrote:
>>
>>
>> On 30/8/24 15:37, Xu Kuohai wrote:
>>> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>>>>
>>

[...]

>>
>> This approach is really cool!
>>
>> I want a similar approach on x86, but I failed, because on x86 it's an
>> indirect call, "call *rdx", aka "bpf_func(ctx, insnsi)".
>>
>> Let us imagine the arch_run_bpf() on x86:
>>
>> unsigned int __naked arch_run_bpf(const void *ctx, const struct bpf_insn
>> *insnsi, bpf_func_t bpf_func)
>> {
>>     asm (
>>         "pushq %rbp\n\t"
>>         "movq %rsp, %rbp\n\t"
>>         "xor %rax, %rax\n\t"
>>         "pushq %rax\n\t"
>>         "movq %rsp, %rax\n\t"
>>         "callq *%rdx\n\t"
>>         "leave\n\t"
>>         "ret\n\t"
>>     );
>> }
>>
>> If we can change "callq *%rdx" to a direct call, it'll be really
>> wonderful to resolve this tailcall issue on x86.
>>
> 
> Right, so we need a static call here; perhaps we can create a custom
> static call trampoline to set up the tail call counter.
> 
>> How do we introduce arch_run_bpf() for all JIT backends?
>>
> 
> It seems we cannot avoid arch-specific code. One approach could be
> to define a default __weak function that calls bpf_func directly,
> and let each arch provide its own overriding implementation.
> 

Hi Xu Kuohai,

Can you send a separate patch to fix this issue on arm64?

After you fix it, I'll send the patch to fix it on x64.

Thanks,
Leon

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-30 12:11             ` Leon Hwang
@ 2024-08-30 16:03               ` Alexei Starovoitov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2024-08-30 16:03 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Xu Kuohai, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Eddy Z,
	Ilya Leoshkevich, kernel-patches-bot

On Fri, Aug 30, 2024 at 5:11 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2024/8/30 18:00, Xu Kuohai wrote:
> > On 8/30/2024 5:08 PM, Leon Hwang wrote:
> >>
> >>
> >> On 30/8/24 15:37, Xu Kuohai wrote:
> >>> On 8/27/2024 10:23 AM, Leon Hwang wrote:
> >>>>
> >>
>
> [...]
>
> >>
> >> This approach is really cool!
> >>
> >> I want a similar approach on x86, but I failed, because on x86 it's an
> >> indirect call, "call *rdx", aka "bpf_func(ctx, insnsi)".
> >>
> >> Let us imagine the arch_run_bpf() on x86:
> >>
> >> unsigned int __naked arch_run_bpf(const void *ctx, const struct bpf_insn
> >> *insnsi, bpf_func_t bpf_func)
> >> {
> >>     asm (
> >>         "pushq %rbp\n\t"
> >>         "movq %rsp, %rbp\n\t"
> >>         "xor %rax, %rax\n\t"
> >>         "pushq %rax\n\t"
> >>         "movq %rsp, %rax\n\t"
> >>         "callq *%rdx\n\t"
> >>         "leave\n\t"
> >>         "ret\n\t"
> >>     );
> >> }
> >>
> >> If we can change "callq *%rdx" to a direct call, it'll be really
> >> wonderful to resolve this tailcall issue on x86.
> >>
> >
> > Right, so we need a static call here; perhaps we can create a custom
> > static call trampoline to set up the tail call counter.
> >
> >> How do we introduce arch_run_bpf() for all JIT backends?
> >>
> >
> > It seems we cannot avoid arch-specific code. One approach could be
> > to define a default __weak function that calls bpf_func directly,
> > and let each arch provide its own overriding implementation.
> >
>
> Hi Xu Kuohai,
>
> Can you send a separate patch to fix this issue on arm64?
>
> After you fix it, I'll send the patch to fix it on x64.

Hold on.
We're disabling freplace+tail_call in the verifier.
No need to change any JITs.
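
Something along these lines in the attach-target checks (a sketch
only; the exact location and message are illustrative, not final):

	/* reject the freplace + tail_call combination */
	if (prog->type == BPF_PROG_TYPE_EXT &&
	    prog->aux->tail_call_reachable) {
		bpf_log(log, "Extension progs with tail_call are not supported\n");
		return -EINVAL;
	}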

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-08-27 20:50       ` Alexei Starovoitov
  2024-08-28  2:36         ` Leon Hwang
@ 2024-09-02 10:19         ` Toke Høiland-Jørgensen
  2024-09-02 16:33           ` Vincent Li
  1 sibling, 1 reply; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-09-02 10:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Leon Hwang
  Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Yonghong Song, Puranjay Mohan,
	Xu Kuohai, Ilya Leoshkevich, kernel-patches-bot

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> > I wonder if disallowing freplace attachment when
>> > replacement.tail_call_reachable != replaced.tail_call_reachable
>> > would be a better option?
>> >
>>
>> This idea is wonderful.
>>
>> We can disallow attaching a tail_call_reachable freplace prog to a
>> not-tail_call_reachable bpf prog. So, the following 3 cases are allowed.
>>
>> 1. attach tail_call_reachable freplace prog to tail_call_reachable bpf prog.
>> 2. attach not-tail_call_reachable freplace prog to tail_call_reachable
>> bpf prog.
>> 3. attach not-tail_call_reachable freplace prog to
>> not-tail_call_reachable bpf prog.
>
> I think it's fine to disable freplace and tail_call combination
> altogether.

In the libxdp dispatcher we rely on the fact that an freplace program is
equivalent to a directly attached XDP program. And we've definitely seen
people using tail calls along with the libxdp dispatcher (e.g.,
https://github.com/xdp-project/xdp-tools/issues/377), so I don't think
it's a good idea to disable it entirely.

I think restricting the combinations should be fine, though - the libxdp
dispatcher will not end up in a tail call map unless someone is going
out of their way to do weird things :)

-Toke


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
  2024-09-02 10:19         ` Toke Høiland-Jørgensen
@ 2024-09-02 16:33           ` Vincent Li
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Li @ 2024-09-02 16:33 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Leon Hwang, Eduard Zingerman, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Yonghong Song, Puranjay Mohan, Xu Kuohai,
	Ilya Leoshkevich, kernel-patches-bot

On Mon, Sep 2, 2024 at 3:20 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Tue, Aug 27, 2024 at 5:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> > I wonder if disallowing freplace attachment when
> >> > replacement.tail_call_reachable != replaced.tail_call_reachable
> >> > would be a better option?
> >> >
> >>
> >> This idea is wonderful.
> >>
> >> We can disallow attaching a tail_call_reachable freplace prog to a
> >> not-tail_call_reachable bpf prog. So, the following 3 cases are allowed.
> >>
> >> 1. attach tail_call_reachable freplace prog to tail_call_reachable bpf prog.
> >> 2. attach not-tail_call_reachable freplace prog to tail_call_reachable
> >> bpf prog.
> >> 3. attach not-tail_call_reachable freplace prog to
> >> not-tail_call_reachable bpf prog.
> >
> > I think it's fine to disable freplace and tail_call combination
> > altogether.
>
> In the libxdp dispatcher we rely on the fact that an freplace program is
> equivalent to a directly attached XDP program. And we've definitely seen
> people using tail calls along with the libxdp dispatcher (e.g.,
> https://github.com/xdp-project/xdp-tools/issues/377), so I don't think
> it's a good idea to disable it entirely.
>

Thanks, Toke, for mentioning this use case. I use xdp-loader to load a
DNS XDP program that uses tail calls to do DNS rate limiting and DNS
cookie verification, see
https://github.com/vincentmli/xdp-tools/blob/vli-xdp-synproxy/xdp-dnsrrl/xdp_dnsrrl.bpf.c#L635-L644,
and I ship it as part of the XDP DDoS protection in an open source
firewall project, https://github.com/vincentmli/BPFire.

I hope this continues to be supported in the future :)

> I think restricting the combinations should be fine, though - the libxdp
> dispatcher will not end up in a tail call map unless someone is going
> out of their way to do weird things :)
>
> -Toke
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-08-30  7:37       ` Xu Kuohai
  2024-08-30  9:08         ` Leon Hwang
@ 2024-09-05  9:13         ` Puranjay Mohan
  2024-09-06 14:32           ` Leon Hwang
  1 sibling, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-09-05  9:13 UTC (permalink / raw)
  To: Xu Kuohai, Leon Hwang, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, eddyz87,
	iii, kernel-patches-bot

Xu Kuohai <xukuohai@huaweicloud.com> writes:

> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>> 
>> 
>> On 26/8/24 22:32, Xu Kuohai wrote:
>>> On 8/25/2024 9:09 PM, Leon Hwang wrote:
>>>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
>>>> issue happens on arm64, too.
>>>>
>> 
>> [...]
>> 
>>>
>>> This patch makes the arm64 jited prologue even more complex. I've
>>> posted a series [1] to simplify the arm64 jited prologue/epilogue.
>>> I think we can fix this issue based on [1]. I'll give it a try.
>>>
>>> [1]
>>> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
>>>
>> 
>> Your patch series seems great. We can fix this issue based on it.
>>
>> Please notify me if you succeed.
>> 
>
> I think the complexity arises from having to decide whether
> to initialize or keep the tail call counter value in the prologue.
>
> To get rid of this complexity, a straightforward idea is to
> move the tail call counter initialization to the entry of the
> bpf world, and in the bpf world, we only increase and check
> the tail call counter, never save/restore or set it. The
> "entry of the bpf world" here refers to mechanisms like
> bpf_prog_run, the bpf dispatcher, or a bpf trampoline, which
> allow bpf progs to be invoked from C functions.
>
> Below is a rough POC diff for arm64 that could pass all
> of your tests. The tail call counter is held in callee-saved
> register x26, and is set to 0 by arch_run_bpf.

I like this approach as it removes all the complexity of handling the tcc in
different cases. Can we go ahead with this for arm64 and make
arch_run_bpf a weak function, letting other architectures override it
if they want a similar approach? Archs that want to do something else
can simply skip implementing arch_run_bpf.

Thanks,
Puranjay

>
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 8aa32cb140b9..2c0f7daf1655 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -26,7 +26,7 @@
>
>   #define TMP_REG_1 (MAX_BPF_JIT_REG + 0)
>   #define TMP_REG_2 (MAX_BPF_JIT_REG + 1)
> -#define TCCNT_PTR (MAX_BPF_JIT_REG + 2)
> +#define TCALL_CNT (MAX_BPF_JIT_REG + 2)
>   #define TMP_REG_3 (MAX_BPF_JIT_REG + 3)
>   #define ARENA_VM_START (MAX_BPF_JIT_REG + 5)
>
> @@ -63,7 +63,7 @@ static const int bpf2a64[] = {
>   	[TMP_REG_2] = A64_R(11),
>   	[TMP_REG_3] = A64_R(12),
>   	/* tail_call_cnt_ptr */
> -	[TCCNT_PTR] = A64_R(26),
> +	[TCALL_CNT] = A64_R(26), // x26 is used to hold tail call counter
>   	/* temporary register for blinding constants */
>   	[BPF_REG_AX] = A64_R(9),
>   	/* callee saved register for kern_vm_start address */
> @@ -286,19 +286,6 @@ static bool is_lsi_offset(int offset, int scale)
>    *      // PROLOGUE_OFFSET
>    *	// save callee-saved registers
>    */
> -static void prepare_bpf_tail_call_cnt(struct jit_ctx *ctx)
> -{
> -	const bool is_main_prog = !bpf_is_subprog(ctx->prog);
> -	const u8 ptr = bpf2a64[TCCNT_PTR];
> -
> -	if (is_main_prog) {
> -		/* Initialize tail_call_cnt. */
> -		emit(A64_PUSH(A64_ZR, ptr, A64_SP), ctx);
> -		emit(A64_MOV(1, ptr, A64_SP), ctx);
> -	} else
> -		emit(A64_PUSH(ptr, ptr, A64_SP), ctx);
> -}
> -
>   static void find_used_callee_regs(struct jit_ctx *ctx)
>   {
>   	int i;
> @@ -419,7 +406,7 @@ static void pop_callee_regs(struct jit_ctx *ctx)
>   #define POKE_OFFSET (BTI_INSNS + 1)
>
>   /* Tail call offset to jump into */
> -#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 4)
> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 2)
>
>   static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>   {
> @@ -473,8 +460,6 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>   		emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>   		emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>
> -		prepare_bpf_tail_call_cnt(ctx);
> -
>   		if (!ebpf_from_cbpf && is_main_prog) {
>   			cur_offset = ctx->idx - idx0;
>   			if (cur_offset != PROLOGUE_OFFSET) {
> @@ -499,7 +484,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>   		 *
>   		 * 12 registers are on the stack
>   		 */
> -		emit(A64_SUB_I(1, A64_SP, A64_FP, 96), ctx);
> +		emit(A64_SUB_I(1, A64_SP, A64_FP, 80), ctx);
>   	}
>
>   	if (ctx->fp_used)
> @@ -527,8 +512,7 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>
>   	const u8 tmp = bpf2a64[TMP_REG_1];
>   	const u8 prg = bpf2a64[TMP_REG_2];
> -	const u8 tcc = bpf2a64[TMP_REG_3];
> -	const u8 ptr = bpf2a64[TCCNT_PTR];
> +	const u8 tcc = bpf2a64[TCALL_CNT];
>   	size_t off;
>   	__le32 *branch1 = NULL;
>   	__le32 *branch2 = NULL;
> @@ -546,16 +530,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>   	emit(A64_NOP, ctx);
>
>   	/*
> -	 * if ((*tail_call_cnt_ptr) >= MAX_TAIL_CALL_CNT)
> +	 * if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
>   	 *     goto out;
>   	 */
>   	emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
> -	emit(A64_LDR64I(tcc, ptr, 0), ctx);
>   	emit(A64_CMP(1, tcc, tmp), ctx);
>   	branch2 = ctx->image + ctx->idx;
>   	emit(A64_NOP, ctx);
>
> -	/* (*tail_call_cnt_ptr)++; */
> +	/* tail_call_cnt++; */
>   	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
>
>   	/* prog = array->ptrs[index];
> @@ -570,9 +553,6 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
>   	branch3 = ctx->image + ctx->idx;
>   	emit(A64_NOP, ctx);
>
> -	/* Update tail_call_cnt if the slot is populated. */
> -	emit(A64_STR64I(tcc, ptr, 0), ctx);
> -
>   	/* restore SP */
>   	if (ctx->stack_size)
>   		emit(A64_ADD_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
> @@ -793,6 +773,27 @@ asm (
>   "	.popsection\n"
>   );
>
> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
> +asm (
> +"	.pushsection .text, \"ax\", @progbits\n"
> +"	.global arch_run_bpf\n"
> +"	.type arch_run_bpf, %function\n"
> +"arch_run_bpf:\n"
> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
> +"	bti j\n"
> +#endif
> +"	stp x29, x30, [sp, #-16]!\n"
> +"	stp xzr, x26, [sp, #-16]!\n"
> +"	mov x26, #0\n"
> +"	blr x2\n"
> +"	ldp xzr, x26, [sp], #16\n"
> +"	ldp x29, x30, [sp], #16\n"
> +"	ret x30\n"
> +"	.size arch_run_bpf, . - arch_run_bpf\n"
> +"	.popsection\n"
> +);
> +EXPORT_SYMBOL_GPL(arch_run_bpf);
> +
>   /* build a plt initialized like this:
>    *
>    * plt:
> @@ -826,7 +827,6 @@ static void build_plt(struct jit_ctx *ctx)
>   static void build_epilogue(struct jit_ctx *ctx)
>   {
>   	const u8 r0 = bpf2a64[BPF_REG_0];
> -	const u8 ptr = bpf2a64[TCCNT_PTR];
>
>   	/* We're done with BPF stack */
>   	if (ctx->stack_size)
> @@ -834,8 +834,6 @@ static void build_epilogue(struct jit_ctx *ctx)
>
>   	pop_callee_regs(ctx);
>
> -	emit(A64_POP(A64_ZR, ptr, A64_SP), ctx);
> -
>   	/* Restore FP/LR registers */
>   	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>
> @@ -2066,6 +2064,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>   	bool save_ret;
>   	__le32 **branches = NULL;
>
> +	bool target_is_bpf = is_bpf_text_address((unsigned long)func_addr);
> +
>   	/* trampoline stack layout:
>   	 *                  [ parent ip         ]
>   	 *                  [ FP                ]
> @@ -2133,6 +2133,11 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>   	 */
>   	emit_bti(A64_BTI_JC, ctx);
>
> +	if (!target_is_bpf) {
> +		emit(A64_PUSH(A64_ZR, A64_R(26), A64_SP), ctx);
> +		emit(A64_MOVZ(1, A64_R(26), 0, 0), ctx);
> +	}
> +
>   	/* frame for parent function */
>   	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
>   	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> @@ -2226,6 +2231,8 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>   	/* pop frames  */
>   	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>   	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
> +	if (!target_is_bpf)
> +		emit(A64_POP(A64_ZR, A64_R(26), A64_SP), ctx);
>
>   	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
>   		/* skip patched function, return to parent */
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index dc63083f76b7..8660d15dd50c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1244,12 +1244,14 @@ struct bpf_dispatcher {
>   #define __bpfcall __nocfi
>   #endif
>
> +unsigned int arch_run_bpf(const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func);
> +
>   static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func(
>   	const void *ctx,
>   	const struct bpf_insn *insnsi,
>   	bpf_func_t bpf_func)
>   {
> -	return bpf_func(ctx, insnsi);
> +	return arch_run_bpf(ctx, insnsi, bpf_func);
>   }
>
>   /* the implementation of the opaque uapi struct bpf_dynptr */
> @@ -1317,7 +1319,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func
>   #else
>   #define __BPF_DISPATCHER_SC_INIT(name)
>   #define __BPF_DISPATCHER_SC(name)
> -#define __BPF_DISPATCHER_CALL(name)		bpf_func(ctx, insnsi)
> +#define __BPF_DISPATCHER_CALL(name)		arch_run_bpf(ctx, insnsi, bpf_func)
>   #define __BPF_DISPATCHER_UPDATE(_d, _new)
>   #endif
>
>> Thanks,
>> Leon

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-09-05  9:13         ` Puranjay Mohan
@ 2024-09-06 14:32           ` Leon Hwang
  2024-09-06 15:24             ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Hwang @ 2024-09-06 14:32 UTC (permalink / raw)
  To: Puranjay Mohan, Xu Kuohai, bpf
  Cc: ast, daniel, andrii, toke, martin.lau, yonghong.song, eddyz87,
	iii, kernel-patches-bot



On 2024/9/5 17:13, Puranjay Mohan wrote:
> Xu Kuohai <xukuohai@huaweicloud.com> writes:
> 
>> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>>>
>>>
>>> On 26/8/24 22:32, Xu Kuohai wrote:
>>>> On 8/25/2024 9:09 PM, Leon Hwang wrote:
>>>>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
>>>>> issue happens on arm64, too.
>>>>>
>>>
>>> [...]
>>>
>>>>
>>>> This patch makes the arm64 jited prologue even more complex. I've
>>>> posted a series [1] to simplify the arm64 jited prologue/epilogue.
>>>> I think we can fix this issue based on [1]. I'll give it a try.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
>>>>
>>>
>>> Your patch series seems great. We can fix this issue based on it.
>>>
>>> Please notify me if you succeed.
>>>
>>
>> I think the complexity arises from having to decide whether
>> to initialize or keep the tail call counter value in the prologue.
>>
>> To get rid of this complexity, a straightforward idea is to
>> move the tail call counter initialization to the entry of the
>> bpf world, and in the bpf world, we only increase and check
>> the tail call counter, never save/restore or set it. The
>> "entry of the bpf world" here refers to mechanisms like
>> bpf_prog_run, the bpf dispatcher, or a bpf trampoline, which
>> allow bpf progs to be invoked from C functions.
>>
>> Below is a rough POC diff for arm64 that could pass all
>> of your tests. The tail call counter is held in callee-saved
>> register x26, and is set to 0 by arch_run_bpf.
> 
> I like this approach as it removes all the complexity of handling the tcc in

I like this approach, too.

> different cases. Can we go ahead with this for arm64 and make
> arch_run_bpf a weak function, letting other architectures override it
> if they want a similar approach? Archs that want to do something else
> can simply skip implementing arch_run_bpf.
> 

Hi Alexei,

What do you think about this idea?

Thanks,
Leon

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-09-06 14:32           ` Leon Hwang
@ 2024-09-06 15:24             ` Alexei Starovoitov
  2024-09-07  7:03               ` Xu Kuohai
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2024-09-06 15:24 UTC (permalink / raw)
  To: Leon Hwang
  Cc: Puranjay Mohan, Xu Kuohai, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko,
	Toke Høiland-Jørgensen, Martin KaFai Lau, Yonghong Song,
	Eddy Z, Ilya Leoshkevich, kernel-patches-bot

On Fri, Sep 6, 2024 at 7:32 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2024/9/5 17:13, Puranjay Mohan wrote:
> > Xu Kuohai <xukuohai@huaweicloud.com> writes:
> >
> >> On 8/27/2024 10:23 AM, Leon Hwang wrote:
> >>>
> >>>
> >>> On 26/8/24 22:32, Xu Kuohai wrote:
> >>>> On 8/25/2024 9:09 PM, Leon Hwang wrote:
> >>>>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
> >>>>> issue happens on arm64, too.
> >>>>>
> >>>
> >>> [...]
> >>>
> >>>>
> >>>> This patch makes the arm64 jited prologue even more complex. I've
> >>>> posted a series [1] to simplify the arm64 jited prologue/epilogue.
> >>>> I think we can fix this issue based on [1]. I'll give it a try.
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
> >>>>
> >>>
> >>> Your patch series seems great. We can fix this issue based on it.
> >>>
> >>> Please notify me if you succeed.
> >>>
> >>
> >> I think the complexity arises from having to decide whether
> >> to initialize or keep the tail call counter value in the prologue.
> >>
> >> To get rid of this complexity, a straightforward idea is to
> >> move the tail call counter initialization to the entry of the
> >> bpf world, and in the bpf world, we only increase and check
> >> the tail call counter, never save/restore or set it. The
> >> "entry of the bpf world" here refers to mechanisms like
> >> bpf_prog_run, the bpf dispatcher, or a bpf trampoline, which
> >> allow bpf progs to be invoked from C functions.
> >>
> >> Below is a rough POC diff for arm64 that could pass all
> >> of your tests. The tail call counter is held in callee-saved
> >> register x26, and is set to 0 by arch_run_bpf.
> >
> > I like this approach as it removes all the complexity of handling the tcc in
>
> I like this approach, too.
>
> > different cases. Can we go ahead with this for arm64 and make
> > arch_run_bpf a weak function, letting other architectures override it
> > if they want a similar approach? Archs that want to do something else
> > can simply skip implementing arch_run_bpf.
> >
>
> Hi Alexei,
>
> What do you think about this idea?

This was discussed before and no, we're not going to add an extra tcc init
to bpf_prog_run and penalize everybody for this niche case.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf, arm64: Fix tailcall infinite loop caused by freplace
  2024-09-06 15:24             ` Alexei Starovoitov
@ 2024-09-07  7:03               ` Xu Kuohai
  0 siblings, 0 replies; 24+ messages in thread
From: Xu Kuohai @ 2024-09-07  7:03 UTC (permalink / raw)
  To: Alexei Starovoitov, Leon Hwang
  Cc: Puranjay Mohan, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Toke Høiland-Jørgensen,
	Martin KaFai Lau, Yonghong Song, Eddy Z, Ilya Leoshkevich,
	kernel-patches-bot

On 9/6/2024 11:24 PM, Alexei Starovoitov wrote:
> On Fri, Sep 6, 2024 at 7:32 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>
>>
>> On 2024/9/5 17:13, Puranjay Mohan wrote:
>>> Xu Kuohai <xukuohai@huaweicloud.com> writes:
>>>
>>>> On 8/27/2024 10:23 AM, Leon Hwang wrote:
>>>>>
>>>>>
>>>>> On 26/8/24 22:32, Xu Kuohai wrote:
>>>>>> On 8/25/2024 9:09 PM, Leon Hwang wrote:
>>>>>>> Like "bpf, x64: Fix tailcall infinite loop caused by freplace", the same
>>>>>>> issue happens on arm64, too.
>>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>
>>>>>> This patch makes the arm64 jited prologue even more complex. I've
>>>>>> posted a series [1] to simplify the arm64 jited prologue/epilogue.
>>>>>> I think we can fix this issue based on [1]. I'll give it a try.
>>>>>>
>>>>>> [1]
>>>>>> https://lore.kernel.org/bpf/20240826071624.350108-1-xukuohai@huaweicloud.com/
>>>>>>
>>>>>
>>>>> Your patch series seems great. We can fix this issue based on it.
>>>>>
>>>>> Please notify me if you succeed.
>>>>>
>>>>
>>>> I think the complexity arises from having to decide whether
>>>> to initialize or keep the tail call counter value in the prologue.
>>>>
>>>> To get rid of this complexity, a straightforward idea is to
>>>> move the tail call counter initialization to the entry of the
>>>> bpf world, and in the bpf world, we only increase and check
>>>> the tail call counter, never save/restore or set it. The
>>>> "entry of the bpf world" here refers to mechanisms like
>>>> bpf_prog_run, the bpf dispatcher, or a bpf trampoline, which
>>>> allow bpf progs to be invoked from C functions.
>>>>
>>>> Below is a rough POC diff for arm64 that could pass all
>>>> of your tests. The tail call counter is held in callee-saved
>>>> register x26, and is set to 0 by arch_run_bpf.
>>>
>>> I like this approach as it removes all the complexity of handling the tcc in
>>
>> I like this approach, too.
>>
>>> different cases. Can we go ahead with this for arm64 and make
>>> arch_run_bpf a weak function, letting other architectures override it
>>> if they want a similar approach? Archs that want to do something else
>>> can simply skip implementing arch_run_bpf.
>>>
>>
>> Hi Alexei,
>>
>> What do you think about this idea?
> 
> This was discussed before and no, we're not going to add an extra tcc init
> to bpf_prog_run and penalize everybody for this niche case.

+1, we should avoid hacking the JITs and adding complexity just for a niche case.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-09-07  7:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
2024-08-27 10:37   ` Eduard Zingerman
2024-08-27 12:48     ` Leon Hwang
2024-08-27 20:50       ` Alexei Starovoitov
2024-08-28  2:36         ` Leon Hwang
2024-08-28 16:01           ` Alexei Starovoitov
2024-08-29  2:14             ` Leon Hwang
2024-09-02 10:19         ` Toke Høiland-Jørgensen
2024-09-02 16:33           ` Vincent Li
2024-08-25 13:09 ` [PATCH bpf-next 2/4] bpf, arm64: " Leon Hwang
2024-08-26 14:32   ` Xu Kuohai
2024-08-27  2:23     ` Leon Hwang
2024-08-30  7:37       ` Xu Kuohai
2024-08-30  9:08         ` Leon Hwang
2024-08-30 10:00           ` Xu Kuohai
2024-08-30 12:11             ` Leon Hwang
2024-08-30 16:03               ` Alexei Starovoitov
2024-09-05  9:13         ` Puranjay Mohan
2024-09-06 14:32           ` Leon Hwang
2024-09-06 15:24             ` Alexei Starovoitov
2024-09-07  7:03               ` Xu Kuohai
2024-08-25 13:09 ` [PATCH bpf-next 3/4] selftests/bpf: Add testcases for another tailcall infinite loop fixing Leon Hwang
2024-08-25 13:09 ` [PATCH bpf-next 4/4] selftests/bpf: Fix verifier tailcall jit selftest Leon Hwang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox