All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leon Hwang <hffilwlqm@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Pu Lehui <pulehui@huawei.com>, bpf <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Ilya Leoshkevich <iii@linux.ibm.com>,
	Hengqi Chen <hengqi.chen@gmail.com>,
	kernel-patches-bot@fb.com
Subject: Re: [PATCH bpf-next v2 1/2] bpf, x64: Fix tailcall hierarchy
Date: Wed, 28 Feb 2024 22:30:54 +0800	[thread overview]
Message-ID: <c5d7fcbf-a798-4c2d-8978-7c19ae7eafa3@gmail.com> (raw)
In-Reply-To: <CAADnVQ+Fr0k+28L=BLaXDxu=wDBwpiuCaXzEN3y6jEXj48zhUg@mail.gmail.com>



On 2024/2/27 06:12, Alexei Starovoitov wrote:
> On Mon, Feb 26, 2024 at 7:32 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>
>>
>>
>> On 2024/2/24 00:35, Alexei Starovoitov wrote:
>>> On Fri, Feb 23, 2024 at 7:30 AM Leon Hwang <hffilwlqm@gmail.com> wrote:

[SNIP]

>>
>> Let's take a look at tailcall3.c selftest:
>>
>> struct {
>>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>>         __uint(max_entries, 1);
>>         __uint(key_size, sizeof(__u32));
>>         __uint(value_size, sizeof(__u32));
>> } jmp_table SEC(".maps");
>>
>> int count = 0;
>>
>> SEC("tc")
>> int classifier_0(struct __sk_buff *skb)
>> {
>>         count++;
>>         bpf_tail_call_static(skb, &jmp_table, 0);
>>         return 1;
>> }
>>
>> SEC("tc")
>> int entry(struct __sk_buff *skb)
>> {
>>         bpf_tail_call_static(skb, &jmp_table, 0);
>>         return 0;
>> }
>>
>> Here, classifier_0 is populated to jmp_table.
>>
>> Then, at classifier_0's prologue, when we 'move rax,
>> classifier_0->tail_call_cnt' in order to use the PERCPU tail_call_cnt in
>> 'struct bpf_prog' for current run-time, it fails to run selftests. It's
>> because the tail_call_cnt is not from the entry bpf prog. The
>> tail_call_cnt from the entry bpf prog is the expected one, even though
>> classifier_0 bpf prog runs. (It seems that it's unnecessary to provide
>> the diff of the exclusive approach with PERCPU tail_call_cnt.)
> 
> Not following.
> With percpu tail_call_cnt, does classifier_0 loop forever ? I doubt it.
> You mean expected 'count' value is different?
> The test expected 33 and instead it's ... what?

Yeah, the test result is 34 instead of expected 33.

test_tailcall_count:PASS:tailcall 0 nsec
test_tailcall_count:PASS:tailcall retval 0 nsec
test_tailcall_count:PASS:tailcall count 0 nsec
test_tailcall_count:FAIL:tailcall count unexpected tailcall count:
actual 34 != expected 33
test_tailcall_count:PASS:tailcall 0 nsec
test_tailcall_count:PASS:tailcall retval 0 nsec
#311/3   tailcalls/tailcall_3:FAIL

> 
>> +               if (tail_call_reachable && !is_subprog) {
>> +                       /* mov rax, entry->tail_call_cnt */
>> +                       EMIT2_off64(0x48, 0xB8, (u64) entry->tail_call_cnt);
>> +                       /* call bpf_tail_call_cnt_prepare */
>> +                       emit_call(&prog, bpf_tail_call_cnt_prepare,
>> +                                 ip + (prog - start));
>> +               } else {
>>                         /* Keep the same instruction layout. */
>> -                       EMIT2(0x66, 0x90); /* nop2 */
>> +                       emit_nops(&prog, 10 + X86_PATCH_SIZE);
> 
> As mentioned before... such "fix" is not acceptable.
> We will not be penalizing all progs this way.
> 
> How about we make percpu tail_call_cnt per prog_array map,

No, we can not store percpu tail_call_cnt on either bpf prog or
prog_array map.

Considering this case:

1. prog1 tailcall prog2 with prog_array map1.
2. prog2 tailcall prog3 with prog_array map2.
3. prog3 tailcall prog4 with prog_array map3.
4. ...

We can not store percpu tail_call_cnt on either prog1 or prog_array map1.

In conclusion, tail_call_cnt is a run-time variable that should be
stored on stack ideally.

Can we store tail_call_cnt on stack and then propagate tcc_ptr by some
way instead of rax?

> then remove rax as this patch does,
> but instead of zeroing tcc on entry, zero it on exit.
> While processing bpf_exit add:
> if (tail_call_reachable)
>   emit_call(&prog, bpf_tail_call_cnt_prepare,...)
> 
> if prog that tailcalls get preempted on this cpu and
> another prog starts that also tailcalls it won't zero the count.
> This way we can remove nop5 from prologue too.
> 
> The preempted prog will eventually zero ttc on exit,
> and earlier prog that uses the same prog_array can tail call more
> than 32 times, but it cannot be abused reliably,
> since preemption is non deterministic.


We can not zeroing tcc on exit. If zero it on exit, the selftests of
this patchset will run forever, e.g.

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

int count = 0;

static __noinline
int subprog_tail(struct __sk_buff *skb)
{
	bpf_tail_call_static(skb, &jmp_table, 0);
	return 0;
}

SEC("tc")
int entry(struct __sk_buff *skb)
{
	int ret = 1;

	count++;
	subprog_tail(skb); /* tailcall-pos1 */
	subprog_tail(skb); /* tailcall-pos2 */

	return ret; /* zeroing tcc */
}

The jmp_table populates with the entry bpf prog.

The entry bpf prog zeros tcc always when returns. So, after the entry
bpf prog returns from subprog_tail() at tailcall-pos1, tcc has been
reset to 0, and the entry bpf prog tailcalled from subprog_tail() at
tailcall-pos2 can run forever.



Here's another alternative approach. Like this PATCH v2, it's ok to
initialise tcc as MAX_TAIL_CALL_CNT, and then decrement it when tailcall
happens. This approach does same with this PATCH v2, and passes all
selftests.

Here's the diff:

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e1390d1e3..72773899a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/stringify.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
@@ -18,6 +19,7 @@
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
+#include <asm/percpu.h>

 static bool all_callee_regs_used[4] = {true, true, true, true};

@@ -259,7 +261,7 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 /* Number of bytes that will be skipped on tailcall */
-#define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
+#define X86_TAIL_CALL_OFFSET	(14 + ENDBR_INSN_SIZE)

 static void push_r12(u8 **pprog)
 {
@@ -389,6 +391,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
 	*pprog = prog;
 }

+static int emit_call(u8 **pprog, void *func, void *ip);
+static __used void bpf_tail_call_cnt_prepare(void);
+
 /*
  * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
@@ -396,9 +401,9 @@ static void emit_cfi(u8 **pprog, u32 hash)
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 			  bool tail_call_reachable, bool is_subprog,
-			  bool is_exception_cb)
+			  bool is_exception_cb, u8 *ip)
 {
-	u8 *prog = *pprog;
+	u8 *prog = *pprog, *start = *pprog;

 	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
 	/* BPF trampoline can be made to work without these nops,
@@ -407,13 +412,11 @@ static void emit_prologue(u8 **pprog, u32
stack_depth, bool ebpf_from_cbpf,
 	emit_nops(&prog, X86_PATCH_SIZE);
 	if (!ebpf_from_cbpf) {
 		if (tail_call_reachable && !is_subprog)
-			/* When it's the entry of the whole tailcall context,
-			 * zeroing rax means initialising tail_call_cnt.
-			 */
-			EMIT2(0x31, 0xC0); /* xor eax, eax */
+			emit_call(&prog, bpf_tail_call_cnt_prepare,
+				  ip + (prog - start));
 		else
 			/* Keep the same instruction layout. */
-			EMIT2(0x66, 0x90); /* nop2 */
+			emit_nops(&prog, X86_PATCH_SIZE);
 	}
 	/* Exception callback receives FP as third parameter */
 	if (is_exception_cb) {
@@ -438,8 +441,6 @@ static void emit_prologue(u8 **pprog, u32
stack_depth, bool ebpf_from_cbpf,
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	if (tail_call_reachable)
-		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }

@@ -575,13 +576,53 @@ static void emit_return(u8 **pprog, u8 *ip)
 	*pprog = prog;
 }

+DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
+
+static __used void bpf_tail_call_cnt_prepare(void)
+{
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
+	 *
+	 * *tcc_ptr = MAX_TAIL_CALL_CNT;
+	 */
+
+	asm volatile (
+	    "addq " __percpu_arg(0) ", %1\n\t"
+	    "movl $" __stringify(MAX_TAIL_CALL_CNT) ", (%1)\n\t"
+	    :
+	    : "m" (this_cpu_off), "r" (&bpf_tail_call_cnt)
+	);
+}
+
+static __used u32 *bpf_tail_call_cnt_ptr(void)
+{
+	u32 *tcc_ptr;
+
+	/* The following asm equals to
+	 *
+	 * u32 *tcc_ptr = this_cpu_ptr(&bpf_tail_call_cnt);
+	 *
+	 * return tcc_ptr;
+	 */
+
+	asm volatile (
+	    "addq " __percpu_arg(1) ", %2\n\t"
+	    "movq %2, %0\n\t"
+	    : "=r" (tcc_ptr)
+	    : "m" (this_cpu_off), "r" (&bpf_tail_call_cnt)
+	);
+
+	return tcc_ptr;
+}
+
 /*
  * Generate the following code:
  *
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
  *     goto out;
- *   if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+ *   if ((*tcc_ptr)-- == 0)
  *     goto out;
  *   prog = array->ptrs[index];
  *   if (prog == NULL)
@@ -594,7 +635,6 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;

@@ -616,16 +656,16 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 	EMIT2(X86_JBE, offset);                   /* jbe out */

 	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+	 * if ((*tcc_ptr)-- == 0)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp -
tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	/* call bpf_tail_call_cnt_ptr */
+	emit_call(&prog, bpf_tail_call_cnt_ptr, ip + (prog - start));
+	EMIT3(0x83, 0x38, 0);                     /* cmp dword ptr [rax], 0 */

 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
-	EMIT2(X86_JAE, offset);                   /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp -
tcc_off], eax */
+	EMIT2(X86_JE, offset);                    /* je out */
+	EMIT2(0xFF, 0x08);                        /* dec dword ptr [rax] */

 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 +
offsetof(...)] */
@@ -647,7 +687,6 @@ static void emit_bpf_tail_call_indirect(struct
bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}

-	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4,     /* add rsp, sd */
 			    round_up(stack_depth, 8));
@@ -675,21 +714,20 @@ static void emit_bpf_tail_call_direct(struct
bpf_prog *bpf_prog,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;

 	/*
-	 * if (tail_call_cnt++ >= MAX_TAIL_CALL_CNT)
+	 * if ((*tcc_ptr)-- == 0)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr
[rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax,
MAX_TAIL_CALL_CNT */
+	/* call bpf_tail_call_cnt_ptr */
+	emit_call(&prog, bpf_tail_call_cnt_ptr, ip);
+	EMIT3(0x83, 0x38, 0);                         /* cmp dword ptr [rax], 0 */

 	offset = ctx->tail_call_direct_label - (prog + 2 - start);
-	EMIT2(X86_JAE, offset);                       /* jae out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp -
tcc_off], eax */
+	EMIT2(X86_JE, offset);                        /* je out */
+	EMIT2(0xFF, 0x08);                            /* dec dword ptr [rax] */

 	poke->tailcall_bypass = ip + (prog - start);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -706,7 +744,6 @@ static void emit_bpf_tail_call_direct(struct
bpf_prog *bpf_prog,
 		pop_callee_regs(&prog, callee_regs_used);
 	}

-	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));

@@ -1133,10 +1170,6 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg,
u8 src_reg, bool is64, u8 op)

 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))

-/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
-#define RESTORE_TAIL_CALL_CNT(stack)				\
-	EMIT3_off32(0x48, 0x8B, 0x85, -round_up(stack, 8) - 8)
-
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8
*rw_image,
 		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
@@ -1160,7 +1193,8 @@ static int do_jit(struct bpf_prog *bpf_prog, int
*addrs, u8 *image, u8 *rw_image

 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb,
+		      image);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -1752,17 +1786,11 @@ st:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_CALL: {
 			int offs;

+			if (!imm32)
+				return -EINVAL;
+
 			func = (u8 *) __bpf_call_base + imm32;
-			if (tail_call_reachable) {
-				RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
-				if (!imm32)
-					return -EINVAL;
-				offs = 7 + x86_call_depth_emit_accounting(&prog, func);
-			} else {
-				if (!imm32)
-					return -EINVAL;
-				offs = x86_call_depth_emit_accounting(&prog, func);
-			}
+			offs = x86_call_depth_emit_accounting(&prog, func);
 			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
 				return -EINVAL;
 			break;
@@ -2550,7 +2578,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 	 *                     [ ...        ]
 	 *                     [ stack_arg2 ]
 	 * RBP - arg_stack_off [ stack_arg1 ]
-	 * RSP                 [ tail_call_cnt ] BPF_TRAMP_F_TAIL_CALL_CTX
 	 */

 	/* room for return value of orig_call or fentry prog */
@@ -2622,8 +2649,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 		/* sub rsp, stack_size */
 		EMIT4(0x48, 0x83, 0xEC, stack_size);
 	}
-	if (flags & BPF_TRAMP_F_TAIL_CALL_CTX)
-		EMIT1(0x50);		/* push rax */
 	/* mov QWORD PTR [rbp - rbx_off], rbx */
 	emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_6, -rbx_off);

@@ -2678,16 +2703,9 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 		restore_regs(m, &prog, regs_off);
 		save_args(m, &prog, arg_stack_off, true);

-		if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-			/* Before calling the original function, restore the
-			 * tail_call_cnt from stack to rax.
-			 */
-			RESTORE_TAIL_CALL_CNT(stack_size);
-		}
-
 		if (flags & BPF_TRAMP_F_ORIG_STACK) {
-			emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8);
-			EMIT2(0xff, 0xd3); /* call *rbx */
+			emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8);
+			EMIT2(0xff, 0xd0); /* call *rax */
 		} else {
 			/* call original function */
 			if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) {
@@ -2740,11 +2758,6 @@ static int __arch_prepare_bpf_trampoline(struct
bpf_tramp_image *im, void *rw_im
 			ret = -EINVAL;
 			goto cleanup;
 		}
-	} else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) {
-		/* Before running the original function, restore the
-		 * tail_call_cnt from stack to rax.
-		 */
-		RESTORE_TAIL_CALL_CNT(stack_size);
 	}

 	/* restore return value of orig_call or fentry prog back into RAX */

Thanks,
Leon

  reply	other threads:[~2024-02-28 14:31 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-22  8:52 [PATCH bpf-next v2 0/2] bpf, x64: Fix tailcall hierarchy Leon Hwang
2024-02-22  8:52 ` [PATCH bpf-next v2 1/2] " Leon Hwang
2024-02-22 10:59   ` Leon Hwang
2024-02-23  4:06   ` Pu Lehui
2024-02-23 15:30     ` Leon Hwang
2024-02-23 16:35       ` Alexei Starovoitov
2024-02-26 15:32         ` Leon Hwang
2024-02-26 16:04           ` Leon Hwang
2024-02-26 22:12           ` Alexei Starovoitov
2024-02-28 14:30             ` Leon Hwang [this message]
2024-03-29 18:49               ` Alexei Starovoitov
2024-02-23 17:21   ` Alexei Starovoitov
2024-02-22  8:52 ` [PATCH bpf-next v2 2/2] selftests/bpf: Add testcases for tailcall hierarchy fixing Leon Hwang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c5d7fcbf-a798-4c2d-8978-7c19ae7eafa3@gmail.com \
    --to=hffilwlqm@gmail.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=hengqi.chen@gmail.com \
    --cc=iii@linux.ibm.com \
    --cc=jakub@cloudflare.com \
    --cc=kernel-patches-bot@fb.com \
    --cc=maciej.fijalkowski@intel.com \
    --cc=pulehui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.