From: Eduard Zingerman <eddyz87@gmail.com>
To: Leon Hwang <leon.hwang@linux.dev>, bpf@vger.kernel.org
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
toke@redhat.com, martin.lau@kernel.org, yonghong.song@linux.dev,
puranjay@kernel.org, xukuohai@huaweicloud.com,
iii@linux.ibm.com, kernel-patches-bot@fb.com
Subject: Re: [PATCH bpf-next 1/4] bpf, x64: Fix tailcall infinite loop caused by freplace
Date: Tue, 27 Aug 2024 03:37:45 -0700 [thread overview]
Message-ID: <699f5798e7d982baa2e6d4b6383ab6cd588ef5a9.camel@gmail.com> (raw)
In-Reply-To: <20240825130943.7738-2-leon.hwang@linux.dev>
On Sun, 2024-08-25 at 21:09 +0800, Leon Hwang wrote:
> This patch fixes a tailcall infinite loop issue caused by freplace.
>
> Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"),
> freplace prog is allowed to tail call its target prog. Then, when a
> freplace prog attaches to its target prog's subprog and tail calls its
> target prog, kernel will panic.
>
> For example:
>
> tc_bpf2bpf.c:
>
> // SPDX-License-Identifier: GPL-2.0
> \#include <linux/bpf.h>
> \#include <bpf/bpf_helpers.h>
>
> __noinline
> int subprog_tc(struct __sk_buff *skb)
> {
> return skb->len * 2;
> }
>
> SEC("tc")
> int entry_tc(struct __sk_buff *skb)
> {
> return subprog_tc(skb);
> }
>
> char __license[] SEC("license") = "GPL";
>
> tailcall_bpf2bpf_hierarchy_freplace.c:
>
> // SPDX-License-Identifier: GPL-2.0
> \#include <linux/bpf.h>
> \#include <bpf/bpf_helpers.h>
>
> struct {
> __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
> __uint(max_entries, 1);
> __uint(key_size, sizeof(__u32));
> __uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
>
> int count = 0;
>
> static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> bpf_tail_call_static(skb, &jmp_table, 0);
> return 0;
> }
>
> SEC("freplace")
> int entry_freplace(struct __sk_buff *skb)
> {
> count++;
> subprog_tail(skb);
> subprog_tail(skb);
> return count;
> }
>
> char __license[] SEC("license") = "GPL";
>
> The attach target of entry_freplace is subprog_tc, and the tail callee
> in subprog_tail is entry_tc.
>
> Then, the infinite loop will be entry_tc -> subprog_tc -> entry_freplace
> -> subprog_tail --tailcall-> entry_tc, because tail_call_cnt in
> entry_freplace will count from zero for every time of entry_freplace
> execution. Kernel will panic:
>
> [ 15.310490] BUG: TASK stack guard page was hit at (____ptrval____)
> (stack is (____ptrval____)..(____ptrval____))
> [ 15.310490] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [ 15.310490] CPU: 1 PID: 89 Comm: test_progs Tainted: G OE
> 6.10.0-rc6-g026dcdae8d3e-dirty #72
> [ 15.310490] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX,
> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [ 15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [ 15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [ 15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [ 15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [ 15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [ 15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [ 15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [ 15.310490] FS: 00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [ 15.310490] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [ 15.310490] Call Trace:
> [ 15.310490] <#DF>
> [ 15.310490] ? die+0x36/0x90
> [ 15.310490] ? handle_stack_overflow+0x4d/0x60
> [ 15.310490] ? exc_double_fault+0x117/0x1a0
> [ 15.310490] ? asm_exc_double_fault+0x23/0x30
> [ 15.310490] ? bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [ 15.310490] </#DF>
> [ 15.310490] <TASK>
> [ 15.310490] bpf_prog_85781a698094722f_entry+0x4c/0x64
> [ 15.310490] bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [ 15.310490] ...
> [ 15.310490] bpf_prog_85781a698094722f_entry+0x4c/0x64
> [ 15.310490] bpf_prog_1c515f389a9059b4_entry2+0x19/0x1b
> [ 15.310490] bpf_test_run+0x210/0x370
> [ 15.310490] ? bpf_test_run+0x128/0x370
> [ 15.310490] bpf_prog_test_run_skb+0x388/0x7a0
> [ 15.310490] __sys_bpf+0xdbf/0x2c40
> [ 15.310490] ? clockevents_program_event+0x52/0xf0
> [ 15.310490] ? lock_release+0xbf/0x290
> [ 15.310490] __x64_sys_bpf+0x1e/0x30
> [ 15.310490] do_syscall_64+0x68/0x140
> [ 15.310490] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 15.310490] RIP: 0033:0x7f133b52725d
> [ 15.310490] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
> [ 15.310490] RSP: 002b:00007ffddbc10258 EFLAGS: 00000206 ORIG_RAX:
> 0000000000000141
> [ 15.310490] RAX: ffffffffffffffda RBX: 00007ffddbc10828 RCX:
> 00007f133b52725d
> [ 15.310490] RDX: 0000000000000050 RSI: 00007ffddbc102a0 RDI:
> 000000000000000a
> [ 15.310490] RBP: 00007ffddbc10270 R08: 0000000000000000 R09:
> 00007ffddbc102a0
> [ 15.310490] R10: 0000000000000064 R11: 0000000000000206 R12:
> 0000000000000004
> [ 15.310490] R13: 0000000000000000 R14: 0000558ec4c24890 R15:
> 00007f133b6ed000
> [ 15.310490] </TASK>
> [ 15.310490] Modules linked in: bpf_testmod(OE)
> [ 15.310490] ---[ end trace 0000000000000000 ]---
> [ 15.310490] RIP: 0010:bpf_prog_3a140cef239a4b4f_subprog_tail+0x14/0x53
> [ 15.310490] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 0f 1f 00 55 48 89 e5 f3 0f 1e
> fa <50> 50 53 41 55 48 89 fb 49 bd 00 2a 46 82 98 9c ff ff 48 89 df 4c
> [ 15.310490] RSP: 0018:ffffb500c0aa0000 EFLAGS: 00000202
> [ 15.310490] RAX: ffffb500c0aa0028 RBX: ffff9c98808b7e00 RCX:
> 0000000000008cb5
> [ 15.310490] RDX: 0000000000000000 RSI: ffff9c9882462a00 RDI:
> ffff9c98808b7e00
> [ 15.310490] RBP: ffffb500c0aa0000 R08: 0000000000000000 R09:
> 0000000000000000
> [ 15.310490] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffffb500c01af000
> [ 15.310490] R13: ffffb500c01cd000 R14: 0000000000000000 R15:
> 0000000000000000
> [ 15.310490] FS: 00007f133b665140(0000) GS:ffff9c98bbd00000(0000)
> knlGS:0000000000000000
> [ 15.310490] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 15.310490] CR2: ffffb500c0a9fff8 CR3: 0000000102478000 CR4:
> 00000000000006f0
> [ 15.310490] Kernel panic - not syncing: Fatal exception in interrupt
> [ 15.310490] Kernel Offset: 0x30000000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> This patch fixes the issue by initializing tail_call_cnt at the prologue
> of entry_tc.
>
> Next, when call subprog_tc, the tail_call_cnt_ptr is propagated to
> subprog_tc by rax.
> Next, when jump to entry_freplace, the tail_call_cnt_ptr will be reused to
> count tailcall in freplace prog.
> Next, when call subprog_tail, the tail_call_cnt_ptr is propagated to
> subprog_tail by rax.
> Next, while tail calling to entry_tc, the tail_call_cnt on the stack of
> entry_tc increments via the tail_call_cnt_ptr.
>
> The whole procedure shows as the following JITed prog dumping.
>
> bpftool p d j n entry_tc:
>
> int entry_tc(struct __sk_buff * skb):
> bpf_prog_1c515f389a9059b4_entry_tc:
> ; return subprog_tc(skb);
> 0: endbr64
> 4: xorq %rax, %rax ;; rax = 0 (tail_call_cnt)
> 7: nopl (%rax,%rax)
> c: pushq %rbp
> d: movq %rsp, %rbp
> 10: endbr64
> 14: cmpq $33, %rax ;; if rax > 33, rax = tcc_ptr
> 18: ja 0x20 ;; if rax > 33 goto 0x20 ---+
> 1a: pushq %rax ;; [rbp - 8] = rax = 0 |
> 1b: movq %rsp, %rax ;; rax = rbp - 8 |
> 1e: jmp 0x21 ;; ---------+ |
> 20: pushq %rax ;; <--------|---------------+
> 21: pushq %rax ;; <--------+ [rbp - 16] = rax
> 22: movq -16(%rbp), %rax ;; rax = tcc_ptr
> 29: callq 0x70 ;; call subprog_tc()
> ; return subprog_tc(skb);
> 2e: leave
> 2f: retq
>
> int subprog_tc(struct __sk_buff * skb):
> bpf_prog_1e8f76e2374a0607_subprog_tc:
> ; return skb->len * 2;
> 0: endbr64
> 4: nopl (%rax) ;; do not touch tail_call_cnt
> 7: jmp 0x108 ;; jump to entry_freplace()
> c: pushq %rbp
> d: movq %rsp, %rbp
> 10: endbr64
> 14: pushq %rax
> 15: pushq %rax
> 16: movl 112(%rdi), %eax
> ; return skb->len * 2;
> 19: shll %eax
> ; return skb->len * 2;
> 1b: leave
> 1c: retq
>
> bpftool p d j n entry_freplace:
>
> int entry_freplace(struct __sk_buff * skb):
> bpf_prog_85781a698094722f_entry_freplace:
> ; int entry_freplace(struct __sk_buff *skb)
> 0: endbr64
> 4: nopl (%rax) ;; do not touch tail_call_cnt
> 7: nopl (%rax,%rax)
> c: pushq %rbp
> d: movq %rsp, %rbp
> 10: endbr64
> 14: cmpq $33, %rax ;; if rax > 33, rax = tcc_ptr
> 18: ja 0x20 ;; if rax > 33 goto 0x20 ---+
> 1a: pushq %rax ;; [rbp - 8] = rax = 0 |
> 1b: movq %rsp, %rax ;; rax = rbp - 8 |
> 1e: jmp 0x21 ;; ---------+ |
> 20: pushq %rax ;; <--------|---------------+
> 21: pushq %rax ;; <--------+ [rbp - 16] = rax
> 22: pushq %rbx ;; callee saved
> 23: pushq %r13 ;; callee saved
> 25: movq %rdi, %rbx ;; rbx = skb (callee saved)
> ; count++;
> 28: movabsq $-123406219759616, %r13
> 32: movl (%r13), %edi
> 36: addl $1, %edi
> 39: movl %edi, (%r13)
> ; subprog_tail(skb);
> 3d: movq %rbx, %rdi ;; rdi = skb
> 40: movq -16(%rbp), %rax ;; rax = tcc_ptr
> 47: callq 0x80 ;; call subprog_tail()
> ; subprog_tail(skb);
> 4c: movq %rbx, %rdi ;; rdi = skb
> 4f: movq -16(%rbp), %rax ;; rax = tcc_ptr
> 56: callq 0x80 ;; call subprog_tail()
> ; return count;
> 5b: movl (%r13), %eax
> ; return count;
> 5f: popq %r13
> 61: popq %rbx
> 62: leave
> 63: retq
>
> int subprog_tail(struct __sk_buff * skb):
> bpf_prog_3a140cef239a4b4f_subprog_tail:
> ; int subprog_tail(struct __sk_buff *skb)
> 0: endbr64
> 4: nopl (%rax) ;; do not touch tail_call_cnt
> 7: nopl (%rax,%rax)
> c: pushq %rbp
> d: movq %rsp, %rbp
> 10: endbr64
> 14: pushq %rax ;; [rbp - 8] = rax (tcc_ptr)
> 15: pushq %rax ;; [rbp - 16] = rax (tcc_ptr)
> 16: pushq %rbx ;; callee saved
> 17: pushq %r13 ;; callee saved
> 19: movq %rdi, %rbx ;; rbx = skb
> ; asm volatile("r1 = %[ctx]\n\t"
> 1c: movabsq $-128494642337280, %r13 ;; r13 = jmp_table
> 26: movq %rbx, %rdi ;; 1st arg, skb
> 29: movq %r13, %rsi ;; 2nd arg, jmp_table
> 2c: xorl %edx, %edx ;; 3rd arg, index = 0
> 2e: movq -16(%rbp), %rax ;; rax = [rbp - 16] (tcc_ptr)
> 35: cmpq $33, (%rax)
> 39: jae 0x4e ;; if *tcc_ptr >= 33 goto 0x4e --------+
> 3b: nopl (%rax,%rax) ;; jmp bypass, toggled by poking |
> 40: addq $1, (%rax) ;; (*tcc_ptr)++ |
> 44: popq %r13 ;; callee saved |
> 46: popq %rbx ;; callee saved |
> 47: popq %rax ;; undo rbp-16 push |
> 48: popq %rax ;; undo rbp-8 push |
> 49: jmp 0xfffffffffffffe18 ;; tail call target, toggled by poking |
> ; return 0; ;; |
> 4e: popq %r13 ;; restore callee saved <--------------+
> 50: popq %rbx ;; restore callee saved
> 51: leave
> 52: retq
>
> As a result, the tail_call_cnt is stored on the stack of entry_tc. And
> the tail_call_cnt_ptr is propagated between subprog_tc, entry_freplace,
> subprog_tail and entry_tc.
>
> Furthermore, trampoline is required to propagate
> tail_call_cnt/tail_call_cnt_ptr always, no matter whether there is
> tailcall at run time.
> So, it reuses trampoline flag "BIT(7)" to tell trampoline to propagate
> the tail_call_cnt/tail_call_cnt_ptr, as BPF_TRAMP_F_TAIL_CALL_CTX is not
> used by any other arch BPF JIT.
This change seem to be correct.
Could you please add an explanation somewhere why nop3/xor and nop5
are swapped in the prologue?
As far as I understand, this is done so that freplace program
would reuse xor generated for replaced program, is that right?
E.g. for tailcall_bpf2bpf_freplace test case disasm looks as follows:
--------------- entry_tc --------------
func #0:
0: f3 0f 1e fa endbr64
4: 48 31 c0 xorq %rax, %rax
7: 0f 1f 44 00 00 nopl (%rax,%rax)
c: 55 pushq %rbp
d: 48 89 e5 movq %rsp, %rbp
10: f3 0f 1e fa endbr64
...
------------ entry_freplace -----------
func #0:
0: f3 0f 1e fa endbr64
4: 0f 1f 00 nopl (%rax)
7: 0f 1f 44 00 00 nopl (%rax,%rax)
c: 55 pushq %rbp
d: 48 89 e5 movq %rsp, %rbp
...
So, if entry_freplace would be used to replace entry_tc instead
of subprog_tc, the disasm would change to:
--------------- entry_tc --------------
func #0:
0: f3 0f 1e fa endbr64
4: 48 31 c0 xorq %rax, %rax
7: 0f 1f 44 00 00 jmp <entry_freplace>
Thus reusing %rax initialization from entry_tc.
> However, the bad effect is that it requires initializing tail_call_cnt at
> prologue always no matter whether the prog is tail_call_reachable, because
> it is unable to confirm itself or its subprog[s] whether to be attached by
> freplace prog.
> And, when call subprog, tail_call_cnt_ptr is required to be propagated
> to subprog always.
This seems unfortunate.
I wonder if disallowing to freplace programs when
replacement.tail_call_reachable != replaced.tail_call_reachable
would be a better option?
[...]
next prev parent reply other threads:[~2024-08-27 10:37 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-25 13:09 [PATCH bpf-next 0/4] bpf: Fix tailcall infinite loop caused by freplace Leon Hwang
2024-08-25 13:09 ` [PATCH bpf-next 1/4] bpf, x64: " Leon Hwang
2024-08-27 10:37 ` Eduard Zingerman [this message]
2024-08-27 12:48 ` Leon Hwang
2024-08-27 20:50 ` Alexei Starovoitov
2024-08-28 2:36 ` Leon Hwang
2024-08-28 16:01 ` Alexei Starovoitov
2024-08-29 2:14 ` Leon Hwang
2024-09-02 10:19 ` Toke Høiland-Jørgensen
2024-09-02 16:33 ` Vincent Li
2024-08-25 13:09 ` [PATCH bpf-next 2/4] bpf, arm64: " Leon Hwang
2024-08-26 14:32 ` Xu Kuohai
2024-08-27 2:23 ` Leon Hwang
2024-08-30 7:37 ` Xu Kuohai
2024-08-30 9:08 ` Leon Hwang
2024-08-30 10:00 ` Xu Kuohai
2024-08-30 12:11 ` Leon Hwang
2024-08-30 16:03 ` Alexei Starovoitov
2024-09-05 9:13 ` Puranjay Mohan
2024-09-06 14:32 ` Leon Hwang
2024-09-06 15:24 ` Alexei Starovoitov
2024-09-07 7:03 ` Xu Kuohai
2024-08-25 13:09 ` [PATCH bpf-next 3/4] selftests/bpf: Add testcases for another tailcall infinite loop fixing Leon Hwang
2024-08-25 13:09 ` [PATCH bpf-next 4/4] selftests/bpf: Fix verifier tailcall jit selftest Leon Hwang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=699f5798e7d982baa2e6d4b6383ab6cd588ef5a9.camel@gmail.com \
--to=eddyz87@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=iii@linux.ibm.com \
--cc=kernel-patches-bot@fb.com \
--cc=leon.hwang@linux.dev \
--cc=martin.lau@kernel.org \
--cc=puranjay@kernel.org \
--cc=toke@redhat.com \
--cc=xukuohai@huaweicloud.com \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox