BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: yet another approach Was: [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack
Date: Tue, 8 Oct 2024 23:12:37 -0700	[thread overview]
Message-ID: <d1ae454a-4647-40a5-b48b-60cc140e51aa@linux.dev> (raw)
In-Reply-To: <CAADnVQ+tvGMFnEuZmKyXxJX25pL+G6X+9445Ct-RSU1sZ+57xw@mail.gmail.com>


On 10/8/24 3:10 PM, Alexei Starovoitov wrote:
> On Fri, Oct 4, 2024 at 7:03 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/4/24 12:52 PM, Alexei Starovoitov wrote:
>>> On Fri, Oct 4, 2024 at 12:28 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> On 10/3/24 10:22 PM, Yonghong Song wrote:
>>>>> On 10/3/24 3:32 PM, Alexei Starovoitov wrote:
>>>>>> On Thu, Oct 3, 2024 at 1:44 PM Yonghong Song
>>>>>> <yonghong.song@linux.dev> wrote:
>>>>>>>> Looks like the idea needs more thought.
>>>>>>>>
>>>>>>>> in_task_stack() won't recognize the private stack,
>>>>>>>> so it will look like stack overflow and double fault.
>>>>>>>>
>>>>>>>> do you have CONFIG_VMAP_STACK ?
>>>>>>> Yes, my above test runs fine withCONFIG_VMAP_STACK. Let me guard
>>>>>>> private stack support with
>>>>>>> CONFIG_VMAP_STACK for now. Not sure whether distributions enable
>>>>>>> CONFIG_VMAP_STACK or not.
>>>>>> Good! but I'm surprised it makes a difference.
>>>>> That only for the test case I tried. Now I tried the whole bpf selftests
>>>>> with CONFIG_VMAP_STACK on. There are still some failures. Some of them
>>>>> due to stack protector. I disabled stack protector and then those stack
>>>>> protector error gone. But some other errors show up like below:
>>>>>
>>>>> [   27.186581] kernel tried to execute NX-protected page - exploit
>>>>> attempt? (uid: 0)
>>>>> [   27.187480] BUG: unable to handle page fault for address:
>>>>> ffff888109572800
>>>>> [   27.188299] #PF: supervisor instruction fetch in kernel mode
>>>>> [   27.189085] #PF: error_code(0x0011) - permissions violation
>>>>>
>>>>> or
>>>>>
>>>>> [   27.736844] BUG: unable to handle page fault for address:
>>>>> 0000000080000000
>>>>> [   27.737759] #PF: supervisor instruction fetch in kernel mode
>>>>> [   27.738631] #PF: error_code(0x0010) - not-present page
>>>>> [   27.739455] PGD 0 P4D 0
>>>>> [   27.739818] Oops: Oops: 0010 [#1] PREEMPT SMP PTI
>>>>>
>>>>> ...
>>>>>
>>>>> Some further investigations are needed.
>>>> I found one failure case (with stackprotector disabled):
>>>>
>>>> [   20.032611] traps: PANIC: double fault, error_code: 0x0
>>>> [   20.032615] Oops: double fault: 0000 [#1] PREEMPT SMP PTI
>>>> [   20.032619] CPU: 0 UID: 0 PID: 1959 Comm: test_progs Tainted: G           OE      6.11.0-10576-g17baa0096769-dirty #1006
>>>> [   20.032623] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>>>> [   20.032624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>> [   20.032626] RIP: 0010:error_entry+0x17/0x140
>>>> [   20.032633] Code: ff 0f 01 f8 e9 56 fe ff ff 90 90 90 90 90 90 90 90 90 90 56 48 8b 74 24 08 48 89 7c 24 08 52 51 50 41 50 41 51 41 52 49
>>>> [   20.032635] RSP: 0018:ffffe8ffff400000 EFLAGS: 00010093
>>>> [   20.032637] RAX: ffffe8ffff4000a8 RBX: ffffe8ffff4000a8 RCX: ffffffff82201737
>>>> [   20.032639] RDX: 0000000000000000 RSI: ffffffff8220128d RDI: ffffe8ffff4000a8
>>>> [   20.032640] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>>> [   20.032641] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>>> [   20.032642] R13: 0000000000000000 R14: 000000000002ed80 R15: 0000000000000000
>>>> [   20.032643] FS:  00007f8a3a2006c0(0000) GS:ffff888237c00000(0000) knlGS:ffff888237c00000
>>>> [   20.032645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   20.032646] CR2: ffffe8ffff3ffff8 CR3: 0000000103580002 CR4: 0000000000370ef0
>>>> [   20.032649] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [   20.032650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [   20.032651] Call Trace:
>>>> [   20.032660]  <#DF>
>>>> [   20.032664]  ? __die_body+0xaf/0xc0
>>>> [   20.032667]  ? die+0x2f/0x50
>>>> [   20.032670]  ? exc_double_fault+0xbf/0xd0
>>>> [   20.032674]  ? asm_exc_double_fault+0x23/0x30
>>>> [   20.032678]  ? restore_regs_and_return_to_kernel+0x1b/0x1b
>>>> [   20.032681]  ? asm_exc_page_fault+0xd/0x30
>>>> [   20.032684]  ? error_entry+0x17/0x140
>>>> [   20.032687]  </#DF>
>>>>
>>>> The private stack for cpu 0:
>>>>      priv_stack_ptr cpu 0 = [ffffe8ffff434000, ffffe8ffff438000] (total 16KB)
>>>> That is, the top stack is ffffe8ffff438000 and the bottom stack is ffffe8ffff434000.
>>>>
>>>> During bpf execution, a softirq may happen, at that point,
>>>> stack pointer becomes:
>>>>       RSP: 0018:ffffe8ffff400000 (see above)
>>>> and there is a read/write (mostly write) to address
>>>>       CR2: ffffe8ffff3ffff8
>>>> And this may cause a fault.
>>>> After this fault, there are some further access and probably because
>>>> of invalid stack, double fault happens.
>>>>
>>>> So the quesiton is why RSP is reset to ffffe8ffff400000?
>>> 0x38000 bytes consumed by stack or rounded down?
>>> That's unlikely.
>>>
>>>> I have not figured out which code changed this? Maybe somebody can help?
>>> As Kumar said earlier pls share the patch. Link to github? or whichever.
>>>
>>> Double check that any kind of tail-call logic is not mixed with priv stack.
>> Here is the reproducer. Two attached files:
>> priv_stack.config: the config file to build the kernel
>> 0001-bpf-implement-private-stack.patch: the patch to apply to the top of bpf-next.
>>
>> The top bpf-next commit in my test:
>> commit 9502a7de5a61bec3bda841a830560c5d6d40ecac (origin/master, origin/HEAD, master)
>> Author: Mykyta Yatsenko <yatsenko@meta.com>
>> Date:   Tue Oct 1 00:15:22 2024 +0100
>>
>>       selftests/bpf: Emit top frequent code lines in veristat
>>
>>
>> I am using clang18 to build the kernel and selftests.
>> The build command line:
>>     make LLVM=1 -j
>>     make -C tools/testing/selftests/bpf LLVM=1 -j
>>
>> In qemu vm, tools/testing/selftests/bpf directory, run the following script:
>>
>> $ cat run.sh
>> cat /proc/sys/net/core/bpf_jit_limit
>> echo 796917760 > /proc/sys/net/core/bpf_jit_limit
>> # ./test_progs -n 339/4
>> ./test_progs -t task_local_storage/nodeadlock
>>
>> With private stack on by default, in my environment, booting will failure.
>> So by default, private stack is off.
>> In the above
>>     echo 796917760 > /proc/sys/net/core/bpf_jit_limit
>> intends to enable private stack.
> well, the patch is written in a way that
>    cat /proc/sys/net/core/bpf_jit_limit
> is enough to enable priv stack.
Ok. Good to know.

>
>> +       if (yhs && !is_subprog && !tail_call_reachable && bpf_prog->aux->priv_stack_ptr) {
>> +               EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8) + 16);
> this part is unnecessary.
> Just setting rsp to the top is enough to start doing things

This is a hack. I run out of ideas so added some hack above for debugging purpose.


> in a new stack, but I'm afraid this is a dead end.
> I've played with orc and frame pointer unwders.
> Both are not happy with our hack.
> get_stack_info() returns unknown, so any logic that needs to collect
> the stack gets into the loop that exhausts our broken stack and
> it eventually dies with NULL deref or double fault.
> The simplest way to repro the brokeness is with:
>
> @@ -255,6 +255,8 @@ BPF_CALL_5(bpf_task_storage_get, struct bpf_map *,
> map, struct task_struct *,
>          if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
>                  return (unsigned long)NULL;
>
> +       dump_stack();

Indeed this won't work and the stack trace cannot be properly
presented due to private stack range is outside the default
kernel range.

Without dump_stack(), there are some other unknown errors.

> +       return 0;
>
> (in addition to your patch that calls it from
> task_local_storage/nodeadlock test).
> It will loop like:
> [   17.708612] bad stack ffffe8fffd620000
> [   17.714759] CPU: 0 UID: 0 PID: 2186 Comm: test_progs Not tainted
> 6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
> [   17.715692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [   17.715692] Call Trace:
> [   17.715692] bad stack ffffe8fffd620000
> [   17.721814] CPU: 0 UID: 0 PID: 2182 Comm: test_progs Not tainted
> 6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
> [   17.722763] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [   17.722763] Call Trace:
> [   17.722763] bad stack ffffe8fffd620000
> [   17.728865] CPU: 0 UID: 0 PID: 2156 Comm: test_progs Not tainted
> 6.12.0-rc1-00162-g3fe52dc7912d-dirty #6365
> [   17.729826] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>
> To make it less broken it requires a bunch of work in
> arch/x86/kernel/dumpstack.c
> I tried to hack things up, but wasn't successful.
> Even if we do succeed eventually this is too risky.
> We don't have a strong reason to introduce another stack type
> along with all the delicate logic to recognize it based on rsp
> value alone. If our percpu stack is per prog the get_stack_info()
> logic gets complicated and that's not acceptable.
> We need to scrap this idea.
> Let's go back to push/pop r11 around calls :(

  parent reply	other threads:[~2024-10-09  6:12 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-26 23:45 [PATCH bpf-next v3 0/5] bpf: Support private stack for bpf progs Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 1/5] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 2/5] bpf: Collect stack depth information Yonghong Song
2024-09-30 14:42   ` Alexei Starovoitov
2024-09-30 16:23     ` Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 3/5] bpf: Mark each subprog with proper pstack states Yonghong Song
2024-09-30 14:49   ` Alexei Starovoitov
2024-09-30 16:26     ` Yonghong Song
2024-09-26 23:45 ` [PATCH bpf-next v3 4/5] bpf, x86: Add jit support for private stack Yonghong Song
2024-09-27  4:58   ` Leon Hwang
2024-09-27 15:24     ` Yonghong Song
2024-09-29  8:31   ` kernel test robot
2024-09-30 16:29     ` Yonghong Song
2024-09-29 13:02   ` kernel test robot
2024-09-30 16:31     ` Yonghong Song
2024-09-29 13:34   ` kernel test robot
2024-09-30 15:03   ` Alexei Starovoitov
2024-09-30 16:33     ` Yonghong Song
2024-10-01  4:31     ` Kumar Kartikeya Dwivedi
2024-10-01  4:37       ` Kumar Kartikeya Dwivedi
2024-10-01 18:49         ` Alexei Starovoitov
2024-10-01 19:53           ` yet another approach Was: " Alexei Starovoitov
2024-10-01 20:50             ` Kumar Kartikeya Dwivedi
2024-10-01 21:28               ` Alexei Starovoitov
2024-10-02  0:22                 ` Kumar Kartikeya Dwivedi
2024-10-02  1:26                   ` Alexei Starovoitov
2024-10-02  2:16                     ` Kumar Kartikeya Dwivedi
2024-10-02  6:28                       ` Yonghong Song
2024-10-02  6:48                         ` Yonghong Song
2024-10-03  6:17                     ` Yonghong Song
2024-10-03 13:39                       ` Kumar Kartikeya Dwivedi
2024-10-03 17:35                         ` Alexei Starovoitov
2024-10-03 18:53                           ` Yonghong Song
2024-10-03 20:44                           ` Yonghong Song
2024-10-03 20:47                             ` Kumar Kartikeya Dwivedi
2024-10-03 20:54                               ` Yonghong Song
2024-10-03 22:32                             ` Alexei Starovoitov
2024-10-04  5:22                               ` Yonghong Song
2024-10-04 19:27                                 ` Yonghong Song
2024-10-04 19:52                                   ` Alexei Starovoitov
2024-10-05  2:03                                     ` Yonghong Song
2024-10-08 22:10                                       ` Alexei Starovoitov
2024-10-09  2:06                                         ` Alexei Starovoitov
2024-10-09  6:31                                           ` Yonghong Song
2024-10-09 14:56                                             ` Alexei Starovoitov
2024-10-09 15:56                                               ` Yonghong Song
2024-10-09 16:36                                           ` Kumar Kartikeya Dwivedi
2024-10-09 16:38                                             ` Kumar Kartikeya Dwivedi
2024-10-09 17:37                                               ` Kumar Kartikeya Dwivedi
2024-10-09  6:12                                         ` Yonghong Song [this message]
2024-09-26 23:45 ` [PATCH bpf-next v3 5/5] selftests/bpf: Add private stack tests Yonghong Song
2024-09-30 13:40   ` Jiri Olsa
2024-09-30 15:05     ` Alexei Starovoitov
2024-09-30 16:35       ` Yonghong Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d1ae454a-4647-40a5-b48b-60cc140e51aa@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    --cc=memxor@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox