BPF List
From: Yonghong Song <yhs@meta.com>
To: Dave Marchevsky <davemarchevsky@fb.com>, bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 1/2] bpf, x86: Improve PROBE_MEM runtime load check
Date: Thu, 15 Dec 2022 15:38:18 -0800	[thread overview]
Message-ID: <2a558ee1-e86a-6cc8-1d03-26dc81980ec8@meta.com> (raw)
In-Reply-To: <20221213182726.325137-1-davemarchevsky@fb.com>



On 12/13/22 10:27 AM, Dave Marchevsky wrote:
> This patch rewrites the runtime PROBE_MEM check insns emitted by the BPF
> JIT in order to ensure load safety. The changes in the patch fix two
> issues with the previous logic and more generally improve size of
> emitted code. Paragraphs between this one and "FIX 1" below explain the
> purpose of the runtime check and examine the current implementation.
> 
> When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the
> address being loaded from is not necessarily valid. The BPF jit sets up
> exception handlers for each such load which catch page faults and 0 out
> the destination register.
> 
> Arbitrary register-relative loads can escape this exception handling
> mechanism. Specifically, a load like dst_reg = *(src_reg + off) will not
> trigger BPF exception handling if (src_reg + off) is outside of kernel
> address space, resulting in an uncaught page fault. A concrete example
> of such behavior is a program like:
> 
>    struct result {
>      char space[40];
>      long a;
>    };
> 
>    /* if err, returns ERR_PTR(-EINVAL) */
>    struct result *ptr = get_ptr_maybe_err();
>    long x = ptr->a;
> 
> If get_ptr_maybe_err returns ERR_PTR(-EINVAL) and the result isn't
> checked for err, 'ptr' will be (u64)-EINVAL, a number close to
> U64_MAX. The address of the ptr->a load will exceed U64_MAX and wrap
> over to a small positive u64, which will be in userspace and thus not
> covered by the BPF exception handling mechanism.
> 
> In order to prevent such loads from occurring, the BPF jit emits some
> instructions which do runtime checking of (src_reg + off) and skip the
> actual load if it's out of range. As an example, here are instructions
> emitted for a %rdi = *(%rdi + 0x10) PROBE_MEM load:
> 
>    72:   movabs $0x800000000010,%r11 --|
>    7c:   cmp    %r11,%rdi              |- 72 - 7f: Check 1
>    7f:    jb    0x000000000000008d   --|
>    81:   mov    %rdi,%r11             -----|
>    84:   add    $0x0000000000000010,%r11   |- 81-8b: Check 2
>    8b:   jnc    0x0000000000000091    -----|
>    8d:   xor    %edi,%edi             ---- 0 out dest
>    8f:   jmp    0x0000000000000095
>    91:   mov    0x10(%rdi),%rdi       ---- Actual load
>    95:
> 
> The JIT considers kernel address space to start at MAX_TASK_SIZE +
> PAGE_SIZE. Determining whether a load target is within kernel address
> space should be a simple check:
> 
>    (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE
> 
> But because there is only one spare register when the checking logic is
> emitted, this logic is split into two checks:
> 
>    Check 1: src_reg >= (MAX_TASK_SIZE + PAGE_SIZE - off)
>    Check 2: src_reg + off doesn't wrap over U64_MAX and result in small pos u64
> 
> Emitted insns implementing Checks 1 and 2 are annotated in the above
> example. Check 1 can be done with a single spare register since the
> source reg by definition is the left-hand-side of the inequality.
> Adding 'off' to both sides of Check 1's inequality yields the original
> inequality we want, so testing Check 1 is equivalent - except when
> src_reg + off wraps past U64_MAX. That is why Check 2 must actually
> perform the src_reg + off addition if Check 1 passes, again using the
> single spare reg.
> 
> FIX 1: The Check 1 inequality listed above is not what current code is
> doing. Current code is a bit more pessimistic, instead checking:
> 
>    src_reg >= (MAX_TASK_SIZE + PAGE_SIZE + abs(off))
> 
> The 0x800000000010 in the above example comes from this current check.
> If Check 1 were corrected to use the proper right-hand side, the value
> would be 0x7ffffffffff0. This patch changes the checking logic more
> broadly (FIX 2 below will elaborate), fixing this issue as a
> side-effect of the rewrite. Regardless, it's important to understand
> why Check 1 should have been using MAX_TASK_SIZE + PAGE_SIZE - off
> before proceeding.
> 
> FIX 2: Current code relies on a 'jnc' to determine whether src_reg + off
> addition wrapped over. For negative offsets this logic is incorrect.
> Consider Check 2 insns emitted when off = -0x10:
> 
>    81:   mov    %rdi,%r11
>    84:   add    $0xfffffffffffffff0,%r11
>    8b:   jnc    0x0000000000000091
> 
> 2's complement representation of -0x10 is a large positive u64. Any
> value of src_reg that passes Check 1 will result in carry flag being set
> after (src_reg + off) addition. So a load with any negative offset will
> always fail Check 2 at runtime and never do the actual load. This patch
> fixes the negative offset issue by rewriting both checks in order to not
> rely on carry flag.
> 
> The rewrite takes advantage of the fact that, while we only have one
> scratch reg to hold arbitrary values, we know the offset at JIT time.
> Thus we can use src_reg itself as a temporary scratch reg to hold
> src_reg + offset, since we can return it to its original value by
> later subtracting the offset. As a result we can directly check the
> original inequality we care about:
> 
>    (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE
> 
> For a load like %rdi = *(%rsi + -0x10), this results in emitted code:
> 
>    43:   movabs $0x800000000000,%r11
>    4d:   add    $0xfffffffffffffff0,%rsi --- src_reg += off
>    54:   cmp    %r11,%rsi                --- Check original inequality
>    57:   jae    0x000000000000005d
>    59:   xor    %edi,%edi
>    5b:   jmp    0x0000000000000061
>    5d:   mov    0x0(%rsi),%rdi           --- Actual load
>    61:   sub    $0xfffffffffffffff0,%rsi --- src_reg -= off
> 
> Note that the actual load is always done with offset 0, since previous
> insns have already done src_reg += off. Regardless of whether the new
> check succeeds or fails, insn 61 is always executed, returning src_reg
> to its original value.
> 
> Because the goal of these checks is to ensure that loaded-from address
> will be protected by BPF exception handler, the new check can safely
> ignore any wrapover from insn 4d. If such wrapped-over address passes
> insn 54 + 57's cmp-and-jmp it will have such protection so the load can
> proceed.
> 
> As an aside, since offset in above calculations comes from
> bpf_insn, it's a s16 and thus won't wrap under unless src_reg is
> an anomalously low address in user address space. But again, if such a
> wrapunder results in an address in kernelspace, it's fine for the
> purpose of this check.

Not sure how useful the above paragraph is. Are we talking about
'offset' wraparound here? If src_reg is indeed a very low address in
user space, the following compare
    (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE
should be false and everything is fine. What exactly does this
paragraph convey?

> 
> IMPROVEMENTS: The above improved logic is 8 insns vs the original
> logic's 9, and has 1 fewer jmp. The number of checking insns can be
> further reduced in common scenarios:
> 
> If src_reg == dst_reg, the actual load insn clobbers src_reg, so there
> is no original src_reg state for the sub insn following the load to
> restore, and it can be omitted. In fact it must be omitted, since it
> would otherwise incorrectly subtract off from the result of the load.
> So for src_reg == dst_reg, the JIT emits these insns:
> 
>    3c:   movabs $0x800000000000,%r11
>    46:   add    $0xfffffffffffffff0,%rdi
>    4d:   cmp    %r11,%rdi
>    50:   jae    0x0000000000000056
>    52:   xor    %edi,%edi
>    54:   jmp    0x000000000000005a
>    56:   mov    0x0(%rdi),%rdi
>    5a:
> 
> The only difference from the larger example is the omitted sub, which
> would have been insn 5a in this example.
> 
> If offset == 0, we can similarly omit the sub as in the previous case,
> since there is nothing added that needs subtracting. For the same
> reason we can omit the addition as well, resulting in the JIT emitting
> these insns:
> 
>    46:   movabs $0x800000000000,%r11
>    4d:   cmp    %r11,%rdi
>    50:   jae    0x0000000000000056
>    52:   xor    %edi,%edi
>    54:   jmp    0x000000000000005a
>    56:   mov    0x0(%rdi),%rdi
>    5a:
> 
> Although the above example also has src_reg == dst_reg, the same
> offset == 0 optimization is valid to apply if src_reg != dst_reg.
> 
> To summarize the improvements in emitted insn count for the
> check-and-load:
> 
> BEFORE:                8 check insns, 3 jmps
> AFTER (general case):  7 check insns, 2 jmps (12.5% fewer insns, 33% fewer jmps)
> AFTER (src == dst):    6 check insns, 2 jmps (25% fewer insns)
> AFTER (offset == 0):   5 check insns, 2 jmps (37.5% fewer insns)
> 
> (Above counts don't include the 1 load insn, just checking around it)
> 
> Based on BPF bytecode + JITted x86 insn I saw while experimenting with
> these improvements, I expect the src_reg == dst_reg case to occur most
> often, followed by offset == 0, then the general case.
> 
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

Thread overview:
2022-12-13 18:27 [PATCH bpf-next 1/2] bpf, x86: Improve PROBE_MEM runtime load check Dave Marchevsky
2022-12-13 18:27 ` [PATCH bpf-next 2/2] selftests/bpf: Add verifier test exercising jit PROBE_MEM logic Dave Marchevsky
2022-12-16  3:34   ` Yonghong Song
2022-12-16  4:39     ` Dave Marchevsky
2022-12-15 23:38 ` Yonghong Song [this message]
