All of lore.kernel.org
 help / color / mirror / Atom feed
From: Puranjay Mohan <puranjay12@gmail.com>
To: Ilya Leoshkevich <iii@linux.ibm.com>,
	"David S. Miller" <davem@davemloft.net>,
	David Ahern <dsahern@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	"Eduard Zingerman" <eddyz87@gmail.com>,
	Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH bpf v3] bpf: verifier: prevent userspace memory access
Date: Thu, 21 Mar 2024 12:30:50 +0000	[thread overview]
Message-ID: <mb61p4jczix6t.fsf@gmail.com> (raw)
In-Reply-To: <2ef29d6b7ed800631f228ea41f27c0242e96f941.camel@linux.ibm.com>

Ilya Leoshkevich <iii@linux.ibm.com> writes:

> On Thu, 2024-03-21 at 12:08 +0000, Puranjay Mohan wrote:
>> With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer.
>> To
>> thwart invalid memory accesses, the JITs add an exception table entry
>> for all such accesses. But in case the src_reg + offset overflows and
>> turns into a userspace address, the BPF program might read that
>> memory if
>> the user has mapped it.
>> 
>> There are architectural features that prevent the kernel from
>> accessing
>> userspace memory, like Privileged Access Never (PAN) on ARM64,
>> Supervisor Mode Access Prevention (SMAP) on x86-64, Supervisor User
>> Memory access (SUM) on RISC-V, etc. But BPF should not rely on the
>> existence of these features.
>> 
>> Make the verifier add guard instructions around such memory accesses
>> and
>> skip the load if the address falls into the userspace region.
>> 
>> The JITs need to implement bpf_arch_uaddress_limit() to define where
>> the userspace addresses end for that architecture or TASK_SIZE is
>> taken
>> as default.
>> 
>> The implementation is as follows:
>> 
>> REG_AX =  SRC_REG
>> if(offset)
>> 	REG_AX += offset;
>> REG_AX >>= 32;
>> if (REG_AX <= (uaddress_limit >> 32))
>> 	DST_REG = 0;
>> else
>> 	DST_REG = *(size *)(SRC_REG + offset);
>> 
>> Comparing just the upper 32 bits of the load address with the upper
>> 32 bits of uaddress_limit implies that the values are being aligned
>> down
>> to a 4GB boundary before comparison.
>> 
>> The above means that all loads with address <= uaddress_limit + 4GB
>> are
>> skipped. This is acceptable because there is a large hole (much
>> larger
>> than 4GB) between userspace and kernel space memory, therefore a
>> correctly functioning BPF program should not access this 4GB memory
>> above the userspace.
>> 
>> Let's analyze what this patch does to the following fentry program
>> dereferencing an untrusted pointer:
>> 
>>   SEC("fentry/tcp_v4_connect")
>>   int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
>>   {
>>                 *(volatile long *)sk;
>>                 return 0;
>>   }
>> 
>>     BPF Program before              |           BPF Program after
>>     ------------------              |           -----------------
>> 
>>   0: (79) r1 = *(u64 *)(r1 +0)          0: (79) r1 = *(u64 *)(r1 +0)
>>   -------------------------------------------------------------------
>> ----
>>   1: (79) r1 = *(u64 *)(r1 +0) --\      1: (bf) r11 = r1
>>   ----------------------------\   \     2: (77) r11 >>= 32
>>   2: (b7) r0 = 0               \   \    3: (b5) if r11 <= 0x8000 goto
>> pc+2
>>   3: (95) exit                  \   \-> 4: (79) r1 = *(u64 *)(r1 +0)
>>                                  \      5: (05) goto pc+1
>>                                   \     6: (b7) r1 = 0
>>                                    \---------------------------------
>> -----
>>                                         7: (b7) r0 = 0
>>                                         8: (95) exit
>> 
>> As you can see from above, in the best case (off=0), 5 extra
>> instructions
>> are emitted.
>> 
>> Now, we analyse the same program after it has gone through the JITs
>> of
>> X86-64, ARM64, and RISC-V architectures. We follow the single load
>> instruction that has the untrusted pointer and see what
>> instrumentation
>> has been added around it.
>> 
>>                                 x86-64 JIT
>>                                 ==========
>>      JIT's Instrumentation                  Verifier's
>> Instrumentation
>>           (upstream)                               (This patch)
>>      ---------------------                  -------------------------
>> -
>> 
>>    0:   nopl   0x0(%rax,%rax,1)             0:   nopl  
>> 0x0(%rax,%rax,1)
>>    5:   xchg   %ax,%ax                      5:   xchg   %ax,%ax
>>    7:   push   %rbp                         7:   push   %rbp
>>    8:   mov    %rsp,%rbp                    8:   mov    %rsp,%rbp
>>    b:   mov    0x0(%rdi),%rdi               b:   mov   
>> 0x0(%rdi),%rdi
>>   -------------------------------------------------------------------
>> -----
>>    f:   movabs $0x800000000000,%r11         f:   mov    %rdi,%r10
>>   19:   cmp    %r11,%rdi                   12:   shr    $0x20,%r10
>>   1c:   jb     0x000000000000002a          16:   cmp    $0x8000,%r10
>>   1e:   mov    %rdi,%r11                   1d:   jbe   
>> 0x0000000000000025
>>   21:   add    $0x0,%r11              /--> 1f:   mov   
>> 0x0(%rdi),%rdi
>>   28:   jae    0x000000000000002e    /     23:   jmp   
>> 0x0000000000000027
>>   2a:   xor    %edi,%edi            /      25:   xor    %edi,%edi
>>   2c:   jmp    0x0000000000000032  / /-------------------------------
>> -----
>>   2e:   mov    0x0(%rdi),%rdi  ---/ /      27:   xor    %eax,%eax
>>   ---------------------------------/       29:   leave
>>   32:   xor    %eax,%eax                   2a:   ret
>>   34:   leave
>>   35:   ret
>> 
>> The x86-64 JIT already emits some instructions to protect against
>> user
>> memory access. The implementation in this patch leads to a smaller
>> number of instructions being emitted. In the worst case the JIT will
>> emit 9 extra instructions and this patch decreases it to 7.
>> 
>>                                   ARM64 JIT
>>                                   =========
>> 
>>         No Intrumentation                       Verifier's
>> Instrumentation
>>            (upstream)                                  (This patch)
>>         -----------------                       ---------------------
>> -----
>> 
>>    0:   add     x9, x30, #0x0                0:   add     x9, x30,
>> #0x0
>>    4:   nop                                  4:   nop
>>    8:   paciasp                              8:   paciasp
>>    c:   stp     x29, x30, [sp, #-16]!        c:   stp     x29, x30,
>> [sp, #-16]!
>>   10:   mov     x29, sp                     10:   mov     x29, sp
>>   14:   stp     x19, x20, [sp, #-16]!       14:   stp     x19, x20,
>> [sp, #-16]!
>>   18:   stp     x21, x22, [sp, #-16]!       18:   stp     x21, x22,
>> [sp, #-16]!
>>   1c:   stp     x25, x26, [sp, #-16]!       1c:   stp     x25, x26,
>> [sp, #-16]!
>>   20:   stp     x27, x28, [sp, #-16]!       20:   stp     x27, x28,
>> [sp, #-16]!
>>   24:   mov     x25, sp                     24:   mov     x25, sp
>>   28:   mov     x26, #0x0                   28:   mov     x26, #0x0
>>   2c:   sub     x27, x25, #0x0              2c:   sub     x27, x25,
>> #0x0
>>   30:   sub     sp, sp, #0x0                30:   sub     sp, sp,
>> #0x0
>>   34:   ldr     x0, [x0]                    34:   ldr     x0, [x0]
>> ---------------------------------------------------------------------
>> -----------
>>   38:   ldr     x0, [x0] ----------\        38:   add     x9, x0,
>> #0x0
>> -----------------------------------\\       3c:   lsr     x9, x9, #32
>>   3c:   mov     x7, #0x0            \\      40:   cmp     x9, #0x10,
>> lsl #12
>>   40:   mov     sp, sp               \\     44:   b.ls   
>> 0x0000000000000050
>>   44:   ldp     x27, x28, [sp], #16   \\--> 48:   ldr     x0, [x0]
>>   48:   ldp     x25, x26, [sp], #16    \    4c:   b      
>> 0x0000000000000054
>>   4c:   ldp     x21, x22, [sp], #16     \   50:   mov     x0, #0x0
>>   50:   ldp     x19, x20, [sp], #16      \---------------------------
>> ------------
>>   54:   ldp     x29, x30, [sp], #16         54:   mov     x7, #0x0
>>   58:   add     x0, x7, #0x0                58:   mov     sp, sp
>>   5c:   autiasp                             5c:   ldp     x27, x28,
>> [sp], #16
>>   60:   ret                                 60:   ldp     x25, x26,
>> [sp], #16
>>   64:   nop                                 64:   ldp     x21, x22,
>> [sp], #16
>>   68:   ldr     x10, 0x0000000000000070     68:   ldp     x19, x20,
>> [sp], #16
>>   6c:   br      x10                         6c:   ldp     x29, x30,
>> [sp], #16
>>                                             70:   add     x0, x7,
>> #0x0
>>                                             74:   autiasp
>>                                             78:   ret
>>                                             7c:   nop
>>                                             80:   ldr     x10,
>> 0x0000000000000088
>>                                             84:   br      x10
>> 
>> There are 6 extra instructions added in ARM64 in the best case. This
>> will
>> become 7 in the worst case (off != 0).
>> 
>>                            RISC-V JIT (RISCV_ISA_C Disabled)
>>                            ==========
>> 
>>         No Intrumentation           Verifier's Instrumentation
>>            (upstream)                      (This patch)
>>         -----------------           --------------------------
>> 
>>    0:   nop                            0:   nop
>>    4:   nop                            4:   nop
>>    8:   li      a6, 33                 8:   li      a6, 33
>>    c:   addi    sp, sp, -16            c:   addi    sp, sp, -16
>>   10:   sd      s0, 8(sp)             10:   sd      s0, 8(sp)
>>   14:   addi    s0, sp, 16            14:   addi    s0, sp, 16
>>   18:   ld      a0, 0(a0)             18:   ld      a0, 0(a0)
>> ---------------------------------------------------------------
>>   1c:   ld      a0, 0(a0) --\         1c:   mv      t0, a0
>> --------------------------\  \        20:   srli    t0, t0, 32
>>   20:   li      a5, 0      \  \       24:   lui     t1, 4096
>>   24:   ld      s0, 8(sp)   \  \      28:   sext.w  t1, t1
>>   28:   addi    sp, sp, 16   \  \     2c:   bgeu    t1, t0, 12
>>   2c:   sext.w  a0, a5        \  \--> 30:   ld      a0, 0(a0)
>>   30:   ret                    \      34:   j       8
>>                                 \     38:   li      a0, 0
>>                                  \------------------------------
>>                                       3c:   li      a5, 0
>>                                       40:   ld      s0, 8(sp)
>>                                       44:   addi    sp, sp, 16
>>                                       48:   sext.w  a0, a5
>>                                       4c:   ret
>> 
>> There are 7 extra instructions added in RISC-V.
>> 
>> Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables")
>> Reported-by: Breno Leitao <leitao@debian.org>
>> Suggested-by: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
>> ---
>> V2:
>> https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@gmail.com/
>> Changes in V3:
>> - Return 0 from bpf_arch_uaddress_limit() in disabled case because it
>>   returns u64.
>> - Modify the check in verifier to no do instrumentation when
>> uaddress_limit
>>   is 0.
>> 
>> V1:
>> https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@gmail.com/
>> Changes in V2:
>> - Disable this feature on s390x.
>> ---
>>  arch/s390/net/bpf_jit_comp.c |  5 +++
>>  arch/x86/net/bpf_jit_comp.c  | 72 ++++------------------------------
>> --
>>  include/linux/filter.h       |  1 +
>>  kernel/bpf/core.c            |  9 +++++
>>  kernel/bpf/verifier.c        | 30 +++++++++++++++
>>  5 files changed, 53 insertions(+), 64 deletions(-)
>
> [...]
>  
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 5aacb1d3c4cc..c131bee33ac3 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -2958,6 +2958,15 @@ bool __weak bpf_jit_supports_arena(void)
>>  	return false;
>>  }
>>  
>> +u64 __weak bpf_arch_uaddress_limit(void)
>> +{
>> +#ifdef CONFIG_64BIT
>> +	return TASK_SIZE;
>> +#else
>> +	return 0;
>> +#endif
>> +}
>> +
>
> How about the following?
>
> #if defined(CONFIG_64BIT) && \
>     defined(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE)
>
> Then we won't need to do anything for s390x explicitly.`

Thanks for the suggestion, I will use this in v4.

      reply	other threads:[~2024-03-21 12:30 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-21 12:08 [PATCH bpf v3] bpf: verifier: prevent userspace memory access Puranjay Mohan
2024-03-21 12:15 ` Ilya Leoshkevich
2024-03-21 12:30   ` Puranjay Mohan [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=mb61p4jczix6t.fsf@gmail.com \
    --to=puranjay12@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bp@alien8.de \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=haoluo@google.com \
    --cc=hpa@zytor.com \
    --cc=iii@linux.ibm.com \
    --cc=jean-philippe@linaro.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.