From: Daniel Borkmann <dborkman@redhat.com>
To: davem@davemloft.net
Cc: ast@plumgrid.com, netdev@vger.kernel.org
Subject: [PATCH net-next v4 9/9] doc: filter: extend BPF documentation to document new internals
Date: Fri, 28 Mar 2014 18:58:26 +0100 [thread overview]
Message-ID: <1396029506-16776-10-git-send-email-dborkman@redhat.com> (raw)
In-Reply-To: <1396029506-16776-1-git-send-email-dborkman@redhat.com>
From: Alexei Starovoitov <ast@plumgrid.com>
Further extend the current BPF documentation to document new BPF
engine internals. Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
Documentation/networking/filter.txt | 125 ++++++++++++++++++++++++++++++++++++
1 file changed, 125 insertions(+)
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d..81f940f 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
toolchain for developing and testing the kernel's JIT compiler.
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different BPF instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later).
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized BPF code through
+a BPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into BPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> BPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the extended interpreter. For in-kernel handlers, this all works
+transparently by using sk_unattached_filter_create() for setting up the
+filter, resp. sk_unattached_filter_destroy() for destroying it. The macro
+SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
+run the filter. 'filter' is a pointer to struct sk_filter that we got from
+sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer).
+All constraints and restrictions from sk_chk_filter() apply before a
+conversion to the new layout is being done behind the scenes!
+
+Currently, for JITing, the user BPF format is being used and current BPF JIT
+compilers reused whenever possible. In other words, we do not (yet!) perform
+a JIT compilation in the new layout, however, future work will successively
+migrate traditional JIT compilers into the new instruction format as well, so
+that they will profit from the very same benefits. Thus, when speaking about
+JIT in the following, a JIT compiler (TBD) for the new instruction format is
+meant in this context.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+ The old format had two registers A and X, and a hidden frame pointer. The
+ new layout extends this to be 10 internal registers and a read-only frame
+ pointer. Since 64-bit CPUs are passing arguments to functions via registers
+ the number of args from BPF program to in-kernel function is restricted
+ to 5 and one register is used to accept return value from an in-kernel
+ function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+ registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+ Therefore, BPF calling convention is defined as:
+
+ * R0 - return value from in-kernel function
+ * R1 - R5 - arguments from BPF program to in-kernel function
+ * R6 - R9 - callee saved registers that in-kernel function will preserve
+ * R10 - read-only frame pointer to access stack
+
+ Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
+ etc, and BPF calling convention maps directly to ABIs used by the kernel on
+ 64-bit architectures.
+
+ On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+ and may let more complex programs to be interpreted.
+
+ R0 - R5 are scratch registers and BPF program needs spill/fill them if
+ necessary across calls. Note that there is only one BPF program (== one BPF
+ main routine) and it cannot call other BPF functions, it can only call
+ predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+ Still, the semantics of the original 32-bit ALU operations are preserved
+ via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
+ subregisters that zero-extend into 64-bit if they are being written to.
+ That behavior maps directly to x86_64 and arm64 subregister definition, but
+ makes other JITs more difficult.
+
+ 32-bit architectures run 64-bit internal BPF programs via interpreter.
+ Their JITs may convert BPF programs that only use 32-bit subregisters into
+ native instruction set and let the rest being interpreted.
+
+ Operation is 64-bit, because on 64-bit architectures, pointers are also
+ 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+ so 32-bit BPF registers would otherwise require to define register-pair
+ ABI, thus, there won't be able to use a direct BPF register to HW register
+ mapping and JIT would need to do combine/split/move operations for every
+ register in and out of the function, which is complex, bug prone and slow.
+ Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+ While the original design has constructs such as "if (cond) jump_true;
+ else jump_false;", they are being replaced into alternative constructs like
+ "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+ calls from/to other kernel functions:
+
+ After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
+ return type of the function. Since R6 - R9 are callee saved, their state is
+ preserved across the call.
+
+Also in the new design, BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic,
+its content is defined by a specific use case. For seccomp register R1 points
+to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program, that is translated internally consists of the following elements:
+
+ op:16, jt:8, jf:8, k:32 ==> op:8, a_reg:4, x_reg:4, off:16, imm:32
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
Misc
----
@@ -561,3 +685,4 @@ the underlying architecture.
Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
--
1.7.11.7
next prev parent reply other threads:[~2014-03-28 17:58 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-28 17:58 [PATCH net-next v4 0/9] BPF updates Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 1/9] net: filter: add jited flag to indicate jit compiled filters Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 2/9] net: filter: keep original BPF program around Daniel Borkmann
2014-09-12 3:27 ` Eric Dumazet
2014-09-12 3:51 ` Alexei Starovoitov
2014-09-12 6:09 ` Daniel Borkmann
2014-09-13 21:05 ` David Miller
2014-03-28 17:58 ` [PATCH net-next v4 3/9] net: filter: move filter accounting to filter core Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 4/9] net: ptp: use sk_unattached_filter_create() for BPF Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 5/9] net: ptp: do not reimplement PTP/BPF classifier Daniel Borkmann
2014-03-31 9:13 ` Richard Cochran
2014-03-31 20:37 ` Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 6/9] net: ppp: use sk_unattached_filter api Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 7/9] net: isdn: " Daniel Borkmann
2014-03-28 17:58 ` [PATCH net-next v4 8/9] net: filter: rework/optimize internal BPF interpreter's instruction set Daniel Borkmann
2014-03-28 17:58 ` Daniel Borkmann [this message]
2014-03-31 4:46 ` [PATCH net-next v4 0/9] BPF updates David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1396029506-16776-10-git-send-email-dborkman@redhat.com \
--to=dborkman@redhat.com \
--cc=ast@plumgrid.com \
--cc=davem@davemloft.net \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).