Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel Borkmann <dborkman@redhat.com>
To: Alexei Starovoitov <ast@plumgrid.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Ingo Molnar <mingo@kernel.org>, Will Drewry <wad@chromium.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Hagen Paul Pfeifer <hagen@jauu.net>,
	Jesse Gross <jesse@nicira.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Tom Zanussi <tom.zanussi@linux.intel.com>,
	Jovi Zhangwei <jovi.zhangwei@gmail.com>,
	Eric Dumazet <edumazet@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Arnaldo Carvalho de Melo <acme@infradead.org>,
	Pekka Enberg <penberg@iki.fi>,
	Arjan van de Ven <arjan@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
Date: Tue, 04 Mar 2014 10:59:42 +0100	[thread overview]
Message-ID: <5315A40E.6010209@redhat.com> (raw)
In-Reply-To: <1393910304-4004-2-git-send-email-ast@plumgrid.com>

On 03/04/2014 06:18 AM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> - from 2 to 10 registers
>    Original BPF has two registers (A and X) and hidden frame pointer.
>    Extended BPF has ten registers and read-only frame pointer.
> - from 32-bit registers to 64-bit registers
>    semantics of old 32-bit ALU operations are preserved via 32-bit
>    subregisters
> - if (cond) jump_true; else jump_false;
>    old BPF insns are replaced with:
>    if (cond) jump_true; /* else fallthrough */
> - adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced with
>    up to 512 bytes of multi-use stack space
> - introduces bpf_call insn and register passing convention for zero
>    overhead calls from/to other kernel functions (not part of this patch)
> - adds arithmetic right shift insn
> - adds swab32/swab64 insns
> - adds atomic_add insn
> - old tax/txa insns are replaced with 'mov dst,src' insn
>
> Extended BPF is designed to be JITed with one to one mapping, which
> allows GCC/LLVM backends to generate optimized BPF code that performs
> almost as fast as natively compiled code
>
> sk_convert_filter() remaps old style insns into extended:
> 'sock_filter' instructions are remapped on the fly to
> 'sock_filter_ext' extended instructions when
> sysctl net.core.bpf_ext_enable=1
>
> Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
>   if (bpf_ext_enable) {
>      convert to new
>      sk_chk_filter() - check old bpf
>      use sk_run_filter_ext() - new interpreter
>   } else {
>      sk_chk_filter() - check old bpf
>      if (bpf_jit_enable)
>          use old jit
>      else
>          use sk_run_filter() - old interpreter
>   }
>
> sk_run_filter_ext() interpreter is noticeably faster
> than sk_run_filter() for two reasons:
>
> 1.fall-through jumps
>    Old BPF jump instructions are forced to go either 'true' or 'false'
>    branch which causes branch-miss penalty.
>    Extended BPF jump instructions have one branch and fall-through,
>    which fit CPU branch predictor logic better.
>    'perf stat' shows drastic difference for branch-misses.
>
> 2.jump-threaded implementation of interpreter vs switch statement
>    Instead of single tablejump at the top of 'switch' statement, GCC will
>    generate multiple tablejump instructions, which helps CPU branch predictor
>
> Performance of two BPF filters generated by libpcap was measured
> on x86_64, i386 and arm32.
>
> fprog #1 is taken from Documentation/networking/filter.txt:
> tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>     ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Other libpcap programs have similar performance differences.
>
> Raw performance data from BPF micro-benchmark:
> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
> time in nsec per call, smaller is better
> --x86_64--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF     90       101       192       202
> ext BPF     31        71       47         97
> old BPF jit 12        34       17         44
> ext BPF jit TBD
>
> --i386--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF    107        136      227       252
> ext BPF     40        119       69       172
>
> --arm32--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF    202        300      475       540
> ext BPF    139        270      296       470
> old BPF jit 26        182       37       202
> new BPF jit TBD
>
> Tested with trinify BPF fuzzer
>
> Future work:
>
> 0. seccomp
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new programs
>     can be loaded through old sk_attach_filter() and sk_unattached_filter_create()
>     interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Looks great, imho, some comments/questions inline:

Nit: subject line of your patches should be, e.g.

  "filter: add Extended BPF interpreter and converter"
  "doc: filter: add Extended BPF documentation"
  ...

so first "<subsystem>: <summary phrase>".

> ---
>   include/linux/filter.h      |    8 +-
>   include/linux/netdevice.h   |    1 +
>   include/uapi/linux/filter.h |   34 +-
>   net/core/filter.c           |  802 ++++++++++++++++++++++++++++++++++++++++++-
>   net/core/sysctl_net_core.c  |    7 +
>   5 files changed, 830 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index e568c8ef896b..0e84ff6e991b 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -52,7 +52,13 @@ extern int sk_detach_filter(struct sock *sk);
>   extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
>   extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
> +/* function remaps 'sock_filter' insns to 'sock_filter_ext' insns */
> +int sk_convert_filter(struct sock_filter *old_prog, int len,
> +		      struct sock_filter_ext *new_prog,	int *p_new_len);
> +/* execute extended bpf program */

I think this and the above comment can be omitted, as both have a kernel doc
in its implementation in net/core/filter.c that is more precise.

...
> +struct sock_filter_ext {
> +	__u8	code;    /* opcode */
> +	__u8    a_reg:4; /* dest register */
> +	__u8    x_reg:4; /* source register */
> +	__s16	off;     /* signed offset */
> +	__s32	imm;     /* signed immediate constant */
> +};
> +
>   struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   	unsigned short		len;	/* Number of filter blocks */
>   	struct sock_filter __user *filter;
> @@ -45,12 +54,15 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_JMP         0x05
>   #define         BPF_RET         0x06
>   #define         BPF_MISC        0x07
> +#define         BPF_ALU64       0x07
> +
>

Please do not add empty newline above.

>   /* ld/ldx fields */
>   #define BPF_SIZE(code)  ((code) & 0x18)
>   #define         BPF_W           0x00
>   #define         BPF_H           0x08
>   #define         BPF_B           0x10
...
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ad30d626a5bd..1494421486b7 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1,5 +1,6 @@
>   /*
>    * Linux Socket Filter - Kernel level socket filtering
> + * Extended BPF is Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>    *
>    * Author:
>    *     Jay Schulist <jschlst@samba.org>
> @@ -40,6 +41,8 @@
>   #include <linux/seccomp.h>
>   #include <linux/if_vlan.h>
>
> +int bpf_ext_enable __read_mostly;
> +
>   /* No hurry in this branch
>    *
>    * Exported for the bpf jit load helper.
> @@ -399,6 +402,7 @@ load_b:
>   	}
>
>   	return 0;
> +#undef K
>   }
>   EXPORT_SYMBOL(sk_run_filter);
...
> +		/* RET_K, RET_A are remaped into 2 insns */
> +		case BPF_RET | BPF_A:
> +		case BPF_RET | BPF_K:
> +			insn->code = BPF_ALU | BPF_MOV |
> +				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);

Hmm, so the case statement is about BPF_RET | BPF_A and BPF_RET | BPF_K
but BPF_RET | BPF_X is not mentioned. However, in BPF_SRC(fp->code)
selection you fall back to BPF_X if it doesn't equal BPF_K? Is that
correct? And, you probably also need to handle BPF_RET | BPF_X ?

> +			insn->a_reg = 0;
> +			insn->x_reg = 6;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_RET | BPF_K;
> +			break;
...
> +	/* RET */
> +BPF_RET_BPF_K_0:
> +	return regs[0/* R0 */];

next prev parent reply	other threads:[~2014-03-04 10:01 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-04  5:18 [PATCH v4 net-next 0/3] Extended BPF, converter, seccomp, doc Alexei Starovoitov
2014-03-04  5:18 ` [PATCH v4 net-next 1/3] Extended BPF interpreter and converter Alexei Starovoitov
2014-03-04  9:59   ` Daniel Borkmann [this message]
2014-03-04 17:09     ` Alexei Starovoitov
2014-03-04 18:23       ` Daniel Borkmann
2014-03-04 14:28   ` Hagen Paul Pfeifer
2014-03-04 17:53     ` Alexei Starovoitov
2014-03-04 18:31       ` Daniel Borkmann
2014-03-04  5:18 ` [PATCH v4 net-next 2/3] RFC: convert seccomp to use extended BPF Alexei Starovoitov
2014-03-04  5:18 ` [PATCH v4 net-next 3/3] Extended BPF documentation Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5315A40E.6010209@redhat.com \
    --to=dborkman@redhat.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=acme@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=ast@plumgrid.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=hagen@jauu.net \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=jesse@nicira.com \
    --cc=jovi.zhangwei@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=masami.hiramatsu.pt@hitachi.com \
    --cc=mingo@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=penberg@iki.fi \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=tom.zanussi@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wad@chromium.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.