[PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter
@ 2014-03-04 22:17 Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 1/3] " Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-04 22:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

Hi All,

V1 patches:
http://thread.gmane.org/gmane.linux.kernel/1605783
V2 patches:
http://thread.gmane.org/gmane.linux.kernel/1642325
V3 patches:
http://thread.gmane.org/gmane.linux.kernel/1656538

V4 summary:
- addressed Daniel comments
- RFC for seccomp with extended BPF
- added extended BPF design doc

V5 summary:
- fixed commit one-liner, removed empty line
- added Hagen's ack

Please review.
Thanks!

Alexei Starovoitov (3):
  filter: add Extended BPF interpreter and converter
  [RFC] seccomp: convert seccomp to use extended BPF
  doc: filter: add Extended BPF documentation

 Documentation/networking/filter.txt |  181 ++++++++
 include/linux/filter.h              |    6 +-
 include/linux/netdevice.h           |    1 +
 include/linux/seccomp.h             |    1 -
 include/uapi/linux/filter.h         |   33 +-
 kernel/seccomp.c                    |  112 +++--
 net/core/filter.c                   |  807 ++++++++++++++++++++++++++++++++++-
 net/core/sysctl_net_core.c          |    7 +
 8 files changed, 1059 insertions(+), 89 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 net-next 1/3] filter: add Extended BPF interpreter and converter
  2014-03-04 22:17 [PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter Alexei Starovoitov
@ 2014-03-04 22:17 ` Alexei Starovoitov
  2014-03-05  9:24   ` Daniel Borkmann
  2014-03-04 22:17 ` [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation Alexei Starovoitov
  2 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-04 22:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

Extended BPF extends old BPF in the following ways:
- from 2 to 10 registers
  Original BPF has two registers (A and X) and hidden frame pointer.
  Extended BPF has ten registers and read-only frame pointer.
- from 32-bit registers to 64-bit registers
  semantics of old 32-bit ALU operations are preserved via 32-bit
  subregisters
- if (cond) jump_true; else jump_false;
  old BPF insns are replaced with:
  if (cond) jump_true; /* else fallthrough */
- adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced with
  up to 512 bytes of multi-use stack space
- introduces bpf_call insn and register passing convention for zero
  overhead calls from/to other kernel functions (not part of this patch)
- adds arithmetic right shift insn
- adds swab32/swab64 insns
- adds atomic_add insn
- old tax/txa insns are replaced with 'mov dst,src' insn

Extended BPF is designed to be JITed with one to one mapping, which
allows GCC/LLVM backends to generate optimized BPF code that performs
almost as fast as natively compiled code

sk_convert_filter() remaps old style insns into extended:
'sock_filter' instructions are remapped on the fly to
'sock_filter_ext' extended instructions when
sysctl net.core.bpf_ext_enable=1

Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
 if (bpf_ext_enable) {
    convert to new
    sk_chk_filter() - check old bpf
    use sk_run_filter_ext() - new interpreter
 } else {
    sk_chk_filter() - check old bpf
    if (bpf_jit_enable)
        use old jit
    else
        use sk_run_filter() - old interpreter
 }

sk_run_filter_ext() interpreter is noticeably faster
than sk_run_filter() for two reasons:

1.fall-through jumps
  Old BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty.
  Extended BPF jump instructions have one branch and fall-through,
  which fit CPU branch predictor logic better.
  'perf stat' shows drastic difference for branch-misses.

2.jump-threaded implementation of interpreter vs switch statement
  Instead of single tablejump at the top of 'switch' statement, GCC will
  generate multiple tablejump instructions, which helps CPU branch predictor

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from BPF micro-benchmark:
SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
time in nsec per call, smaller is better
--x86_64--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     90       101       192       202
ext BPF     31        71       47         97
old BPF jit 12        34       17         44
ext BPF jit TBD

--i386--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF    107        136      227       252
ext BPF     40        119       69       172

--arm32--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF    202        300      475       540
ext BPF    139        270      296       470
old BPF jit 26        182       37       202
new BPF jit TBD

Tested with trinify BPF fuzzer

Future work:

0. seccomp

1. add extended BPF JIT for x86_64

2. add inband old/new demux and extended BPF verifier, so that new programs
   can be loaded through old sk_attach_filter() and sk_unattached_filter_create()
   interfaces

3. tracing filters systemtap-like with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
 include/linux/filter.h      |    6 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   33 +-
 net/core/filter.c           |  802 ++++++++++++++++++++++++++++++++++++++++++-
 net/core/sysctl_net_core.c  |    7 +
 5 files changed, 827 insertions(+), 22 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..beaba8580862 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -52,7 +52,11 @@ extern int sk_detach_filter(struct sock *sk);
 extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+		      struct sock_filter_ext *new_prog,	int *p_new_len);
+unsigned int sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn);
 
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
@@ -70,7 +74,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
 			       16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -80,7 +83,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
 {
 	kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a869488b8ae..2c13d000389c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3054,6 +3054,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		bpf_ext_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..4e98fe16ba88 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,5 +1,6 @@
 /*
  * Linux Socket Filter Data Structures
+ * Extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
  */
 
 #ifndef _UAPI__LINUX_FILTER_H__
@@ -19,7 +20,7 @@
  *	Try and keep these values and structures similar to BSD, especially
  *	the BPF code definitions which need to match so you can share filters
  */
- 
+
 struct sock_filter {	/* Filter block */
 	__u16	code;   /* Actual filter code */
 	__u8	jt;	/* Jump true */
@@ -27,6 +28,14 @@ struct sock_filter {	/* Filter block */
 	__u32	k;      /* Generic multiuse field */
 };
 
+struct sock_filter_ext {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register */
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
 struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 	unsigned short		len;	/* Number of filter blocks */
 	struct sock_filter __user *filter;
@@ -45,12 +54,14 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_JMP         0x05
 #define         BPF_RET         0x06
 #define         BPF_MISC        0x07
+#define         BPF_ALU64       0x07
 
 /* ld/ldx fields */
 #define BPF_SIZE(code)  ((code) & 0x18)
 #define         BPF_W           0x00
 #define         BPF_H           0x08
 #define         BPF_B           0x10
+#define         BPF_DW          0x18
 #define BPF_MODE(code)  ((code) & 0xe0)
 #define         BPF_IMM         0x00
 #define         BPF_ABS         0x20
@@ -58,6 +69,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_MEM         0x60
 #define         BPF_LEN         0x80
 #define         BPF_MSH         0xa0
+#define         BPF_XADD        0xc0 /* exclusive add */
 
 /* alu/jmp fields */
 #define BPF_OP(code)    ((code) & 0xf0)
@@ -68,16 +80,24 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_OR          0x40
 #define         BPF_AND         0x50
 #define         BPF_LSH         0x60
-#define         BPF_RSH         0x70
+#define         BPF_RSH         0x70 /* logical shift right */
 #define         BPF_NEG         0x80
 #define		BPF_MOD		0x90
 #define		BPF_XOR		0xa0
+#define		BPF_MOV		0xb0 /* mov reg to reg */
+#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
+#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
+#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
 
 #define         BPF_JA          0x00
-#define         BPF_JEQ         0x10
-#define         BPF_JGT         0x20
-#define         BPF_JGE         0x30
-#define         BPF_JSET        0x40
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
+#define         BPF_JSET        0x40 /* if (A & X) */
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
@@ -134,5 +154,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
-
 #endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index ad30d626a5bd..1494421486b7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1,5 +1,6 @@
 /*
  * Linux Socket Filter - Kernel level socket filtering
+ * Extended BPF is Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
  *
  * Author:
  *     Jay Schulist <jschlst@samba.org>
@@ -40,6 +41,8 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 
+int bpf_ext_enable __read_mostly;
+
 /* No hurry in this branch
  *
  * Exported for the bpf jit load helper.
@@ -399,6 +402,7 @@ load_b:
 	}
 
 	return 0;
+#undef K
 }
 EXPORT_SYMBOL(sk_run_filter);
 
@@ -637,6 +641,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
 {
 	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
 
+	if ((void *)fp->bpf_func == (void *)sk_run_filter_ext)
+		/* arch specific jit_free are expecting this value */
+		fp->bpf_func = sk_run_filter;
+
 	bpf_jit_free(fp);
 }
 EXPORT_SYMBOL(sk_filter_release_rcu);
@@ -655,6 +663,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
 	return 0;
 }
 
+static int sk_prepare_filter_ext(struct sk_filter **pfp,
+				 struct sock_fprog *fprog, struct sock *sk)
+{
+	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+	struct sock_filter *old_prog;
+	unsigned int sk_fsize;
+	struct sk_filter *fp;
+	int new_len;
+	int err;
+
+	/* store old program into buffer, since chk_filter will remap opcodes */
+	old_prog = kmalloc(fsize, GFP_KERNEL);
+	if (!old_prog)
+		return -ENOMEM;
+
+	if (sk) {
+		if (copy_from_user(old_prog, fprog->filter, fsize)) {
+			err = -EFAULT;
+			goto free_prog;
+		}
+	} else {
+		memcpy(old_prog, fprog->filter, fsize);
+	}
+
+	/* calculate bpf_ext program length */
+	err = sk_convert_filter(fprog->filter, fprog->len, NULL, &new_len);
+	if (err)
+		goto free_prog;
+
+	sk_fsize = sk_filter_size(new_len);
+	/* allocate sk_filter to store bpf_ext program */
+	if (sk)
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+	else
+		fp = kmalloc(sk_fsize, GFP_KERNEL);
+	if (!fp) {
+		err = -ENOMEM;
+		goto free_prog;
+	}
+
+	/* remap sock_filter insns into sock_filter_ext insns */
+	err = sk_convert_filter(old_prog, fprog->len,
+				(struct sock_filter_ext *)fp->insns, &new_len);
+	if (err)
+		/* 2nd sk_convert_filter() can fail only if it fails
+		 * to allocate memory, remapping must succeed
+		 */
+		goto free_fp;
+
+	/* now chk_filter can overwrite old_prog while checking */
+	err = sk_chk_filter(old_prog, fprog->len);
+	if (err)
+		goto free_fp;
+
+	/* discard old prog */
+	kfree(old_prog);
+
+	atomic_set(&fp->refcnt, 1);
+	fp->len = new_len;
+
+	/* sock_filter_ext insns must be executed by sk_run_filter_ext */
+	fp->bpf_func = (typeof(fp->bpf_func))sk_run_filter_ext;
+
+	*pfp = fp;
+	return 0;
+free_fp:
+	if (sk)
+		sock_kfree_s(sk, fp, sk_fsize);
+	else
+		kfree(fp);
+free_prog:
+	kfree(old_prog);
+	return err;
+}
+
 /**
  *	sk_unattached_filter_create - create an unattached filter
  *	@fprog: the filter program
@@ -676,6 +759,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
+	if (bpf_ext_enable)
+		return sk_prepare_filter_ext(pfp, fprog, NULL);
+
 	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
 	if (!fp)
 		return -ENOMEM;
@@ -726,21 +812,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
-	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
-	if (!fp)
-		return -ENOMEM;
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		sock_kfree_s(sk, fp, sk_fsize);
-		return -EFAULT;
-	}
+	if (bpf_ext_enable) {
+		err = sk_prepare_filter_ext(&fp, fprog, sk);
+		if (err)
+			return err;
+	} else {
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+		if (!fp)
+			return -ENOMEM;
+		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+			sock_kfree_s(sk, fp, sk_fsize);
+			return -EFAULT;
+		}
 
-	atomic_set(&fp->refcnt, 1);
-	fp->len = fprog->len;
+		atomic_set(&fp->refcnt, 1);
+		fp->len = fprog->len;
 
-	err = __sk_prepare_filter(fp);
-	if (err) {
-		sk_filter_uncharge(sk, fp);
-		return err;
+		err = __sk_prepare_filter(fp);
+		if (err) {
+			sk_filter_uncharge(sk, fp);
+			return err;
+		}
 	}
 
 	old_fp = rcu_dereference_protected(sk->sk_filter,
@@ -882,3 +974,687 @@ out:
 	release_sock(sk);
 	return ret;
 }
+
+/**
+ *	sk_convert_filter - convert filter program
+ *	@old_prog: the filter program
+ *	@len: the length of filter program
+ *	@new_prog: buffer where converted program will be stored
+ *	@p_new_len: pointer to store length of converted program
+ *
+ * remap 'sock_filter' style BPF instruction set to 'sock_filter_ext' style
+ *
+ * first, call sk_convert_filter(old_prog, len, NULL, &new_len) to calculate new
+ * program length in one pass
+ *
+ * then new_prog = kmalloc(sizeof(struct sock_filter_ext) * new_len);
+ *
+ * and call it again: sk_convert_filter(old_prog, len, new_prog, &new_len);
+ * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
+ */
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+		      struct sock_filter_ext *new_prog, int *p_new_len)
+{
+	struct sock_filter_ext *new_insn;
+	struct sock_filter *fp;
+	int *addrs = NULL;
+	int new_len = 0;
+	int pass = 0;
+	int tgt, i;
+	u8 bpf_src;
+
+	if (len <= 0 || len >= BPF_MAXINSNS)
+		return -EINVAL;
+
+	if (new_prog) {
+		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+		if (!addrs)
+			return -ENOMEM;
+	}
+
+do_pass:
+	new_insn = new_prog;
+	fp = old_prog;
+	for (i = 0; i < len; fp++, i++) {
+		struct sock_filter_ext tmp_insns[3] = {};
+		struct sock_filter_ext *insn = tmp_insns;
+
+		if (addrs)
+			addrs[i] = new_insn - new_prog;
+
+		switch (fp->code) {
+		/* all arithmetic insns and skb loads map as-is */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_K:
+		case BPF_ALU | BPF_LSH | BPF_X:
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_X:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_XOR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_K:
+		case BPF_ALU | BPF_MUL | BPF_X:
+		case BPF_ALU | BPF_MUL | BPF_K:
+		case BPF_ALU | BPF_DIV | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_K:
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_MOD | BPF_K:
+		case BPF_ALU | BPF_NEG:
+		case BPF_LD | BPF_ABS | BPF_W:
+		case BPF_LD | BPF_ABS | BPF_H:
+		case BPF_LD | BPF_ABS | BPF_B:
+		case BPF_LD | BPF_IND | BPF_W:
+		case BPF_LD | BPF_IND | BPF_H:
+		case BPF_LD | BPF_IND | BPF_B:
+			insn->code = fp->code;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			break;
+
+		/* jump opcodes map as-is, but offsets need adjustment */
+		case BPF_JMP | BPF_JA:
+			tgt = i + fp->k + 1;
+			insn->code = fp->code;
+#define EMIT_JMP \
+	do { \
+		if (tgt >= len || tgt < 0) \
+			goto err; \
+		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
+		/* adjust pc relative offset for 2nd or 3rd insn */ \
+		insn->off -= insn - tmp_insns; \
+	} while (0)
+
+			EMIT_JMP;
+			break;
+
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JSET | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_X:
+			if (BPF_SRC(fp->code) == BPF_K &&
+			    (int)fp->k < 0) {
+				/* extended BPF immediates are signed,
+				 * zero extend immediate into tmp register
+				 * and use it in compare insn
+				 */
+				insn->code = BPF_ALU | BPF_MOV | BPF_K;
+				insn->a_reg = 2;
+				insn->imm = fp->k;
+				insn++;
+
+				insn->a_reg = 6;
+				insn->x_reg = 2;
+				bpf_src = BPF_X;
+			} else {
+				insn->a_reg = 6;
+				insn->x_reg = 7;
+				insn->imm = fp->k;
+				bpf_src = BPF_SRC(fp->code);
+			}
+			/* common case where 'jump_false' is next insn */
+			if (fp->jf == 0) {
+				insn->code = BPF_JMP | BPF_OP(fp->code) |
+					bpf_src;
+				tgt = i + fp->jt + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* convert JEQ into JNE when 'jump_true' is next insn */
+			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+				insn->code = BPF_JMP | BPF_JNE | bpf_src;
+				tgt = i + fp->jf + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* other jumps are mapped into two insns: Jxx and JA */
+			tgt = i + fp->jt + 1;
+			insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
+			EMIT_JMP;
+
+			insn++;
+			insn->code = BPF_JMP | BPF_JA;
+			tgt = i + fp->jf + 1;
+			EMIT_JMP;
+			break;
+
+		/* ldxb 4*([14]&0xf) is remaped into 3 insns */
+		case BPF_LDX | BPF_MSH | BPF_B:
+			insn->code = BPF_LD | BPF_ABS | BPF_B;
+			insn->a_reg = 7;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_AND | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 0xf;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_LSH | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 2;
+			break;
+
+		/* RET_K, RET_A are remaped into 2 insns */
+		case BPF_RET | BPF_A:
+		case BPF_RET | BPF_K:
+			insn->code = BPF_ALU | BPF_MOV |
+				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
+			insn->a_reg = 0;
+			insn->x_reg = 6;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_RET | BPF_K;
+			break;
+
+		/* store to stack */
+		case BPF_ST:
+		case BPF_STX:
+			insn->code = BPF_STX | BPF_MEM | BPF_W;
+			insn->a_reg = 10;
+			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+		/* load from stack */
+		case BPF_LD | BPF_MEM:
+		case BPF_LDX | BPF_MEM:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 10;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+		/* A = K or X = K */
+		case BPF_LD | BPF_IMM:
+		case BPF_LDX | BPF_IMM:
+			insn->code = BPF_ALU | BPF_MOV | BPF_K;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->imm = fp->k;
+			break;
+
+		/* X = A */
+		case BPF_MISC | BPF_TAX:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 7;
+			insn->x_reg = 6;
+			break;
+
+		/* A = X */
+		case BPF_MISC | BPF_TXA:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			break;
+
+		/* A = skb->len or X = skb->len */
+		case BPF_LD | BPF_W | BPF_LEN:
+		case BPF_LDX | BPF_W | BPF_LEN:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 1;
+			insn->off = offsetof(struct sk_buff, len);
+			break;
+
+		/* access seccomp_data fields */
+		case BPF_LDX | BPF_ABS | BPF_W:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = 6;
+			insn->x_reg = 1;
+			insn->off = fp->k;
+			break;
+
+		default:
+			/* pr_err("unknown opcode %02x\n", fp->code); */
+			goto err;
+		}
+
+		insn++;
+		if (new_prog) {
+			memcpy(new_insn, tmp_insns,
+			       sizeof(*insn) * (insn - tmp_insns));
+		}
+		new_insn += insn - tmp_insns;
+	}
+
+	if (!new_prog) {
+		/* only calculating new length */
+		*p_new_len = new_insn - new_prog;
+		return 0;
+	}
+
+	pass++;
+	if (new_len != new_insn - new_prog) {
+		new_len = new_insn - new_prog;
+		if (pass > 2)
+			goto err;
+		goto do_pass;
+	}
+	kfree(addrs);
+	if (*p_new_len != new_len)
+		/* inconsistent new program length */
+		pr_err("sk_convert_filter() usage error\n");
+	return 0;
+err:
+	kfree(addrs);
+	return -EINVAL;
+}
+
+/**
+ *	sk_run_filter_ext - run an extended filter
+ *	@ctx: buffer to run the filter on
+ *	@insn: filter to apply
+ *
+ * Decode and execute extended BPF instructions.
+ * @ctx is the data we are operating on.
+ * @filter is the array of filter instructions.
+ */
+notrace u32 sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn)
+{
+	u64 stack[64];
+	u64 regs[16];
+	void *ptr;
+	u64 tmp;
+	int off;
+
+#ifdef __x86_64
+#define LOAD_IMM /**/
+#define K insn->imm
+#else
+#define LOAD_IMM (K = insn->imm)
+	s32 K = insn->imm;
+#endif
+
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+
+#define CONT ({insn++; LOAD_IMM; goto select_insn; })
+#define CONT_JMP ({insn++; LOAD_IMM; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; LOAD_IMM; goto *jumptable[insn->code]; })
+ */
+
+	static const void *jumptable[256] = {
+		[0 ... 255] = &&default_label,
+#define DL(A, B, C) [A|B|C] = &&A##_##B##_##C,
+		DL(BPF_ALU, BPF_ADD, BPF_X)
+		DL(BPF_ALU, BPF_ADD, BPF_K)
+		DL(BPF_ALU, BPF_SUB, BPF_X)
+		DL(BPF_ALU, BPF_SUB, BPF_K)
+		DL(BPF_ALU, BPF_AND, BPF_X)
+		DL(BPF_ALU, BPF_AND, BPF_K)
+		DL(BPF_ALU, BPF_OR, BPF_X)
+		DL(BPF_ALU, BPF_OR, BPF_K)
+		DL(BPF_ALU, BPF_LSH, BPF_X)
+		DL(BPF_ALU, BPF_LSH, BPF_K)
+		DL(BPF_ALU, BPF_RSH, BPF_X)
+		DL(BPF_ALU, BPF_RSH, BPF_K)
+		DL(BPF_ALU, BPF_XOR, BPF_X)
+		DL(BPF_ALU, BPF_XOR, BPF_K)
+		DL(BPF_ALU, BPF_MUL, BPF_X)
+		DL(BPF_ALU, BPF_MUL, BPF_K)
+		DL(BPF_ALU, BPF_MOV, BPF_X)
+		DL(BPF_ALU, BPF_MOV, BPF_K)
+		DL(BPF_ALU, BPF_DIV, BPF_X)
+		DL(BPF_ALU, BPF_DIV, BPF_K)
+		DL(BPF_ALU, BPF_MOD, BPF_X)
+		DL(BPF_ALU, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_ADD, BPF_X)
+		DL(BPF_ALU64, BPF_ADD, BPF_K)
+		DL(BPF_ALU64, BPF_SUB, BPF_X)
+		DL(BPF_ALU64, BPF_SUB, BPF_K)
+		DL(BPF_ALU64, BPF_AND, BPF_X)
+		DL(BPF_ALU64, BPF_AND, BPF_K)
+		DL(BPF_ALU64, BPF_OR, BPF_X)
+		DL(BPF_ALU64, BPF_OR, BPF_K)
+		DL(BPF_ALU64, BPF_LSH, BPF_X)
+		DL(BPF_ALU64, BPF_LSH, BPF_K)
+		DL(BPF_ALU64, BPF_RSH, BPF_X)
+		DL(BPF_ALU64, BPF_RSH, BPF_K)
+		DL(BPF_ALU64, BPF_XOR, BPF_X)
+		DL(BPF_ALU64, BPF_XOR, BPF_K)
+		DL(BPF_ALU64, BPF_MUL, BPF_X)
+		DL(BPF_ALU64, BPF_MUL, BPF_K)
+		DL(BPF_ALU64, BPF_MOV, BPF_X)
+		DL(BPF_ALU64, BPF_MOV, BPF_K)
+		DL(BPF_ALU64, BPF_ARSH, BPF_X)
+		DL(BPF_ALU64, BPF_ARSH, BPF_K)
+		DL(BPF_ALU64, BPF_DIV, BPF_X)
+		DL(BPF_ALU64, BPF_DIV, BPF_K)
+		DL(BPF_ALU64, BPF_MOD, BPF_X)
+		DL(BPF_ALU64, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		DL(BPF_ALU, BPF_NEG, 0)
+		DL(BPF_JMP, BPF_CALL, 0)
+		DL(BPF_JMP, BPF_JA, 0)
+		DL(BPF_JMP, BPF_JEQ, BPF_X)
+		DL(BPF_JMP, BPF_JEQ, BPF_K)
+		DL(BPF_JMP, BPF_JNE, BPF_X)
+		DL(BPF_JMP, BPF_JNE, BPF_K)
+		DL(BPF_JMP, BPF_JGT, BPF_X)
+		DL(BPF_JMP, BPF_JGT, BPF_K)
+		DL(BPF_JMP, BPF_JGE, BPF_X)
+		DL(BPF_JMP, BPF_JGE, BPF_K)
+		DL(BPF_JMP, BPF_JSGT, BPF_X)
+		DL(BPF_JMP, BPF_JSGT, BPF_K)
+		DL(BPF_JMP, BPF_JSGE, BPF_X)
+		DL(BPF_JMP, BPF_JSGE, BPF_K)
+		DL(BPF_JMP, BPF_JSET, BPF_X)
+		DL(BPF_JMP, BPF_JSET, BPF_K)
+		DL(BPF_STX, BPF_MEM, BPF_B)
+		DL(BPF_STX, BPF_MEM, BPF_H)
+		DL(BPF_STX, BPF_MEM, BPF_W)
+		DL(BPF_STX, BPF_MEM, BPF_DW)
+		DL(BPF_ST, BPF_MEM, BPF_B)
+		DL(BPF_ST, BPF_MEM, BPF_H)
+		DL(BPF_ST, BPF_MEM, BPF_W)
+		DL(BPF_ST, BPF_MEM, BPF_DW)
+		DL(BPF_LDX, BPF_MEM, BPF_B)
+		DL(BPF_LDX, BPF_MEM, BPF_H)
+		DL(BPF_LDX, BPF_MEM, BPF_W)
+		DL(BPF_LDX, BPF_MEM, BPF_DW)
+		DL(BPF_STX, BPF_XADD, BPF_W)
+#ifdef CONFIG_64BIT
+		DL(BPF_STX, BPF_XADD, BPF_DW)
+#endif
+		DL(BPF_LD, BPF_ABS, BPF_W)
+		DL(BPF_LD, BPF_ABS, BPF_H)
+		DL(BPF_LD, BPF_ABS, BPF_B)
+		DL(BPF_LD, BPF_IND, BPF_W)
+		DL(BPF_LD, BPF_IND, BPF_H)
+		DL(BPF_LD, BPF_IND, BPF_B)
+		DL(BPF_RET, BPF_K, 0)
+#undef DL
+	};
+
+	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
+	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
+
+	/* execute 1st insn */
+select_insn:
+	goto *jumptable[insn->code];
+
+	/* ALU */
+#define ALU(OPCODE, OP) \
+	BPF_ALU64_##OPCODE##_BPF_X: \
+		A = A OP X; \
+		CONT; \
+	BPF_ALU_##OPCODE##_BPF_X: \
+		A = (u32)A OP (u32)X; \
+		CONT; \
+	BPF_ALU64_##OPCODE##_BPF_K: \
+		A = A OP K; \
+		CONT; \
+	BPF_ALU_##OPCODE##_BPF_K: \
+		A = (u32)A OP (u32)K; \
+		CONT;
+
+	ALU(BPF_ADD, +)
+	ALU(BPF_SUB, -)
+	ALU(BPF_AND, &)
+	ALU(BPF_OR, |)
+	ALU(BPF_LSH, <<)
+	ALU(BPF_RSH, >>)
+	ALU(BPF_XOR, ^)
+	ALU(BPF_MUL, *)
+#undef ALU
+
+BPF_ALU_BPF_NEG_0:
+	A = (u32)-A;
+	CONT;
+BPF_ALU_BPF_MOV_BPF_X:
+	A = (u32)X;
+	CONT;
+BPF_ALU_BPF_MOV_BPF_K:
+	A = (u32)K;
+	CONT;
+BPF_ALU64_BPF_MOV_BPF_X:
+	A = X;
+	CONT;
+BPF_ALU64_BPF_MOV_BPF_K:
+	A = K;
+	CONT;
+BPF_ALU64_BPF_ARSH_BPF_X:
+	(*(s64 *) &A) >>= X;
+	CONT;
+BPF_ALU64_BPF_ARSH_BPF_K:
+	(*(s64 *) &A) >>= K;
+	CONT;
+BPF_ALU64_BPF_MOD_BPF_X:
+	tmp = A;
+	if (X)
+		A = do_div(tmp, X);
+	CONT;
+BPF_ALU_BPF_MOD_BPF_X:
+	tmp = (u32)A;
+	if (X)
+		A = do_div(tmp, (u32)X);
+	CONT;
+BPF_ALU64_BPF_MOD_BPF_K:
+	tmp = A;
+	if (K)
+		A = do_div(tmp, K);
+	CONT;
+BPF_ALU_BPF_MOD_BPF_K:
+	tmp = (u32)A;
+	if (K)
+		A = do_div(tmp, (u32)K);
+	CONT;
+BPF_ALU64_BPF_DIV_BPF_X:
+	if (X)
+		do_div(A, X);
+	CONT;
+BPF_ALU_BPF_DIV_BPF_X:
+	tmp = (u32)A;
+	if (X)
+		do_div(tmp, (u32)X);
+	A = (u32)tmp;
+	CONT;
+BPF_ALU64_BPF_DIV_BPF_K:
+	if (K)
+		do_div(A, K);
+	CONT;
+BPF_ALU_BPF_DIV_BPF_K:
+	tmp = (u32)A;
+	if (K)
+		do_div(tmp, (u32)K);
+	A = (u32)tmp;
+	CONT;
+BPF_ALU64_BPF_BSWAP32_BPF_X:
+	A = swab32(A);
+	CONT;
+BPF_ALU64_BPF_BSWAP64_BPF_X:
+	A = swab64(A);
+	CONT;
+
+	/* CALL */
+BPF_JMP_BPF_CALL_0:
+	return 0; /* not implemented yet */
+
+	/* JMP */
+BPF_JMP_BPF_JA_0:
+	insn += insn->off;
+	CONT;
+BPF_JMP_BPF_JEQ_BPF_X:
+	if (A == X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JEQ_BPF_K:
+	if (A == K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JNE_BPF_X:
+	if (A != X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JNE_BPF_K:
+	if (A != K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGT_BPF_X:
+	if (A > X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGT_BPF_K:
+	if (A > K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGE_BPF_X:
+	if (A >= X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JGE_BPF_K:
+	if (A >= K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGT_BPF_X:
+	if (((s64)A) > ((s64)X)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGT_BPF_K:
+	if (((s64)A) > ((s64)K)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGE_BPF_X:
+	if (((s64)A) >= ((s64)X)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSGE_BPF_K:
+	if (((s64)A) >= ((s64)K)) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSET_BPF_X:
+	if (A & X) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+BPF_JMP_BPF_JSET_BPF_K:
+	if (A & (u32)K) {
+		insn += insn->off;
+		CONT_JMP;
+	}
+	CONT;
+
+	/* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+	BPF_STX_BPF_MEM_##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = X; \
+		CONT; \
+	BPF_ST_BPF_MEM_##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = K; \
+		CONT; \
+	BPF_LDX_BPF_MEM_##SIZEOP: \
+		A = *(SIZE *)(ulong)(X + insn->off); \
+		CONT;
+
+	LDST(BPF_B, u8)
+	LDST(BPF_H, u16)
+	LDST(BPF_W, u32)
+	LDST(BPF_DW, u64)
+#undef LDST
+
+BPF_STX_BPF_XADD_BPF_W: /* lock xadd *(u32 *)(A + insn->off) += X */
+	atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
+	CONT;
+#ifdef CONFIG_64BIT
+BPF_STX_BPF_XADD_BPF_DW: /* lock xadd *(u64 *)(A + insn->off) += X */
+	atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
+	CONT;
+#endif
+
+BPF_LD_BPF_ABS_BPF_W: /* A = *(u32 *)(SKB + K) */
+	off = K;
+load_word:
+	ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
+	if (likely(ptr != NULL)) {
+		A = get_unaligned_be32(ptr);
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_ABS_BPF_H: /* A = *(u16 *)(SKB + K) */
+	off = K;
+load_half:
+	ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
+	if (likely(ptr != NULL)) {
+		A = get_unaligned_be16(ptr);
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_ABS_BPF_B: /* A = *(u8 *)(SKB + K) */
+	off = K;
+load_byte:
+	ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
+	if (likely(ptr != NULL)) {
+		A = *(u8 *)ptr;
+		CONT;
+	}
+	return 0;
+
+BPF_LD_BPF_IND_BPF_W: /* A = *(u32 *)(SKB + X + K) */
+	off = K + X;
+	goto load_word;
+
+BPF_LD_BPF_IND_BPF_H: /* A = *(u16 *)(SKB + X + K) */
+	off = K + X;
+	goto load_half;
+
+BPF_LD_BPF_IND_BPF_B: /* A = *(u8 *)(SKB + X + K) */
+	off = K + X;
+	goto load_byte;
+
+	/* RET */
+BPF_RET_BPF_K_0:
+	return regs[0/* R0 */];
+
+default_label:
+	/* sk_chk_filter_ext() and sk_convert_filter() guarantee
+	 * that we never reach here
+	 */
+	WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+	return 0;
+#undef CONT
+#undef A
+#undef X
+#undef K
+#undef LOAD_IMM
+}
+EXPORT_SYMBOL(sk_run_filter_ext);
+
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..e1b979312588 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
 	},
 #endif
 	{
+		.procname	= "bpf_ext_enable",
+		.data		= &bpf_ext_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "netdev_tstamp_prequeue",
 		.data		= &netdev_tstamp_prequeue,
 		.maxlen		= sizeof(int),
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 1/3] filter: add Extended BPF interpreter and converter
  2014-03-04 22:17 ` [PATCH v5 net-next 1/3] " Alexei Starovoitov
@ 2014-03-05  9:24   ` Daniel Borkmann
  2014-03-05 18:13     ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Borkmann @ 2014-03-05  9:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

On 03/04/2014 11:17 PM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> - from 2 to 10 registers
>    Original BPF has two registers (A and X) and hidden frame pointer.
>    Extended BPF has ten registers and read-only frame pointer.
> - from 32-bit registers to 64-bit registers
>    semantics of old 32-bit ALU operations are preserved via 32-bit
>    subregisters
> - if (cond) jump_true; else jump_false;
>    old BPF insns are replaced with:
>    if (cond) jump_true; /* else fallthrough */
> - adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced with
>    up to 512 bytes of multi-use stack space
> - introduces bpf_call insn and register passing convention for zero
>    overhead calls from/to other kernel functions (not part of this patch)
> - adds arithmetic right shift insn
> - adds swab32/swab64 insns
> - adds atomic_add insn
> - old tax/txa insns are replaced with 'mov dst,src' insn
>
> Extended BPF is designed to be JITed with one to one mapping, which
> allows GCC/LLVM backends to generate optimized BPF code that performs
> almost as fast as natively compiled code
>
> sk_convert_filter() remaps old style insns into extended:
> 'sock_filter' instructions are remapped on the fly to
> 'sock_filter_ext' extended instructions when
> sysctl net.core.bpf_ext_enable=1
>
> Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
>   if (bpf_ext_enable) {
>      convert to new
>      sk_chk_filter() - check old bpf
>      use sk_run_filter_ext() - new interpreter
>   } else {
>      sk_chk_filter() - check old bpf
>      if (bpf_jit_enable)
>          use old jit
>      else
>          use sk_run_filter() - old interpreter
>   }
>
> sk_run_filter_ext() interpreter is noticeably faster
> than sk_run_filter() for two reasons:
>
> 1.fall-through jumps
>    Old BPF jump instructions are forced to go either 'true' or 'false'
>    branch which causes branch-miss penalty.
>    Extended BPF jump instructions have one branch and fall-through,
>    which fit CPU branch predictor logic better.
>    'perf stat' shows drastic difference for branch-misses.
>
> 2.jump-threaded implementation of interpreter vs switch statement
>    Instead of single tablejump at the top of 'switch' statement, GCC will
>    generate multiple tablejump instructions, which helps CPU branch predictor
>
> Performance of two BPF filters generated by libpcap was measured
> on x86_64, i386 and arm32.
>
> fprog #1 is taken from Documentation/networking/filter.txt:
> tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>     ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Other libpcap programs have similar performance differences.
>
> Raw performance data from BPF micro-benchmark:
> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
> time in nsec per call, smaller is better
> --x86_64--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF     90       101       192       202
> ext BPF     31        71       47         97
> old BPF jit 12        34       17         44
> ext BPF jit TBD
>
> --i386--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF    107        136      227       252
> ext BPF     40        119       69       172
>
> --arm32--
>           fprog #1  fprog #1   fprog #2  fprog #2
>           cache-hit cache-miss cache-hit cache-miss
> old BPF    202        300      475       540
> ext BPF    139        270      296       470
> old BPF jit 26        182       37       202
> new BPF jit TBD
>
> Tested with trinify BPF fuzzer
>
> Future work:
>
> 0. seccomp
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new programs
>     can be loaded through old sk_attach_filter() and sk_unattached_filter_create()
>     interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>

 From what I can tell, looks good to me:

Reviewed-by: Daniel Borkmann <dborkman@redhat.com>

So next step would be to add selftests and then after that JIT?

...
> +#undef LOAD_IMM
> +}
> +EXPORT_SYMBOL(sk_run_filter_ext);
> +

One minor thing I noticed when I git-am'ed your patch is the newline at
the end of file, but perhaps this can be fixed up in directly patchwork.

> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cf9cd13509a7..e1b979312588 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 1/3] filter: add Extended BPF interpreter and converter
  2014-03-05  9:24   ` Daniel Borkmann
@ 2014-03-05 18:13     ` Alexei Starovoitov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-05 18:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

On Wed, Mar 5, 2014 at 1:24 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 03/04/2014 11:17 PM, Alexei Starovoitov wrote:
>>
>> Extended BPF extends old BPF in the following ways:
>> - from 2 to 10 registers
>>    Original BPF has two registers (A and X) and hidden frame pointer.
>>    Extended BPF has ten registers and read-only frame pointer.
>> - from 32-bit registers to 64-bit registers
>>    semantics of old 32-bit ALU operations are preserved via 32-bit
>>    subregisters
>> - if (cond) jump_true; else jump_false;
>>    old BPF insns are replaced with:
>>    if (cond) jump_true; /* else fallthrough */
>> - adds signed > and >= insns
>> - 16 4-byte stack slots for register spill-fill replaced with
>>    up to 512 bytes of multi-use stack space
>> - introduces bpf_call insn and register passing convention for zero
>>    overhead calls from/to other kernel functions (not part of this patch)
>> - adds arithmetic right shift insn
>> - adds swab32/swab64 insns
>> - adds atomic_add insn
>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>
>> Extended BPF is designed to be JITed with one to one mapping, which
>> allows GCC/LLVM backends to generate optimized BPF code that performs
>> almost as fast as natively compiled code
>>
>> sk_convert_filter() remaps old style insns into extended:
>> 'sock_filter' instructions are remapped on the fly to
>> 'sock_filter_ext' extended instructions when
>> sysctl net.core.bpf_ext_enable=1
>>
>> Old filter comes through sk_attach_filter() or
>> sk_unattached_filter_create()
>>   if (bpf_ext_enable) {
>>      convert to new
>>      sk_chk_filter() - check old bpf
>>      use sk_run_filter_ext() - new interpreter
>>   } else {
>>      sk_chk_filter() - check old bpf
>>      if (bpf_jit_enable)
>>          use old jit
>>      else
>>          use sk_run_filter() - old interpreter
>>   }
>>
>> sk_run_filter_ext() interpreter is noticeably faster
>> than sk_run_filter() for two reasons:
>>
>> 1.fall-through jumps
>>    Old BPF jump instructions are forced to go either 'true' or 'false'
>>    branch which causes branch-miss penalty.
>>    Extended BPF jump instructions have one branch and fall-through,
>>    which fit CPU branch predictor logic better.
>>    'perf stat' shows drastic difference for branch-misses.
>>
>> 2.jump-threaded implementation of interpreter vs switch statement
>>    Instead of single tablejump at the top of 'switch' statement, GCC will
>>    generate multiple tablejump instructions, which helps CPU branch
>> predictor
>>
>> Performance of two BPF filters generated by libpcap was measured
>> on x86_64, i386 and arm32.
>>
>> fprog #1 is taken from Documentation/networking/filter.txt:
>> tcpdump -i eth0 port 22 -dd
>>
>> fprog #2 is taken from 'man tcpdump':
>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>     ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>
>> Other libpcap programs have similar performance differences.
>>
>> Raw performance data from BPF micro-benchmark:
>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>> time in nsec per call, smaller is better
>> --x86_64--
>>           fprog #1  fprog #1   fprog #2  fprog #2
>>           cache-hit cache-miss cache-hit cache-miss
>> old BPF     90       101       192       202
>> ext BPF     31        71       47         97
>> old BPF jit 12        34       17         44
>> ext BPF jit TBD
>>
>> --i386--
>>           fprog #1  fprog #1   fprog #2  fprog #2
>>           cache-hit cache-miss cache-hit cache-miss
>> old BPF    107        136      227       252
>> ext BPF     40        119       69       172
>>
>> --arm32--
>>           fprog #1  fprog #1   fprog #2  fprog #2
>>           cache-hit cache-miss cache-hit cache-miss
>> old BPF    202        300      475       540
>> ext BPF    139        270      296       470
>> old BPF jit 26        182       37       202
>> new BPF jit TBD
>>
>> Tested with trinify BPF fuzzer
>>
>> Future work:
>>
>> 0. seccomp
>>
>> 1. add extended BPF JIT for x86_64
>>
>> 2. add inband old/new demux and extended BPF verifier, so that new
>> programs
>>     can be loaded through old sk_attach_filter() and
>> sk_unattached_filter_create()
>>     interfaces
>>
>> 3. tracing filters systemtap-like with extended BPF
>>
>> 4. OVS with extended BPF
>>
>> 5. nftables with extended BPF
>>
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
>
>
> From what I can tell, looks good to me:
>
> Reviewed-by: Daniel Borkmann <dborkman@redhat.com>

Thanks!

> So next step would be to add selftests and then after that JIT?

yes. that's the plan.
Will probably split selftest into few patches to simplify review.

JIT should go into the existing bpf_jit_comp.c and call it
bpf_ext_jit_compile(), right?

>> +#undef LOAD_IMM
>> +}
>> +EXPORT_SYMBOL(sk_run_filter_ext);
>> +
>
>
> One minor thing I noticed when I git-am'ed your patch is the newline at
> the end of file, but perhaps this can be fixed up in directly patchwork.

oops. Too bad checkpatch.pl didn't complain about that... will address
in the future.

Thanks!
Alexei

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF
  2014-03-04 22:17 [PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 1/3] " Alexei Starovoitov
@ 2014-03-04 22:17 ` Alexei Starovoitov
  2014-03-05  3:11   ` Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation Alexei Starovoitov
  2 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-04 22:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

use sk_convert_filter() to convert seccomp BPF into extended BPF

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...
  rc = seccomp_load(ctx);
  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

--x86_64--
old BPF: 8.6 seconds
    73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
    10.70%    bench  libc-2.15.so       [.] syscall
     5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
     1.97%    bench  [kernel.kallsyms]  [k] system_call
ext BPF: 5.7 seconds
    66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_ext
    16.75%    bench  libc-2.15.so       [.] syscall
     3.31%    bench  [kernel.kallsyms]  [k] system_call
     2.88%    bench  [kernel.kallsyms]  [k] __secure_computing

--i386---
old BPF: 5.4 sec
ext BPF: 3.8 sec

BPF filters generated by seccomp are very branchy, so ext BPF
performance is better than old BPF.

Performance gains will be even higher when extended BPF JIT
is committed.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

---
This patch is an RFC to use extended BPF in seccomp.
Change it to do it conditionally with bpf_ext_enable knob ?
---
 include/linux/seccomp.h |    1 -
 kernel/seccomp.c        |  112 +++++++++++++++++++++--------------------------
 net/core/filter.c       |    5 ---
 3 files changed, 51 insertions(+), 67 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6f19cfd1840e..4054b0994071 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
-extern u32 seccomp_bpf_load(int off);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b7a10048a32c..346c597b44cc 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -55,60 +55,27 @@ struct seccomp_filter {
 	atomic_t usage;
 	struct seccomp_filter *prev;
 	unsigned short len;  /* Instruction count */
-	struct sock_filter insns[];
+	struct sock_filter_ext insns[];
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
 #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
 
-/**
- * get_u32 - returns a u32 offset into data
- * @data: a unsigned 64 bit value
- * @index: 0 or 1 to return the first or second 32-bits
- *
- * This inline exists to hide the length of unsigned long.  If a 32-bit
- * unsigned long is passed in, it will be extended and the top 32-bits will be
- * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
- * properly returned.
- *
+/*
  * Endianness is explicitly ignored and left for BPF program authors to manage
  * as per the specific architecture.
  */
-static inline u32 get_u32(u64 data, int index)
-{
-	return ((u32 *)&data)[index];
-}
-
-/* Helper for bpf_load below. */
-#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
-/**
- * bpf_load: checks and returns a pointer to the requested offset
- * @off: offset into struct seccomp_data to load from
- *
- * Returns the requested 32-bits of data.
- * seccomp_check_filter() should assure that @off is 32-bit aligned
- * and not out of bounds.  Failure to do so is a BUG.
- */
-u32 seccomp_bpf_load(int off)
+static void populate_seccomp_data(struct seccomp_data *sd)
 {
 	struct pt_regs *regs = task_pt_regs(current);
-	if (off == BPF_DATA(nr))
-		return syscall_get_nr(current, regs);
-	if (off == BPF_DATA(arch))
-		return syscall_get_arch(current, regs);
-	if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
-		unsigned long value;
-		int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
-		int index = !!(off % sizeof(u64));
-		syscall_get_arguments(current, regs, arg, 1, &value);
-		return get_u32(value, index);
-	}
-	if (off == BPF_DATA(instruction_pointer))
-		return get_u32(KSTK_EIP(current), 0);
-	if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
-		return get_u32(KSTK_EIP(current), 1);
-	/* seccomp_check_filter should make this impossible. */
-	BUG();
+	int i;
+
+	sd->nr = syscall_get_nr(current, regs);
+	sd->arch = syscall_get_arch(current, regs);
+	for (i = 0; i < 6; i++)
+		syscall_get_arguments(current, regs, i, 1,
+				      (unsigned long *)&sd->args[i]);
+	sd->instruction_pointer = KSTK_EIP(current);
 }
 
 /**
@@ -133,17 +100,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 
 		switch (code) {
 		case BPF_S_LD_W_ABS:
-			ftest->code = BPF_S_ANC_SECCOMP_LD_W;
+			ftest->code = BPF_LDX | BPF_W | BPF_ABS;
 			/* 32-bit aligned and not out of bounds. */
 			if (k >= sizeof(struct seccomp_data) || k & 3)
 				return -EINVAL;
 			continue;
 		case BPF_S_LD_W_LEN:
-			ftest->code = BPF_S_LD_IMM;
+			ftest->code = BPF_LD | BPF_IMM;
 			ftest->k = sizeof(struct seccomp_data);
 			continue;
 		case BPF_S_LDX_W_LEN:
-			ftest->code = BPF_S_LDX_IMM;
+			ftest->code = BPF_LDX | BPF_IMM;
 			ftest->k = sizeof(struct seccomp_data);
 			continue;
 		/* Explicitly include allowed calls. */
@@ -185,6 +152,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 		case BPF_S_JMP_JGT_X:
 		case BPF_S_JMP_JSET_K:
 		case BPF_S_JMP_JSET_X:
+			sk_decode_filter(ftest, ftest);
 			continue;
 		default:
 			return -EINVAL;
@@ -202,18 +170,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
 static u32 seccomp_run_filters(int syscall)
 {
 	struct seccomp_filter *f;
+	struct seccomp_data sd;
 	u32 ret = SECCOMP_RET_ALLOW;
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (WARN_ON(current->seccomp.filter == NULL))
 		return SECCOMP_RET_KILL;
 
+	populate_seccomp_data(&sd);
+
 	/*
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (f = current->seccomp.filter; f; f = f->prev) {
-		u32 cur_ret = sk_run_filter(NULL, f->insns);
+		u32 cur_ret = sk_run_filter_ext(&sd, f->insns);
 		if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
 			ret = cur_ret;
 	}
@@ -231,6 +202,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	struct seccomp_filter *filter;
 	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
 	unsigned long total_insns = fprog->len;
+	struct sock_filter *fp;
+	int new_len;
 	long ret;
 
 	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
@@ -252,28 +225,42 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 				     CAP_SYS_ADMIN) != 0)
 		return -EACCES;
 
-	/* Allocate a new seccomp_filter */
-	filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
-			 GFP_KERNEL|__GFP_NOWARN);
-	if (!filter)
+	fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
+	if (!fp)
 		return -ENOMEM;
-	atomic_set(&filter->usage, 1);
-	filter->len = fprog->len;
 
 	/* Copy the instructions from fprog. */
 	ret = -EFAULT;
-	if (copy_from_user(filter->insns, fprog->filter, fp_size))
-		goto fail;
+	if (copy_from_user(fp, fprog->filter, fp_size))
+		goto free_prog;
 
 	/* Check and rewrite the fprog via the skb checker */
-	ret = sk_chk_filter(filter->insns, filter->len);
+	ret = sk_chk_filter(fp, fprog->len);
 	if (ret)
-		goto fail;
+		goto free_prog;
 
 	/* Check and rewrite the fprog for seccomp use */
-	ret = seccomp_check_filter(filter->insns, filter->len);
+	ret = seccomp_check_filter(fp, fprog->len);
 	if (ret)
-		goto fail;
+		goto free_prog;
+
+	/* convert 'sock_filter' insns to 'sock_filter_ext' insns */
+	ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
+	if (ret)
+		goto free_prog;
+
+	/* Allocate a new seccomp_filter */
+	filter = kzalloc(sizeof(struct seccomp_filter) +
+			 sizeof(struct sock_filter_ext) * new_len,
+			 GFP_KERNEL|__GFP_NOWARN);
+	if (!filter)
+		goto free_prog;
+
+	ret = sk_convert_filter(fp, fprog->len, filter->insns, &new_len);
+	if (ret)
+		goto free_filter;
+	atomic_set(&filter->usage, 1);
+	filter->len = new_len;
 
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
@@ -282,8 +269,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	filter->prev = current->seccomp.filter;
 	current->seccomp.filter = filter;
 	return 0;
-fail:
+
+free_filter:
 	kfree(filter);
+free_prog:
+	kfree(fp);
 	return ret;
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 1494421486b7..f144a6a7d939 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -388,11 +388,6 @@ load_b:
 				A = 0;
 			continue;
 		}
-#ifdef CONFIG_SECCOMP_FILTER
-		case BPF_S_ANC_SECCOMP_LD_W:
-			A = seccomp_bpf_load(fentry->k);
-			continue;
-#endif
 		default:
 			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
 				       fentry->code, fentry->jt,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF
  2014-03-04 22:17 ` [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF Alexei Starovoitov
@ 2014-03-05  3:11   ` Alexei Starovoitov
  2014-03-05 21:42     ` Kees Cook
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-05  3:11 UTC (permalink / raw)
  To: Daniel Borkmann, Kees Cook
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

On Tue, Mar 4, 2014 at 2:17 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> use sk_convert_filter() to convert seccomp BPF into extended BPF
>
> 05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
>   seccomp_rule_add_exact(ctx,...
>   seccomp_rule_add_exact(ctx,...
>   rc = seccomp_load(ctx);
>   for (i = 0; i < 10000000; i++)
>      syscall(199, 100);
>
> --x86_64--
> old BPF: 8.6 seconds
>     73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
>     10.70%    bench  libc-2.15.so       [.] syscall
>      5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
>      1.97%    bench  [kernel.kallsyms]  [k] system_call
> ext BPF: 5.7 seconds
>     66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_ext
>     16.75%    bench  libc-2.15.so       [.] syscall
>      3.31%    bench  [kernel.kallsyms]  [k] system_call
>      2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
>
> --i386---
> old BPF: 5.4 sec
> ext BPF: 3.8 sec
>
> BPF filters generated by seccomp are very branchy, so ext BPF
> performance is better than old BPF.
>
> Performance gains will be even higher when extended BPF JIT
> is committed.
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>
> ---
> This patch is an RFC to use extended BPF in seccomp.
> Change it to do it conditionally with bpf_ext_enable knob ?

Kees, Will,

I've played with libseccomp to test this patch and just found
your other testsuite for bpf+seccomp:
https://github.com/redpig/seccomp
It passes as well on x86_64 and i386.

Not sure how real all 'seccomp' and libseccomp' bpf filters.
Some of them are very short, some very long.
If average number of filtered syscalls is > 10, then upcoming
'ebpf table' will give another performance boost, since single table
lookup will be faster than long sequence of 'if' conditions.

Please review.

Thanks
Alexei

> ---
>  include/linux/seccomp.h |    1 -
>  kernel/seccomp.c        |  112 +++++++++++++++++++++--------------------------
>  net/core/filter.c       |    5 ---
>  3 files changed, 51 insertions(+), 67 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 6f19cfd1840e..4054b0994071 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
>  #ifdef CONFIG_SECCOMP_FILTER
>  extern void put_seccomp_filter(struct task_struct *tsk);
>  extern void get_seccomp_filter(struct task_struct *tsk);
> -extern u32 seccomp_bpf_load(int off);
>  #else  /* CONFIG_SECCOMP_FILTER */
>  static inline void put_seccomp_filter(struct task_struct *tsk)
>  {
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b7a10048a32c..346c597b44cc 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -55,60 +55,27 @@ struct seccomp_filter {
>         atomic_t usage;
>         struct seccomp_filter *prev;
>         unsigned short len;  /* Instruction count */
> -       struct sock_filter insns[];
> +       struct sock_filter_ext insns[];
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>
> -/**
> - * get_u32 - returns a u32 offset into data
> - * @data: a unsigned 64 bit value
> - * @index: 0 or 1 to return the first or second 32-bits
> - *
> - * This inline exists to hide the length of unsigned long.  If a 32-bit
> - * unsigned long is passed in, it will be extended and the top 32-bits will be
> - * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
> - * properly returned.
> - *
> +/*
>   * Endianness is explicitly ignored and left for BPF program authors to manage
>   * as per the specific architecture.
>   */
> -static inline u32 get_u32(u64 data, int index)
> -{
> -       return ((u32 *)&data)[index];
> -}
> -
> -/* Helper for bpf_load below. */
> -#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
> -/**
> - * bpf_load: checks and returns a pointer to the requested offset
> - * @off: offset into struct seccomp_data to load from
> - *
> - * Returns the requested 32-bits of data.
> - * seccomp_check_filter() should assure that @off is 32-bit aligned
> - * and not out of bounds.  Failure to do so is a BUG.
> - */
> -u32 seccomp_bpf_load(int off)
> +static void populate_seccomp_data(struct seccomp_data *sd)
>  {
>         struct pt_regs *regs = task_pt_regs(current);
> -       if (off == BPF_DATA(nr))
> -               return syscall_get_nr(current, regs);
> -       if (off == BPF_DATA(arch))
> -               return syscall_get_arch(current, regs);
> -       if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
> -               unsigned long value;
> -               int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
> -               int index = !!(off % sizeof(u64));
> -               syscall_get_arguments(current, regs, arg, 1, &value);
> -               return get_u32(value, index);
> -       }
> -       if (off == BPF_DATA(instruction_pointer))
> -               return get_u32(KSTK_EIP(current), 0);
> -       if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
> -               return get_u32(KSTK_EIP(current), 1);
> -       /* seccomp_check_filter should make this impossible. */
> -       BUG();
> +       int i;
> +
> +       sd->nr = syscall_get_nr(current, regs);
> +       sd->arch = syscall_get_arch(current, regs);
> +       for (i = 0; i < 6; i++)
> +               syscall_get_arguments(current, regs, i, 1,
> +                                     (unsigned long *)&sd->args[i]);
> +       sd->instruction_pointer = KSTK_EIP(current);
>  }
>
>  /**
> @@ -133,17 +100,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>
>                 switch (code) {
>                 case BPF_S_LD_W_ABS:
> -                       ftest->code = BPF_S_ANC_SECCOMP_LD_W;
> +                       ftest->code = BPF_LDX | BPF_W | BPF_ABS;
>                         /* 32-bit aligned and not out of bounds. */
>                         if (k >= sizeof(struct seccomp_data) || k & 3)
>                                 return -EINVAL;
>                         continue;
>                 case BPF_S_LD_W_LEN:
> -                       ftest->code = BPF_S_LD_IMM;
> +                       ftest->code = BPF_LD | BPF_IMM;
>                         ftest->k = sizeof(struct seccomp_data);
>                         continue;
>                 case BPF_S_LDX_W_LEN:
> -                       ftest->code = BPF_S_LDX_IMM;
> +                       ftest->code = BPF_LDX | BPF_IMM;
>                         ftest->k = sizeof(struct seccomp_data);
>                         continue;
>                 /* Explicitly include allowed calls. */
> @@ -185,6 +152,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>                 case BPF_S_JMP_JGT_X:
>                 case BPF_S_JMP_JSET_K:
>                 case BPF_S_JMP_JSET_X:
> +                       sk_decode_filter(ftest, ftest);
>                         continue;
>                 default:
>                         return -EINVAL;
> @@ -202,18 +170,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>  static u32 seccomp_run_filters(int syscall)
>  {
>         struct seccomp_filter *f;
> +       struct seccomp_data sd;
>         u32 ret = SECCOMP_RET_ALLOW;
>
>         /* Ensure unexpected behavior doesn't result in failing open. */
>         if (WARN_ON(current->seccomp.filter == NULL))
>                 return SECCOMP_RET_KILL;
>
> +       populate_seccomp_data(&sd);
> +
>         /*
>          * All filters in the list are evaluated and the lowest BPF return
>          * value always takes priority (ignoring the DATA).
>          */
>         for (f = current->seccomp.filter; f; f = f->prev) {
> -               u32 cur_ret = sk_run_filter(NULL, f->insns);
> +               u32 cur_ret = sk_run_filter_ext(&sd, f->insns);
>                 if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>                         ret = cur_ret;
>         }
> @@ -231,6 +202,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>         struct seccomp_filter *filter;
>         unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>         unsigned long total_insns = fprog->len;
> +       struct sock_filter *fp;
> +       int new_len;
>         long ret;
>
>         if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> @@ -252,28 +225,42 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>                                      CAP_SYS_ADMIN) != 0)
>                 return -EACCES;
>
> -       /* Allocate a new seccomp_filter */
> -       filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
> -                        GFP_KERNEL|__GFP_NOWARN);
> -       if (!filter)
> +       fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
> +       if (!fp)
>                 return -ENOMEM;
> -       atomic_set(&filter->usage, 1);
> -       filter->len = fprog->len;
>
>         /* Copy the instructions from fprog. */
>         ret = -EFAULT;
> -       if (copy_from_user(filter->insns, fprog->filter, fp_size))
> -               goto fail;
> +       if (copy_from_user(fp, fprog->filter, fp_size))
> +               goto free_prog;
>
>         /* Check and rewrite the fprog via the skb checker */
> -       ret = sk_chk_filter(filter->insns, filter->len);
> +       ret = sk_chk_filter(fp, fprog->len);
>         if (ret)
> -               goto fail;
> +               goto free_prog;
>
>         /* Check and rewrite the fprog for seccomp use */
> -       ret = seccomp_check_filter(filter->insns, filter->len);
> +       ret = seccomp_check_filter(fp, fprog->len);
>         if (ret)
> -               goto fail;
> +               goto free_prog;
> +
> +       /* convert 'sock_filter' insns to 'sock_filter_ext' insns */
> +       ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
> +       if (ret)
> +               goto free_prog;
> +
> +       /* Allocate a new seccomp_filter */
> +       filter = kzalloc(sizeof(struct seccomp_filter) +
> +                        sizeof(struct sock_filter_ext) * new_len,
> +                        GFP_KERNEL|__GFP_NOWARN);
> +       if (!filter)
> +               goto free_prog;
> +
> +       ret = sk_convert_filter(fp, fprog->len, filter->insns, &new_len);
> +       if (ret)
> +               goto free_filter;
> +       atomic_set(&filter->usage, 1);
> +       filter->len = new_len;
>
>         /*
>          * If there is an existing filter, make it the prev and don't drop its
> @@ -282,8 +269,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>         filter->prev = current->seccomp.filter;
>         current->seccomp.filter = filter;
>         return 0;
> -fail:
> +
> +free_filter:
>         kfree(filter);
> +free_prog:
> +       kfree(fp);
>         return ret;
>  }
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1494421486b7..f144a6a7d939 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -388,11 +388,6 @@ load_b:
>                                 A = 0;
>                         continue;
>                 }
> -#ifdef CONFIG_SECCOMP_FILTER
> -               case BPF_S_ANC_SECCOMP_LD_W:
> -                       A = seccomp_bpf_load(fentry->k);
> -                       continue;
> -#endif
>                 default:
>                         WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>                                        fentry->code, fentry->jt,
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF
  2014-03-05  3:11   ` Alexei Starovoitov
@ 2014-03-05 21:42     ` Kees Cook
  2014-03-06  2:00       ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Kees Cook @ 2014-03-05 21:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, David S. Miller, Ingo Molnar, Will Drewry,
	Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Hagen Paul Pfeifer, Jesse Gross, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, LKML, netdev

On Tue, Mar 4, 2014 at 7:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Tue, Mar 4, 2014 at 2:17 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> use sk_convert_filter() to convert seccomp BPF into extended BPF
>>
>> 05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
>>   seccomp_rule_add_exact(ctx,...
>>   seccomp_rule_add_exact(ctx,...
>>   rc = seccomp_load(ctx);
>>   for (i = 0; i < 10000000; i++)
>>      syscall(199, 100);
>>
>> --x86_64--
>> old BPF: 8.6 seconds
>>     73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
>>     10.70%    bench  libc-2.15.so       [.] syscall
>>      5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
>>      1.97%    bench  [kernel.kallsyms]  [k] system_call
>> ext BPF: 5.7 seconds
>>     66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_ext
>>     16.75%    bench  libc-2.15.so       [.] syscall
>>      3.31%    bench  [kernel.kallsyms]  [k] system_call
>>      2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
>>
>> --i386---
>> old BPF: 5.4 sec
>> ext BPF: 3.8 sec
>>
>> BPF filters generated by seccomp are very branchy, so ext BPF
>> performance is better than old BPF.
>>
>> Performance gains will be even higher when extended BPF JIT
>> is committed.

Very cool! Have you had a chance to compare on ARM too?

>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>
>> ---
>> This patch is an RFC to use extended BPF in seccomp.
>> Change it to do it conditionally with bpf_ext_enable knob ?
>
> Kees, Will,
>
> I've played with libseccomp to test this patch and just found
> your other testsuite for bpf+seccomp:
> https://github.com/redpig/seccomp
> It passes as well on x86_64 and i386.

Great! If it's passing those tests, then things should be in good shape.

> Not sure how real all 'seccomp' and libseccomp' bpf filters.
> Some of them are very short, some very long.
> If average number of filtered syscalls is > 10, then upcoming
> 'ebpf table' will give another performance boost, since single table
> lookup will be faster than long sequence of 'if' conditions.

To take advantage of that, will seccomp need a new (prctl) interface
to pass in a seccomp ebpf?

> Please review.
>
> Thanks
> Alexei
>
>> ---
>>  include/linux/seccomp.h |    1 -
>>  kernel/seccomp.c        |  112 +++++++++++++++++++++--------------------------
>>  net/core/filter.c       |    5 ---
>>  3 files changed, 51 insertions(+), 67 deletions(-)
>>
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 6f19cfd1840e..4054b0994071 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
>>  #ifdef CONFIG_SECCOMP_FILTER
>>  extern void put_seccomp_filter(struct task_struct *tsk);
>>  extern void get_seccomp_filter(struct task_struct *tsk);
>> -extern u32 seccomp_bpf_load(int off);
>>  #else  /* CONFIG_SECCOMP_FILTER */
>>  static inline void put_seccomp_filter(struct task_struct *tsk)
>>  {
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index b7a10048a32c..346c597b44cc 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -55,60 +55,27 @@ struct seccomp_filter {
>>         atomic_t usage;
>>         struct seccomp_filter *prev;
>>         unsigned short len;  /* Instruction count */
>> -       struct sock_filter insns[];
>> +       struct sock_filter_ext insns[];
>>  };
>>
>>  /* Limit any path through the tree to 256KB worth of instructions. */
>>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>>
>> -/**
>> - * get_u32 - returns a u32 offset into data
>> - * @data: a unsigned 64 bit value
>> - * @index: 0 or 1 to return the first or second 32-bits
>> - *
>> - * This inline exists to hide the length of unsigned long.  If a 32-bit
>> - * unsigned long is passed in, it will be extended and the top 32-bits will be
>> - * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
>> - * properly returned.
>> - *
>> +/*
>>   * Endianness is explicitly ignored and left for BPF program authors to manage
>>   * as per the specific architecture.
>>   */
>> -static inline u32 get_u32(u64 data, int index)
>> -{
>> -       return ((u32 *)&data)[index];
>> -}
>> -
>> -/* Helper for bpf_load below. */
>> -#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
>> -/**
>> - * bpf_load: checks and returns a pointer to the requested offset
>> - * @off: offset into struct seccomp_data to load from
>> - *
>> - * Returns the requested 32-bits of data.
>> - * seccomp_check_filter() should assure that @off is 32-bit aligned
>> - * and not out of bounds.  Failure to do so is a BUG.
>> - */
>> -u32 seccomp_bpf_load(int off)
>> +static void populate_seccomp_data(struct seccomp_data *sd)
>>  {
>>         struct pt_regs *regs = task_pt_regs(current);
>> -       if (off == BPF_DATA(nr))
>> -               return syscall_get_nr(current, regs);
>> -       if (off == BPF_DATA(arch))
>> -               return syscall_get_arch(current, regs);
>> -       if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
>> -               unsigned long value;
>> -               int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
>> -               int index = !!(off % sizeof(u64));
>> -               syscall_get_arguments(current, regs, arg, 1, &value);
>> -               return get_u32(value, index);
>> -       }
>> -       if (off == BPF_DATA(instruction_pointer))
>> -               return get_u32(KSTK_EIP(current), 0);
>> -       if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
>> -               return get_u32(KSTK_EIP(current), 1);
>> -       /* seccomp_check_filter should make this impossible. */
>> -       BUG();
>> +       int i;
>> +
>> +       sd->nr = syscall_get_nr(current, regs);
>> +       sd->arch = syscall_get_arch(current, regs);
>> +       for (i = 0; i < 6; i++)
>> +               syscall_get_arguments(current, regs, i, 1,
>> +                                     (unsigned long *)&sd->args[i]);
>> +       sd->instruction_pointer = KSTK_EIP(current);
>>  }
>>
>>  /**
>> @@ -133,17 +100,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>
>>                 switch (code) {
>>                 case BPF_S_LD_W_ABS:
>> -                       ftest->code = BPF_S_ANC_SECCOMP_LD_W;
>> +                       ftest->code = BPF_LDX | BPF_W | BPF_ABS;

So, with this change, the removal of BPF_S_ANC_SECCOMP_LD_W, and the
unconditional use of populate_seccomp_data(), I'm surprised there
isn't a larger hit on small filters. Currently, seccomp only loads
what it needs into the filter (via BPF_S_ANC_SECCOMP_LD_W + offset via
seccomp_bpf_load). Your benchmarks seem to show that this isn't a
problem, though. Is the ebpf gain just that much larger than the
additional unconditional loads happening in populate_seccomp_data()?

>>                         /* 32-bit aligned and not out of bounds. */
>>                         if (k >= sizeof(struct seccomp_data) || k & 3)
>>                                 return -EINVAL;
>>                         continue;
>>                 case BPF_S_LD_W_LEN:
>> -                       ftest->code = BPF_S_LD_IMM;
>> +                       ftest->code = BPF_LD | BPF_IMM;
>>                         ftest->k = sizeof(struct seccomp_data);
>>                         continue;
>>                 case BPF_S_LDX_W_LEN:
>> -                       ftest->code = BPF_S_LDX_IMM;
>> +                       ftest->code = BPF_LDX | BPF_IMM;
>>                         ftest->k = sizeof(struct seccomp_data);
>>                         continue;
>>                 /* Explicitly include allowed calls. */
>> @@ -185,6 +152,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>                 case BPF_S_JMP_JGT_X:
>>                 case BPF_S_JMP_JSET_K:
>>                 case BPF_S_JMP_JSET_X:
>> +                       sk_decode_filter(ftest, ftest);
>>                         continue;
>>                 default:
>>                         return -EINVAL;
>> @@ -202,18 +170,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>  static u32 seccomp_run_filters(int syscall)
>>  {
>>         struct seccomp_filter *f;
>> +       struct seccomp_data sd;
>>         u32 ret = SECCOMP_RET_ALLOW;
>>
>>         /* Ensure unexpected behavior doesn't result in failing open. */
>>         if (WARN_ON(current->seccomp.filter == NULL))
>>                 return SECCOMP_RET_KILL;
>>
>> +       populate_seccomp_data(&sd);
>> +
>>         /*
>>          * All filters in the list are evaluated and the lowest BPF return
>>          * value always takes priority (ignoring the DATA).
>>          */
>>         for (f = current->seccomp.filter; f; f = f->prev) {
>> -               u32 cur_ret = sk_run_filter(NULL, f->insns);
>> +               u32 cur_ret = sk_run_filter_ext(&sd, f->insns);
>>                 if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>>                         ret = cur_ret;
>>         }
>> @@ -231,6 +202,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>         struct seccomp_filter *filter;
>>         unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>>         unsigned long total_insns = fprog->len;
>> +       struct sock_filter *fp;
>> +       int new_len;
>>         long ret;
>>
>>         if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> @@ -252,28 +225,42 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>                                      CAP_SYS_ADMIN) != 0)
>>                 return -EACCES;
>>
>> -       /* Allocate a new seccomp_filter */
>> -       filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
>> -                        GFP_KERNEL|__GFP_NOWARN);
>> -       if (!filter)
>> +       fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
>> +       if (!fp)
>>                 return -ENOMEM;
>> -       atomic_set(&filter->usage, 1);
>> -       filter->len = fprog->len;
>>
>>         /* Copy the instructions from fprog. */
>>         ret = -EFAULT;
>> -       if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> -               goto fail;
>> +       if (copy_from_user(fp, fprog->filter, fp_size))
>> +               goto free_prog;
>>
>>         /* Check and rewrite the fprog via the skb checker */
>> -       ret = sk_chk_filter(filter->insns, filter->len);
>> +       ret = sk_chk_filter(fp, fprog->len);
>>         if (ret)
>> -               goto fail;
>> +               goto free_prog;
>>
>>         /* Check and rewrite the fprog for seccomp use */
>> -       ret = seccomp_check_filter(filter->insns, filter->len);
>> +       ret = seccomp_check_filter(fp, fprog->len);
>>         if (ret)
>> -               goto fail;
>> +               goto free_prog;
>> +
>> +       /* convert 'sock_filter' insns to 'sock_filter_ext' insns */
>> +       ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
>> +       if (ret)
>> +               goto free_prog;
>> +
>> +       /* Allocate a new seccomp_filter */
>> +       filter = kzalloc(sizeof(struct seccomp_filter) +
>> +                        sizeof(struct sock_filter_ext) * new_len,
>> +                        GFP_KERNEL|__GFP_NOWARN);
>> +       if (!filter)
>> +               goto free_prog;
>> +
>> +       ret = sk_convert_filter(fp, fprog->len, filter->insns, &new_len);

Seems like it'd be more readable to set filter->len = new_len first
and use &filter->len as the last argument here. I would find that more
readable, but if nothing else needs changing in the series, this is
fine to leave as-is.

>> +       if (ret)
>> +               goto free_filter;
>> +       atomic_set(&filter->usage, 1);
>> +       filter->len = new_len;
>>
>>         /*
>>          * If there is an existing filter, make it the prev and don't drop its
>> @@ -282,8 +269,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>         filter->prev = current->seccomp.filter;
>>         current->seccomp.filter = filter;
>>         return 0;
>> -fail:
>> +
>> +free_filter:
>>         kfree(filter);
>> +free_prog:
>> +       kfree(fp);
>>         return ret;
>>  }
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1494421486b7..f144a6a7d939 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -388,11 +388,6 @@ load_b:
>>                                 A = 0;
>>                         continue;
>>                 }
>> -#ifdef CONFIG_SECCOMP_FILTER
>> -               case BPF_S_ANC_SECCOMP_LD_W:
>> -                       A = seccomp_bpf_load(fentry->k);
>> -                       continue;
>> -#endif
>>                 default:
>>                         WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>>                                        fentry->code, fentry->jt,
>> --
>> 1.7.9.5
>>

Cool!

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF
  2014-03-05 21:42     ` Kees Cook
@ 2014-03-06  2:00       ` Alexei Starovoitov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-06  2:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Daniel Borkmann, David S. Miller, Ingo Molnar, Will Drewry,
	Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Hagen Paul Pfeifer, Jesse Gross, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, LKML, netdev

On Wed, Mar 5, 2014 at 1:42 PM, Kees Cook <keescook@chromium.org> wrote:
> On Tue, Mar 4, 2014 at 7:11 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Tue, Mar 4, 2014 at 2:17 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> use sk_convert_filter() to convert seccomp BPF into extended BPF
>>>
>>> 05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
>>>   seccomp_rule_add_exact(ctx,...
>>>   seccomp_rule_add_exact(ctx,...
>>>   rc = seccomp_load(ctx);
>>>   for (i = 0; i < 10000000; i++)
>>>      syscall(199, 100);
>>>
>>> --x86_64--
>>> old BPF: 8.6 seconds
>>>     73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
>>>     10.70%    bench  libc-2.15.so       [.] syscall
>>>      5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
>>>      1.97%    bench  [kernel.kallsyms]  [k] system_call
>>> ext BPF: 5.7 seconds
>>>     66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_ext
>>>     16.75%    bench  libc-2.15.so       [.] syscall
>>>      3.31%    bench  [kernel.kallsyms]  [k] system_call
>>>      2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
>>>
>>> --i386---
>>> old BPF: 5.4 sec
>>> ext BPF: 3.8 sec
>>>
>>> BPF filters generated by seccomp are very branchy, so ext BPF
>>> performance is better than old BPF.
>>>
>>> Performance gains will be even higher when extended BPF JIT
>>> is committed.
>
> Very cool! Have you had a chance to compare on ARM too?

tried arm and was surprised to see 7% regression. grr
turned out gcc wasn't unrolling this loop:
       for (i = 0; i < 6; i++)
               syscall_get_arguments(current, regs, i, 1, ...);
which was causing memcpy() of 4 bytes to be called in the hot path
for my micro-benchmark.
So have to manually unroll it.

Performance is the following:
--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_ext
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_ext
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

So will resend the V6 for this patch set.

>
>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>>
>>> ---
>>> This patch is an RFC to use extended BPF in seccomp.
>>> Change it to do it conditionally with bpf_ext_enable knob ?
>>
>> Kees, Will,
>>
>> I've played with libseccomp to test this patch and just found
>> your other testsuite for bpf+seccomp:
>> https://github.com/redpig/seccomp
>> It passes as well on x86_64 and i386.
>
> Great! If it's passing those tests, then things should be in good shape.
>
>> Not sure how real all 'seccomp' and libseccomp' bpf filters.
>> Some of them are very short, some very long.
>> If average number of filtered syscalls is > 10, then upcoming
>> 'ebpf table' will give another performance boost, since single table
>> lookup will be faster than long sequence of 'if' conditions.
>
> To take advantage of that, will seccomp need a new (prctl) interface
> to pass in a seccomp ebpf?

I hope not.
I think it's better to have inband signaling for ext vs old program.
Like 'struct sock_fprog -> len' must be < 4096 for old programs.
if len == 4096, this can mean that this is extended bpf program in a new format.
This way prctl() for seccomp and socket attach mechanisms can stay the same.

>> Please review.
>>
>> Thanks
>> Alexei
>>
>>> ---
>>>  include/linux/seccomp.h |    1 -
>>>  kernel/seccomp.c        |  112 +++++++++++++++++++++--------------------------
>>>  net/core/filter.c       |    5 ---
>>>  3 files changed, 51 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>>> index 6f19cfd1840e..4054b0994071 100644
>>> --- a/include/linux/seccomp.h
>>> +++ b/include/linux/seccomp.h
>>> @@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
>>>  #ifdef CONFIG_SECCOMP_FILTER
>>>  extern void put_seccomp_filter(struct task_struct *tsk);
>>>  extern void get_seccomp_filter(struct task_struct *tsk);
>>> -extern u32 seccomp_bpf_load(int off);
>>>  #else  /* CONFIG_SECCOMP_FILTER */
>>>  static inline void put_seccomp_filter(struct task_struct *tsk)
>>>  {
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index b7a10048a32c..346c597b44cc 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -55,60 +55,27 @@ struct seccomp_filter {
>>>         atomic_t usage;
>>>         struct seccomp_filter *prev;
>>>         unsigned short len;  /* Instruction count */
>>> -       struct sock_filter insns[];
>>> +       struct sock_filter_ext insns[];
>>>  };
>>>
>>>  /* Limit any path through the tree to 256KB worth of instructions. */
>>>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>>>
>>> -/**
>>> - * get_u32 - returns a u32 offset into data
>>> - * @data: a unsigned 64 bit value
>>> - * @index: 0 or 1 to return the first or second 32-bits
>>> - *
>>> - * This inline exists to hide the length of unsigned long.  If a 32-bit
>>> - * unsigned long is passed in, it will be extended and the top 32-bits will be
>>> - * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
>>> - * properly returned.
>>> - *
>>> +/*
>>>   * Endianness is explicitly ignored and left for BPF program authors to manage
>>>   * as per the specific architecture.
>>>   */
>>> -static inline u32 get_u32(u64 data, int index)
>>> -{
>>> -       return ((u32 *)&data)[index];
>>> -}
>>> -
>>> -/* Helper for bpf_load below. */
>>> -#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
>>> -/**
>>> - * bpf_load: checks and returns a pointer to the requested offset
>>> - * @off: offset into struct seccomp_data to load from
>>> - *
>>> - * Returns the requested 32-bits of data.
>>> - * seccomp_check_filter() should assure that @off is 32-bit aligned
>>> - * and not out of bounds.  Failure to do so is a BUG.
>>> - */
>>> -u32 seccomp_bpf_load(int off)
>>> +static void populate_seccomp_data(struct seccomp_data *sd)
>>>  {
>>>         struct pt_regs *regs = task_pt_regs(current);
>>> -       if (off == BPF_DATA(nr))
>>> -               return syscall_get_nr(current, regs);
>>> -       if (off == BPF_DATA(arch))
>>> -               return syscall_get_arch(current, regs);
>>> -       if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
>>> -               unsigned long value;
>>> -               int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
>>> -               int index = !!(off % sizeof(u64));
>>> -               syscall_get_arguments(current, regs, arg, 1, &value);
>>> -               return get_u32(value, index);
>>> -       }
>>> -       if (off == BPF_DATA(instruction_pointer))
>>> -               return get_u32(KSTK_EIP(current), 0);
>>> -       if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
>>> -               return get_u32(KSTK_EIP(current), 1);
>>> -       /* seccomp_check_filter should make this impossible. */
>>> -       BUG();
>>> +       int i;
>>> +
>>> +       sd->nr = syscall_get_nr(current, regs);
>>> +       sd->arch = syscall_get_arch(current, regs);
>>> +       for (i = 0; i < 6; i++)
>>> +               syscall_get_arguments(current, regs, i, 1,
>>> +                                     (unsigned long *)&sd->args[i]);
>>> +       sd->instruction_pointer = KSTK_EIP(current);
>>>  }
>>>
>>>  /**
>>> @@ -133,17 +100,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>>
>>>                 switch (code) {
>>>                 case BPF_S_LD_W_ABS:
>>> -                       ftest->code = BPF_S_ANC_SECCOMP_LD_W;
>>> +                       ftest->code = BPF_LDX | BPF_W | BPF_ABS;
>
> So, with this change, the removal of BPF_S_ANC_SECCOMP_LD_W, and the
> unconditional use of populate_seccomp_data(), I'm surprised there
> isn't a larger hit on small filters. Currently, seccomp only loads
> what it needs into the filter (via BPF_S_ANC_SECCOMP_LD_W + offset via
> seccomp_bpf_load). Your benchmarks seem to show that this isn't a
> problem, though. Is the ebpf gain just that much larger than the
> additional unconditional loads happening in populate_seccomp_data()?

on arm there is a small gain for short programs.

for x86_64 there is also small gain:

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.49 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_ext

obviously most of the time is spent in 'syscal' instruction in
syscall() function
but having populate_seccomp_data() is actually beneficial,
since 'current' and 'pt_regs' are hot in cache anyway,
so copying them into seccomp_data is cheaper than extra 'if' conditions
and extra calls from within the filter.
faster interpreter helps too.

>>>                         /* 32-bit aligned and not out of bounds. */
>>>                         if (k >= sizeof(struct seccomp_data) || k & 3)
>>>                                 return -EINVAL;
>>>                         continue;
>>>                 case BPF_S_LD_W_LEN:
>>> -                       ftest->code = BPF_S_LD_IMM;
>>> +                       ftest->code = BPF_LD | BPF_IMM;
>>>                         ftest->k = sizeof(struct seccomp_data);
>>>                         continue;
>>>                 case BPF_S_LDX_W_LEN:
>>> -                       ftest->code = BPF_S_LDX_IMM;
>>> +                       ftest->code = BPF_LDX | BPF_IMM;
>>>                         ftest->k = sizeof(struct seccomp_data);
>>>                         continue;
>>>                 /* Explicitly include allowed calls. */
>>> @@ -185,6 +152,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>>                 case BPF_S_JMP_JGT_X:
>>>                 case BPF_S_JMP_JSET_K:
>>>                 case BPF_S_JMP_JSET_X:
>>> +                       sk_decode_filter(ftest, ftest);
>>>                         continue;
>>>                 default:
>>>                         return -EINVAL;
>>> @@ -202,18 +170,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>>>  static u32 seccomp_run_filters(int syscall)
>>>  {
>>>         struct seccomp_filter *f;
>>> +       struct seccomp_data sd;
>>>         u32 ret = SECCOMP_RET_ALLOW;
>>>
>>>         /* Ensure unexpected behavior doesn't result in failing open. */
>>>         if (WARN_ON(current->seccomp.filter == NULL))
>>>                 return SECCOMP_RET_KILL;
>>>
>>> +       populate_seccomp_data(&sd);
>>> +
>>>         /*
>>>          * All filters in the list are evaluated and the lowest BPF return
>>>          * value always takes priority (ignoring the DATA).
>>>          */
>>>         for (f = current->seccomp.filter; f; f = f->prev) {
>>> -               u32 cur_ret = sk_run_filter(NULL, f->insns);
>>> +               u32 cur_ret = sk_run_filter_ext(&sd, f->insns);
>>>                 if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
>>>                         ret = cur_ret;
>>>         }
>>> @@ -231,6 +202,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>>         struct seccomp_filter *filter;
>>>         unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>>>         unsigned long total_insns = fprog->len;
>>> +       struct sock_filter *fp;
>>> +       int new_len;
>>>         long ret;
>>>
>>>         if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>>> @@ -252,28 +225,42 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>>                                      CAP_SYS_ADMIN) != 0)
>>>                 return -EACCES;
>>>
>>> -       /* Allocate a new seccomp_filter */
>>> -       filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
>>> -                        GFP_KERNEL|__GFP_NOWARN);
>>> -       if (!filter)
>>> +       fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
>>> +       if (!fp)
>>>                 return -ENOMEM;
>>> -       atomic_set(&filter->usage, 1);
>>> -       filter->len = fprog->len;
>>>
>>>         /* Copy the instructions from fprog. */
>>>         ret = -EFAULT;
>>> -       if (copy_from_user(filter->insns, fprog->filter, fp_size))
>>> -               goto fail;
>>> +       if (copy_from_user(fp, fprog->filter, fp_size))
>>> +               goto free_prog;
>>>
>>>         /* Check and rewrite the fprog via the skb checker */
>>> -       ret = sk_chk_filter(filter->insns, filter->len);
>>> +       ret = sk_chk_filter(fp, fprog->len);
>>>         if (ret)
>>> -               goto fail;
>>> +               goto free_prog;
>>>
>>>         /* Check and rewrite the fprog for seccomp use */
>>> -       ret = seccomp_check_filter(filter->insns, filter->len);
>>> +       ret = seccomp_check_filter(fp, fprog->len);
>>>         if (ret)
>>> -               goto fail;
>>> +               goto free_prog;
>>> +
>>> +       /* convert 'sock_filter' insns to 'sock_filter_ext' insns */
>>> +       ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
>>> +       if (ret)
>>> +               goto free_prog;
>>> +
>>> +       /* Allocate a new seccomp_filter */
>>> +       filter = kzalloc(sizeof(struct seccomp_filter) +
>>> +                        sizeof(struct sock_filter_ext) * new_len,
>>> +                        GFP_KERNEL|__GFP_NOWARN);
>>> +       if (!filter)
>>> +               goto free_prog;
>>> +
>>> +       ret = sk_convert_filter(fp, fprog->len, filter->insns, &new_len);
>
> Seems like it'd be more readable to set filter->len = new_len first
> and use &filter->len as the last argument here. I would find that more
> readable, but if nothing else needs changing in the series, this is
> fine to leave as-is.

cannot do that, since filter->len is 'short', but new_len and
sk_convert_filter()
expects 'int'

>>> +       if (ret)
>>> +               goto free_filter;
>>> +       atomic_set(&filter->usage, 1);
>>> +       filter->len = new_len;
>>>
>>>         /*
>>>          * If there is an existing filter, make it the prev and don't drop its
>>> @@ -282,8 +269,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
>>>         filter->prev = current->seccomp.filter;
>>>         current->seccomp.filter = filter;
>>>         return 0;
>>> -fail:
>>> +
>>> +free_filter:
>>>         kfree(filter);
>>> +free_prog:
>>> +       kfree(fp);
>>>         return ret;
>>>  }
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 1494421486b7..f144a6a7d939 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -388,11 +388,6 @@ load_b:
>>>                                 A = 0;
>>>                         continue;
>>>                 }
>>> -#ifdef CONFIG_SECCOMP_FILTER
>>> -               case BPF_S_ANC_SECCOMP_LD_W:
>>> -                       A = seccomp_bpf_load(fentry->k);
>>> -                       continue;
>>> -#endif
>>>                 default:
>>>                         WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>>>                                        fentry->code, fentry->jt,
>>> --
>>> 1.7.9.5
>>>
>
> Cool!
>
> Reviewed-by: Kees Cook <keescook@chromium.org>

Thank you for review!

Will send V6 with unrolled loop,
fix empty new line at the end of the file caught by Daniel
and update commit log with arm seccomp data.

Thanks!
Alexei

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation
  2014-03-04 22:17 [PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 1/3] " Alexei Starovoitov
  2014-03-04 22:17 ` [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF Alexei Starovoitov
@ 2014-03-04 22:17 ` Alexei Starovoitov
  2014-03-05  9:25   ` Daniel Borkmann
  2 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2014-03-04 22:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |  181 +++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..c3f687bf8e82 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,186 @@ ffffffffa0069c8f + <x>:
 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
 toolchain for developing and testing the kernel's JIT compiler.
 
+Extended BPF
+------------
+Extended BPF extends BPF in the following ways:
+- from 2 to 10 registers
+  Original BPF has two registers (A and X) and hidden frame pointer.
+  Extended BPF has ten registers and read-only frame pointer.
+- from 32-bit registers to 64-bit registers
+  semantics of old 32-bit ALU operations are preserved via 32-bit
+  subregisters
+- if (cond) jump_true; else jump_false;
+  old BPF insns are replaced with:
+  if (cond) jump_true; /* else fallthrough */
+- adds signed > and >= insns
+- 16 4-byte stack slots for register spill-fill replaced with
+  up to 512 bytes of multi-use stack space
+- introduces bpf_call insn and register passing convention for zero
+  overhead calls from/to other kernel functions (not part of this patch)
+- adds arithmetic right shift insn
+- adds swab32/swab64 insns
+- adds atomic_add insn
+- old tax/txa insns are replaced with 'mov dst,src' insn
+
+Extended BPF is designed to be JITed with one to one mapping, which
+allows GCC/LLVM compilers to generate optimized BPF code that performs
+almost as fast as natively compiled code
+
+sysctl net.core.bpf_ext_enable=1
+controls whether filters attached to sockets will be automatically
+converted to extended BPF or not.
+
+BPF is safe dynamically loadable program that can call fixed set
+of kernel functions and takes a pointer to data as an input,
+where data is skb, seccomp_data, kprobe function arguments or else.
+
+Extended Instruction Set was designed with these goals:
+- write programs in restricted C and compile into BPF with GCC/LLVM
+- just-in-time map to modern 64-bit CPU with minimal performance overhead
+  over two steps: C -> BPF -> native code
+- guarantee termination and safety of BPF program in kernel
+  with simple algorithm
+
+GCC/LLVM-bpf backend is optional.
+Extended BPF can be coded with macroses from filter.h just like original BPF,
+though the same filter done in C is easier to understand.
+sk_convert_filter() remaps original BPF insns into extended.
+
+Minimal performance overhead is achieved by having one to one mapping
+between BPF insns and native insns, and one to one mapping between BPF
+registers and native registers on 64-bit CPUs
+
+Extended BPF may allow jump forward and backward for two reasons:
+to reduce branch mispredict penalty compiler moves cold basic blocks out of
+fall-through path and to reduce code duplication that would be hard to avoid
+if only jump forward was available.
+To guarantee termination simple non-recursive depth-first-search verifies
+that there are no back-edges (no loops in the program), program is a DAG
+with root at the first insn, all branches end at the last RET insn and
+all instructions are reachable.
+Original BPF actually allows unreachable insns. Though it's safe, it will be
+fixed when extended BPF replaces BPF completely.
+
+Original BPF has two registers (A and X) and hidden frame pointer.
+Extended BPF has ten registers and read-only frame pointer.
+Since 64-bit CPUs are passing arguments to the functions via registers
+the number of args from BPF program to in-kernel function is restricted to 5
+and one register is used to accept return value from in-kernel function.
+x86_64 passes first 6 arguments in registers.
+aarch64/sparcv9/mips64 have 7-8 registers for arguments.
+x86_64 has 6 callee saved registers.
+aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+Therefore extended BPF calling convention is defined as:
+R0 - return value from in-kernel function
+R1-R5 - arguments from BPF program to in-kernel function
+R6-R9 - callee saved registers that in-kernel function will preserve
+R10 - read-only frame pointer to access stack
+
+so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc
+and BPF calling convention maps directly to ABIs used by kernel on 64-bit
+architectures.
+On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+and let more complex programs to be interpreted.
+
+R0-R5 are scratch registers and BPF program needs spill/fill them if necessary
+across calls.
+Note that there is only one BPF program == one BPF function and it cannot call
+other BPF functions. It can only call predefined in-kernel functions.
+
+All BPF registers are 64-bit with 32-bit lower subregister that zero-extends
+into 64-bit if written to. That behavior maps directly to x86_64 and arm64
+subregister defintion, but makes other JITs more difficult.
+
+Original BPF and extended BPF are two operand instructions, which helps
+to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+Extended BPF doesn't have pre-defined endianness not to favor one
+architecture vs another. Therefore bswap insn is available.
+Original BPF doesn't have such insn and does bswap as part of sk_load_word call
+which is often unnecessary if we want to compare the value with the constant.
+Restricted C code might be written differently depending on endianness
+and GCC/LLVM-bpf will take an endianness flag.
+
+32-bit architectures run 64-bit extended BPF programs via interpreter.
+Their JITs may convert BPF programs that only use 32-bit subregs into native
+instruction set and let the rest being interpreted.
+
+Extended BPF is 64-bit, because on 64-bit architectures, pointers are 64-bit
+and we want to pass 64-bit values in/out kernel functions, so 32-bit BPF
+registers would require to define register-pair ABI, there won't be a direct
+BPF register to HW register mapping and JIT would need to do
+combine/split/move operations for every register in and out of the function,
+which is complex, bug prone and slow.
+Another reason is atomic 64-bit counters
+
+Just like original BPF, extended BPF is safe, deterministic and kernel can
+easily prove that. The safety of the program is determined in two steps.
+First step does depth-first-search to disallow loops and other CFG validation.
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has
+now type PTR_TO_CTX as well and can be used on right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable.
+If register was never written to, it's not readable.
+After kernel function call, R1-R5 are reset to unreadable and R0 has a return
+type of the function. Since R6-R9 are callee saved, their state is preserved
+across the call.
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alginment
+checked.
+
+Input context pointer is generic. Its contents are defined by specific use case.
+For seccomp R1 points to seccomp_data
+For converted BPF filters R1 points to skb
+Through get_context_access callback BPF checker is customized, so that BPF
+program can only access certain fields of input context with specified size
+and alignment.
+For example, the following insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+intends to load word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, then get_context_access callback should let the checker know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker
+will reject the program.
+
+If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds,
+which are hard coded to [-480, 0]. In this example offset is 8, so it will fail
+verification.
+The checker will allow BPF program to read data from stack only after it wrote
+into it.
+Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved
+registers may not be enough for some programs.
+
+Allowed function calls are customized via get_func_proto callback.
+
+One of the useful functions that can be made available to BPF program
+are bpf_table_lookup/bpf_table_update.
+They can help tracing filters collect different types of statistics.
+Example: pc addresses for drop_monitor filter
+
+In seccomp and socket filter use cases extended BPF program consists
+of intructions only, but for tracing filters case BPF program may contain
+BPF tables as well.
+There are no special instructions to access BPF tables. The access is done
+via function calls.
+
+BPF program identifies the table by table_id and accesses it in C like:
+elem = bpf_table_lookup(ctx, table_id, key);
+
+BPF checker matches 'table_id' against known tables, verifies that 'key' points
+to stack and table->key_size bytes are initialized.
+bpf_table_lookup() is a normal kernel function. It needs to do a lookup and
+return either valid pointer to the element or NULL.
+BPF checker will verify that the program accesses the pointer only
+after comparing it to NULL.
+It's up to implementation to decide how lookup is done and meaning of the key.
+
+Just like original, extended BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will call fixed number of kernel functions.
+
 Misc
 ----
 
@@ -561,3 +741,4 @@ the underlying architecture.
 
 Jay Schulist <jschlst@samba.org>
 Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation
  2014-03-04 22:17 ` [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation Alexei Starovoitov
@ 2014-03-05  9:25   ` Daniel Borkmann
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Borkmann @ 2014-03-05  9:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Will Drewry, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Hagen Paul Pfeifer, Jesse Gross,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev

On 03/04/2014 11:17 PM, Alexei Starovoitov wrote:
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Reviewed-by: Daniel Borkmann <dborkman@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-03-06  2:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-04 22:17 [PATCH v5 net-next 0/3] filter: add Extended BPF interpreter and converter Alexei Starovoitov
2014-03-04 22:17 ` [PATCH v5 net-next 1/3] " Alexei Starovoitov
2014-03-05  9:24   ` Daniel Borkmann
2014-03-05 18:13     ` Alexei Starovoitov
2014-03-04 22:17 ` [PATCH v5 net-next 2/3] [RFC] seccomp: convert seccomp to use extended BPF Alexei Starovoitov
2014-03-05  3:11   ` Alexei Starovoitov
2014-03-05 21:42     ` Kees Cook
2014-03-06  2:00       ` Alexei Starovoitov
2014-03-04 22:17 ` [PATCH v5 net-next 3/3] doc: filter: add Extended BPF documentation Alexei Starovoitov
2014-03-05  9:25   ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).