linux-api.vger.kernel.org archive mirror
* [PATCH v6 net-next 0/6] introduce BPF syscall
@ 2014-08-26  1:00 Alexei Starovoitov
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

Hi All,

Splitting the big set of patches into smaller sets:

1st (this) set - introduces uapi/linux/bpf.h and the BPF syscall for maps only
2nd set will extend the BPF syscall with programs and a verifier
3rd set will use eBPF in tracing and add samples and verifier tests
4th set will have llvm and C examples

Tested on x64 and i386.
The BPF syscall with maps only is usable, but not very useful until the rest
of the series lands.
Build/boot tested on arm and sparc, which now print the new warning
'syscall bpf not implemented'.

V5->V6:
- fixed a few checkpatch warnings.
  A few lines are still over 80 characters, but fixing them would reduce
  readability
- rebased

V5 thread:
https://lkml.org/lkml/2014/8/24/107

All patches:
git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

Alexei Starovoitov (6):
  net: filter: add "load 64-bit immediate" eBPF instruction
  net: filter: split filter.h and expose eBPF to user space
  bpf: introduce syscall(BPF, ...) and BPF maps
  bpf: enable bpf syscall on x64 and i386
  bpf: add lookup/update/delete/iterate methods to BPF maps
  bpf: add hashtable type of BPF maps

 Documentation/networking/filter.txt |   79 +++++++-
 arch/x86/net/bpf_jit_comp.c         |   17 ++
 arch/x86/syscalls/syscall_32.tbl    |    1 +
 arch/x86/syscalls/syscall_64.tbl    |    1 +
 include/linux/bpf.h                 |   50 +++++
 include/linux/filter.h              |  294 +--------------------------
 include/linux/syscalls.h            |    3 +-
 include/uapi/asm-generic/unistd.h   |    4 +-
 include/uapi/linux/Kbuild           |    1 +
 include/uapi/linux/bpf.h            |  371 ++++++++++++++++++++++++++++++++++
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/core.c                   |    5 +
 kernel/bpf/hashtab.c                |  373 +++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c                |  354 +++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                     |    3 +
 lib/test_bpf.c                      |   21 ++
 16 files changed, 1282 insertions(+), 297 deletions(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 include/uapi/linux/bpf.h
 create mode 100644 kernel/bpf/hashtab.c
 create mode 100644 kernel/bpf/syscall.c

-- 
1.7.9.5


* [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

Add a BPF_LD_IMM64 instruction to load a 64-bit immediate value into a register.
All previous instructions were 8 bytes; this is the first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32 bits
insn[1].code = 0
insn[1].imm = upper 32 bits
All unused fields must be zero.

Classic BPF has a similar instruction, BPF_LD | BPF_W | BPF_IMM,
which loads a 32-bit immediate value into a register.
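
For illustration, a minimal sketch of the new encoding, mirroring the
BPF_LD_IMM64_RAW macro added below; loading 0x567800001234 into R1 takes
two 8-byte blocks:

	struct bpf_insn insns[2] = {
		{ .code  = BPF_LD | BPF_DW | BPF_IMM,
		  .dst_reg = BPF_REG_1,
		  .imm   = (__u32) 0x567800001234ULL },	/* lower 32 bits */
		{ .code  = 0,		/* zero is the reserved opcode */
		  .imm   = 0x567800001234ULL >> 32 },	/* upper 32 bits */
	};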

x64 JITs it as a single 'movabsq %rax, imm64'.
arm64 may JIT it as a sequence of four 'movk x0, #imm16, lsl #shift' insns.

Note that old eBPF programs remain binary compatible with the new interpreter.

In the following patches this instruction is used to store eBPF map pointers:
 BPF_LD_IMM64(R1, const_imm_map_ptr)
 BPF_CALL(BPF_FUNC_map_lookup_elem)
and a verifier is introduced to check the validity of the programs.

Later, the LLVM compiler uses this insn as a generic load of a 64-bit immediate
constant and as a load of a map pointer with relocation.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |    8 +++++++-
 arch/x86/net/bpf_jit_comp.c         |   17 +++++++++++++++++
 include/linux/filter.h              |   18 ++++++++++++++++++
 kernel/bpf/core.c                   |    5 +++++
 lib/test_bpf.c                      |   21 +++++++++++++++++++++
 5 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index c48a9704bda8..81916ab5d96f 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -951,7 +951,7 @@ Size modifier is one of ...
 
 Mode modifier is one of:
 
-  BPF_IMM  0x00  /* classic BPF only, reserved in eBPF */
+  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
   BPF_ABS  0x20
   BPF_IND  0x40
   BPF_MEM  0x60
@@ -995,6 +995,12 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
 2 byte atomic increments are not supported.
 
+eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
+of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single
+instruction that loads 64-bit immediate value into a dst_reg.
+Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
+32-bit immediate value into a register.
+
 Testing
 -------
 
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index b08a98c59530..98837147ee57 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -393,6 +393,23 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 			EMIT1_off32(add_1reg(0xB8, dst_reg), imm32);
 			break;
 
+		case BPF_LD | BPF_IMM | BPF_DW:
+			if (insn[1].code != 0 || insn[1].src_reg != 0 ||
+			    insn[1].dst_reg != 0 || insn[1].off != 0) {
+				/* verifier must catch invalid insns */
+				pr_err("invalid BPF_LD_IMM64 insn\n");
+				return -EINVAL;
+			}
+
+			/* movabsq %rax, imm64 */
+			EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg));
+			EMIT(insn[0].imm, 4);
+			EMIT(insn[1].imm, 4);
+
+			insn++;
+			i++;
+			break;
+
 			/* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */
 		case BPF_ALU | BPF_MOD | BPF_X:
 		case BPF_ALU | BPF_DIV | BPF_X:
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5227ab8ccb1..f3262b598262 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -161,6 +161,24 @@ enum {
 		.off   = 0,					\
 		.imm   = IMM })
 
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
 /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
 
 #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7f0dbcbb34af..0434c2170f2b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -180,6 +180,7 @@ static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
 		[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
 		[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
 		[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+		[BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW,
 	};
 	void *ptr;
 	int off;
@@ -239,6 +240,10 @@ select_insn:
 	ALU64_MOV_K:
 		DST = IMM;
 		CONT;
+	LD_IMM_DW:
+		DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
+		insn++;
+		CONT;
 	ALU64_ARSH_X:
 		(*(s64 *) &DST) >>= SRC;
 		CONT;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 8c66c6aace04..46ab1a7ef135 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -1735,6 +1735,27 @@ static struct bpf_test tests[] = {
 		{ },
 		{ { 1, 0 } },
 	},
+	{
+		"load 64-bit immediate",
+		.u.insns_int = {
+			BPF_LD_IMM64(R1, 0x567800001234L),
+			BPF_MOV64_REG(R2, R1),
+			BPF_MOV64_REG(R3, R2),
+			BPF_ALU64_IMM(BPF_RSH, R2, 32),
+			BPF_ALU64_IMM(BPF_LSH, R3, 32),
+			BPF_ALU64_IMM(BPF_RSH, R3, 32),
+			BPF_ALU64_IMM(BPF_MOV, R0, 0),
+			BPF_JMP_IMM(BPF_JEQ, R2, 0x5678, 1),
+			BPF_EXIT_INSN(),
+			BPF_JMP_IMM(BPF_JEQ, R3, 0x1234, 1),
+			BPF_EXIT_INSN(),
+			BPF_ALU64_IMM(BPF_MOV, R0, 1),
+			BPF_EXIT_INSN(),
+		},
+		INTERNAL,
+		{ },
+		{ { 0, 1 } }
+	},
 };
 
 static struct net_device dev;
-- 
1.7.9.5


* [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

eBPF can be used from user space.

uapi/linux/bpf.h: eBPF instruction set definition

linux/filter.h: the rest

This patch only moves macro definitions, but practically it freezes the
existing eBPF instruction set; new instructions can still be added in the
future.

These eBPF definitions cannot go into uapi/linux/filter.h, since the names
may conflict with existing applications.
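
With these macros available to user space, programs can construct eBPF
instruction arrays directly. A minimal sketch; note that the classic
BPF_JMP/BPF_K-style field macros still come from the uapi <linux/filter.h>,
on which this header builds, and that loading such a program only becomes
possible with the later sets of this series:

	#include <linux/filter.h>
	#include <linux/bpf.h>

	/* r0 = 1; exit -- about the smallest well-formed eBPF program */
	struct bpf_insn prog[] = {
		BPF_MOV64_IMM(BPF_REG_0, 1),
		BPF_EXIT_INSN(),
	};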

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/filter.h    |  312 +------------------------------------------
 include/uapi/linux/Kbuild |    1 +
 include/uapi/linux/bpf.h  |  321 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 323 insertions(+), 311 deletions(-)
 create mode 100644 include/uapi/linux/bpf.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index f3262b598262..f04793474d16 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -9,322 +9,12 @@
 #include <linux/skbuff.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/filter.h>
-
-/* Internally used and optimized filter representation with extended
- * instruction set based on top of classic BPF.
- */
-
-/* instruction classes */
-#define BPF_ALU64	0x07	/* alu mode in double word width */
-
-/* ld/ldx fields */
-#define BPF_DW		0x18	/* double word */
-#define BPF_XADD	0xc0	/* exclusive add */
-
-/* alu/jmp fields */
-#define BPF_MOV		0xb0	/* mov reg to reg */
-#define BPF_ARSH	0xc0	/* sign extending arithmetic shift right */
-
-/* change endianness of a register */
-#define BPF_END		0xd0	/* flags for endianness conversion: */
-#define BPF_TO_LE	0x00	/* convert to little-endian */
-#define BPF_TO_BE	0x08	/* convert to big-endian */
-#define BPF_FROM_LE	BPF_TO_LE
-#define BPF_FROM_BE	BPF_TO_BE
-
-#define BPF_JNE		0x50	/* jump != */
-#define BPF_JSGT	0x60	/* SGT is signed '>', GT in x86 */
-#define BPF_JSGE	0x70	/* SGE is signed '>=', GE in x86 */
-#define BPF_CALL	0x80	/* function call */
-#define BPF_EXIT	0x90	/* function return */
-
-/* Register numbers */
-enum {
-	BPF_REG_0 = 0,
-	BPF_REG_1,
-	BPF_REG_2,
-	BPF_REG_3,
-	BPF_REG_4,
-	BPF_REG_5,
-	BPF_REG_6,
-	BPF_REG_7,
-	BPF_REG_8,
-	BPF_REG_9,
-	BPF_REG_10,
-	__MAX_BPF_REG,
-};
-
-/* BPF has 10 general purpose 64-bit registers and stack frame. */
-#define MAX_BPF_REG	__MAX_BPF_REG
-
-/* ArgX, context and stack frame pointer register positions. Note,
- * Arg1, Arg2, Arg3, etc are used as argument mappings of function
- * calls in BPF_CALL instruction.
- */
-#define BPF_REG_ARG1	BPF_REG_1
-#define BPF_REG_ARG2	BPF_REG_2
-#define BPF_REG_ARG3	BPF_REG_3
-#define BPF_REG_ARG4	BPF_REG_4
-#define BPF_REG_ARG5	BPF_REG_5
-#define BPF_REG_CTX	BPF_REG_6
-#define BPF_REG_FP	BPF_REG_10
-
-/* Additional register mappings for converted user programs. */
-#define BPF_REG_A	BPF_REG_0
-#define BPF_REG_X	BPF_REG_7
-#define BPF_REG_TMP	BPF_REG_8
-
-/* BPF program can access up to 512 bytes of stack space. */
-#define MAX_BPF_STACK	512
-
-/* Helper macros for filter block array initializers. */
-
-/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
-
-#define BPF_ALU64_REG(OP, DST, SRC)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define BPF_ALU32_REG(OP, DST, SRC)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
-
-#define BPF_ALU64_IMM(OP, DST, IMM)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_ALU32_IMM(OP, DST, IMM)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
-
-#define BPF_ENDIAN(TYPE, DST, LEN)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = LEN })
-
-/* Short form of mov, dst_reg = src_reg */
-
-#define BPF_MOV64_REG(DST, SRC)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define BPF_MOV32_REG(DST, SRC)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-/* Short form of mov, dst_reg = imm32 */
-
-#define BPF_MOV64_IMM(DST, IMM)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_MOV32_IMM(DST, IMM)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
-#define BPF_LD_IMM64(DST, IMM)					\
-	BPF_LD_IMM64_RAW(DST, 0, IMM)
-
-#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = (__u32) (IMM) }),			\
-	((struct bpf_insn) {					\
-		.code  = 0, /* zero is reserved opcode */	\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = ((__u64) (IMM)) >> 32 })
-
-/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
-
-#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
-	((struct bpf_insn) {					\
-		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
-
-#define BPF_LD_ABS(SIZE, IMM)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
-
-#define BPF_LD_IND(SIZE, SRC, IMM)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
-		.dst_reg = 0,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
-
-#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct bpf_insn) {					\
-		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
-
-#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct bpf_insn) {					\
-		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
-
-#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
-
-#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
-
-#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
-	((struct bpf_insn) {					\
-		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Function call */
-
-#define BPF_EMIT_CALL(FUNC)					\
-	((struct bpf_insn) {					\
-		.code  = BPF_JMP | BPF_CALL,			\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = ((FUNC) - __bpf_call_base) })
-
-/* Raw code statement block */
-
-#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
-	((struct bpf_insn) {					\
-		.code  = CODE,					\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Program exit */
-
-#define BPF_EXIT_INSN()						\
-	((struct bpf_insn) {					\
-		.code  = BPF_JMP | BPF_EXIT,			\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define bytes_to_bpf_size(bytes)				\
-({								\
-	int bpf_size = -EINVAL;					\
-								\
-	if (bytes == sizeof(u8))				\
-		bpf_size = BPF_B;				\
-	else if (bytes == sizeof(u16))				\
-		bpf_size = BPF_H;				\
-	else if (bytes == sizeof(u32))				\
-		bpf_size = BPF_W;				\
-	else if (bytes == sizeof(u64))				\
-		bpf_size = BPF_DW;				\
-								\
-	bpf_size;						\
-})
+#include <uapi/linux/bpf.h>
 
 /* Macro to invoke filter function. */
 #define SK_RUN_FILTER(filter, ctx) \
 	(*filter->prog->bpf_func)(ctx, filter->prog->insnsi)
 
-struct bpf_insn {
-	__u8	code;		/* opcode */
-	__u8	dst_reg:4;	/* dest register */
-	__u8	src_reg:4;	/* source register */
-	__s16	off;		/* signed offset */
-	__s32	imm;		/* signed immediate constant */
-};
-
 #ifdef CONFIG_COMPAT
 /* A struct sock_filter is architecture independent. */
 struct compat_sock_fprog {
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 24e9033f8b3f..fb3f7b675229 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -67,6 +67,7 @@ header-y += bfs_fs.h
 header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
+header-y += bpf.h
 header-y += bpqether.h
 header-y += bsg.h
 header-y += btrfs.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
new file mode 100644
index 000000000000..45a09b46c578
--- /dev/null
+++ b/include/uapi/linux/bpf.h
@@ -0,0 +1,321 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_H__
+#define _UAPI__LINUX_BPF_H__
+
+#include <linux/types.h>
+
+/* Extended instruction set based on top of classic BPF */
+
+/* instruction classes */
+#define BPF_ALU64	0x07	/* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW		0x18	/* double word */
+#define BPF_XADD	0xc0	/* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV		0xb0	/* mov reg to reg */
+#define BPF_ARSH	0xc0	/* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END		0xd0	/* flags for endianness conversion: */
+#define BPF_TO_LE	0x00	/* convert to little-endian */
+#define BPF_TO_BE	0x08	/* convert to big-endian */
+#define BPF_FROM_LE	BPF_TO_LE
+#define BPF_FROM_BE	BPF_TO_BE
+
+#define BPF_JNE		0x50	/* jump != */
+#define BPF_JSGT	0x60	/* SGT is signed '>', GT in x86 */
+#define BPF_JSGE	0x70	/* SGE is signed '>=', GE in x86 */
+#define BPF_CALL	0x80	/* function call */
+#define BPF_EXIT	0x90	/* function return */
+
+/* Register numbers */
+enum {
+	BPF_REG_0 = 0,
+	BPF_REG_1,
+	BPF_REG_2,
+	BPF_REG_3,
+	BPF_REG_4,
+	BPF_REG_5,
+	BPF_REG_6,
+	BPF_REG_7,
+	BPF_REG_8,
+	BPF_REG_9,
+	BPF_REG_10,
+	__MAX_BPF_REG,
+};
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG	__MAX_BPF_REG
+
+/* ArgX, context and stack frame pointer register positions. Note,
+ * Arg1, Arg2, Arg3, etc are used as argument mappings of function
+ * calls in BPF_CALL instruction.
+ */
+#define BPF_REG_ARG1	BPF_REG_1
+#define BPF_REG_ARG2	BPF_REG_2
+#define BPF_REG_ARG3	BPF_REG_3
+#define BPF_REG_ARG4	BPF_REG_4
+#define BPF_REG_ARG5	BPF_REG_5
+#define BPF_REG_CTX	BPF_REG_6
+#define BPF_REG_FP	BPF_REG_10
+
+/* Additional register mappings for converted user programs. */
+#define BPF_REG_A	BPF_REG_0
+#define BPF_REG_X	BPF_REG_7
+#define BPF_REG_TMP	BPF_REG_8
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK	512
+
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
+		.dst_reg = 0,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_CALL,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define bytes_to_bpf_size(bytes)				\
+({								\
+	int bpf_size = -EINVAL;					\
+								\
+	if (bytes == sizeof(u8))				\
+		bpf_size = BPF_B;				\
+	else if (bytes == sizeof(u16))				\
+		bpf_size = BPF_H;				\
+	else if (bytes == sizeof(u32))				\
+		bpf_size = BPF_W;				\
+	else if (bytes == sizeof(u64))				\
+		bpf_size = BPF_DW;				\
+								\
+	bpf_size;						\
+})
+
+struct bpf_insn {
+	__u8	code;		/* opcode */
+	__u8	dst_reg:4;	/* dest register */
+	__u8	src_reg:4;	/* source register */
+	__s16	off;		/* signed offset */
+	__s32	imm;		/* signed immediate constant */
+};
+
+#endif /* _UAPI__LINUX_BPF_H__ */
-- 
1.7.9.5


* [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

The BPF syscall is a demux for different BPF-related commands.

'maps' is a generic storage facility of different types for sharing data
between the kernel and user space.

The maps can be created from user space via the BPF syscall:
- create a map with a given type and attributes
  fd = bpf_map_create(map_type, struct nlattr *attr, int len)
  returns fd or negative error

- close(fd) deletes the map

The next patch allows user space programs to populate/read maps that eBPF
programs are concurrently updating.

Maps can have different types: hash, bloom filter, radix-tree, etc.

The map is defined by:
  . type
  . max number of elements
  . key size in bytes
  . value size in bytes

This patch establishes the core infrastructure for BPF maps.
The next patches implement the lookup/update methods and the hashtable type.
More map types can be added in the future.

The syscall uses a type-length-value style of passing arguments to stay
backwards compatible with future extensions to map attributes. Different map
types may use different attributes as well.
The concept of type-length-value is borrowed from netlink, but netlink itself
is not applicable here, since BPF programs and maps can be used in NET-less
configurations. (A user space sketch of the TLV interface follows.)
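
To illustrate the TLV interface, a user space sketch of BPF_MAP_CREATE.
The __NR_bpf number is only wired up in the next patch, BPF_MAP_TYPE_HASH
only arrives later in the series, and the wrapper name is illustrative:

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>
	#include <linux/netlink.h>	/* struct nlattr */

	/* illustrative wrapper, not part of this patch */
	static int bpf_create_map(enum bpf_map_type type, __u32 key_size,
				  __u32 value_size, __u32 max_entries)
	{
		struct {
			struct nlattr hdr;	/* type-length header */
			__u32 val;
		} attr[3] = {
			{ { sizeof(attr[0]), BPF_MAP_KEY_SIZE },    key_size },
			{ { sizeof(attr[0]), BPF_MAP_VALUE_SIZE },  value_size },
			{ { sizeof(attr[0]), BPF_MAP_MAX_ENTRIES }, max_entries },
		};

		/* arg5 must be zero; returns a new fd or a negative error */
		return syscall(__NR_bpf, BPF_MAP_CREATE, type,
			       (unsigned long) attr, sizeof(attr), 0);
	}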

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |   71 ++++++++++++++++
 include/linux/bpf.h                 |   42 ++++++++++
 include/uapi/linux/bpf.h            |   24 ++++++
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/syscall.c                |  156 +++++++++++++++++++++++++++++++++++
 5 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 kernel/bpf/syscall.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 81916ab5d96f..27a0a6c6acb4 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1001,6 +1001,77 @@ instruction that loads 64-bit immediate value into a dst_reg.
 Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
 32-bit immediate value into a register.
 
+eBPF maps
+---------
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given type and attributes
+  map_fd = bpf_map_create(map_type, struct nlattr *attr, int len)
+  returns process-local file descriptor or negative error
+
+- lookup key in a given map
+  err = bpf_map_lookup_elem(int fd, void *key, void *value)
+  returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+  err = bpf_map_update_elem(int fd, void *key, void *value)
+  returns zero or negative error
+
+- find and delete element by key in a given map
+  err = bpf_map_delete_elem(int fd, void *key)
+
+- to delete map: close(fd)
+  Exiting process will delete maps automatically
+
+userspace programs use this API to create/populate/read maps that eBPF
+programs are concurrently updating.
+
+maps can have different types: hash, array, bloom filter, radix-tree, etc.
+
+The map is defined by:
+  . type
+  . max number of elements
+  . key size in bytes
+  . value size in bytes
+
+The maps are accessible from an eBPF program with this API:
+  void * bpf_map_lookup_elem(u32 map_fd, void *key);
+  int bpf_map_update_elem(u32 map_fd, void *key, void *value);
+  int bpf_map_delete_elem(u32 map_fd, void *key);
+
+The kernel replaces the process-local map_fd with a kernel-internal map
+pointer while loading the eBPF program.
+
+If eBPF verifier is configured to recognize extra calls in the program
+bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
+  ...
+  ptr_to_value = bpf_map_lookup_elem(map_fd, key)
+  access memory range [ptr_to_value, ptr_to_value + value_size_in_bytes)
+  ...
+  prepare key2 and value2 on stack of key_size and value_size
+  err = bpf_map_update_elem(map_fd, key2, value2)
+  ...
+
+An eBPF program cannot create or delete maps
+(such calls will be unknown to the verifier).
+
+During program loading the refcnt of used maps is incremented, so they don't
+get deleted while the program is running.
+
+bpf_map_update_elem() can fail if the maximum number of elements is reached.
+If key2 already exists, bpf_map_update_elem() replaces it with value2 atomically.
+
+bpf_map_lookup_elem() returns NULL or ptr_to_value, so the program must
+check ptr_to_value != NULL before accessing it.
+NULL means that the element with the given 'key' was not found.
+
+The verifier will check that the program accesses map elements within specified
+size. It will not let programs pass junk values to bpf_map_*_elem() functions,
+so these functions (implemented in C inside kernel) can safely access
+the pointers in all cases.
+
 Testing
 -------
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 000000000000..607ca53fe2af
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,42 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_H
+#define _LINUX_BPF_H 1
+
+#include <uapi/linux/bpf.h>
+#include <linux/workqueue.h>
+
+struct bpf_map;
+struct nlattr;
+
+/* map is generic key/value storage optionally accessible by eBPF programs */
+struct bpf_map_ops {
+	/* funcs callable from userspace (via syscall) */
+	struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
+	void (*map_free)(struct bpf_map *);
+};
+
+struct bpf_map {
+	atomic_t refcnt;
+	enum bpf_map_type map_type;
+	u32 key_size;
+	u32 value_size;
+	u32 max_entries;
+	struct bpf_map_ops *ops;
+	struct work_struct work;
+};
+
+struct bpf_map_type_list {
+	struct list_head list_node;
+	struct bpf_map_ops *ops;
+	enum bpf_map_type type;
+};
+
+void bpf_register_map_type(struct bpf_map_type_list *tl);
+void bpf_map_put(struct bpf_map *map);
+
+#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 45a09b46c578..de21c8ecf0bb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -318,4 +318,28 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
+/* BPF syscall commands */
+enum bpf_cmd {
+	/* create a map with given type and attributes
+	 * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len)
+	 * returns fd or negative error
+	 * map is deleted when fd is closed
+	 */
+	BPF_MAP_CREATE,
+};
+
+enum bpf_map_attributes {
+	BPF_MAP_UNSPEC,
+	BPF_MAP_KEY_SIZE,	/* size of key in bytes */
+	BPF_MAP_VALUE_SIZE,	/* size of value in bytes */
+	BPF_MAP_MAX_ENTRIES,	/* maximum number of entries in a map */
+	__BPF_MAP_ATTR_MAX,
+};
+#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
+#define BPF_MAP_MAX_ATTR_SIZE 65535
+
+enum bpf_map_type {
+	BPF_MAP_TYPE_UNSPEC,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a71145e2769..e9f7334ed07a 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o
+obj-y := core.o syscall.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
new file mode 100644
index 000000000000..b929ba7fc48e
--- /dev/null
+++ b/kernel/bpf/syscall.c
@@ -0,0 +1,156 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/syscalls.h>
+#include <net/netlink.h>
+#include <linux/anon_inodes.h>
+
+static LIST_HEAD(bpf_map_types);
+
+static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
+					  struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
+{
+	struct bpf_map_type_list *tl;
+	struct bpf_map *map;
+
+	list_for_each_entry(tl, &bpf_map_types, list_node) {
+		if (tl->type == type) {
+			map = tl->ops->map_alloc(tb);
+			if (IS_ERR(map))
+				return map;
+			map->ops = tl->ops;
+			map->map_type = type;
+			return map;
+		}
+	}
+	return ERR_PTR(-EINVAL);
+}
+
+/* boot time registration of different map implementations */
+void bpf_register_map_type(struct bpf_map_type_list *tl)
+{
+	list_add(&tl->list_node, &bpf_map_types);
+}
+
+/* called from workqueue */
+static void bpf_map_free_deferred(struct work_struct *work)
+{
+	struct bpf_map *map = container_of(work, struct bpf_map, work);
+
+	/* implementation dependent freeing */
+	map->ops->map_free(map);
+}
+
+/* decrement map refcnt and schedule it for freeing via workqueue
+ * (underlying map implementation ops->map_free() might sleep)
+ */
+void bpf_map_put(struct bpf_map *map)
+{
+	if (atomic_dec_and_test(&map->refcnt)) {
+		INIT_WORK(&map->work, bpf_map_free_deferred);
+		schedule_work(&map->work);
+	}
+}
+
+static int bpf_map_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_map *map = filp->private_data;
+
+	bpf_map_put(map);
+	return 0;
+}
+
+static const struct file_operations bpf_map_fops = {
+	.release = bpf_map_release,
+};
+
+static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
+	[BPF_MAP_KEY_SIZE]    = { .type = NLA_U32 },
+	[BPF_MAP_VALUE_SIZE]  = { .type = NLA_U32 },
+	[BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
+};
+
+/* called via syscall */
+static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len)
+{
+	struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
+	struct bpf_map *map;
+	struct nlattr *attr;
+	int err;
+
+	if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
+		return -EINVAL;
+
+	attr = kmalloc(len, GFP_USER);
+	if (!attr)
+		return -ENOMEM;
+
+	/* copy map attributes from user space */
+	err = -EFAULT;
+	if (copy_from_user(attr, uattr, len) != 0)
+		goto free_attr;
+
+	/* perform basic validation */
+	err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
+	if (err < 0)
+		goto free_attr;
+
+	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+	map = find_and_alloc_map(type, tb);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto free_attr;
+	}
+
+	atomic_set(&map->refcnt, 1);
+
+	err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
+
+	if (err < 0)
+		/* failed to allocate fd */
+		goto free_map;
+
+	/* user supplied array of map attributes is no longer needed */
+	kfree(attr);
+
+	return err;
+
+free_map:
+	map->ops->map_free(map);
+free_attr:
+	kfree(attr);
+	return err;
+}
+
+SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
+		unsigned long, arg4, unsigned long, arg5)
+{
+	/* eBPF syscall is limited to root temporarily. This restriction will
+	 * be lifted when verifier has enough mileage and security audit is
+	 * clean. Note that tracing/networking analytics use cases will be
+	 * turning off 'secure' mode of verifier, since they need to pass
+	 * kernel data back to user space
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (arg5 != 0)
+		return -EINVAL;
+
+	switch (cmd) {
+	case BPF_MAP_CREATE:
+		return map_create((enum bpf_map_type) arg2,
+				  (struct nlattr __user *) arg3, (int) arg4);
+	default:
+		return -EINVAL;
+	}
+}
-- 
1.7.9.5


* [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

Done as a separate commit to ease conflict resolution.
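
For reference, a user space sketch that probes whether the running kernel has
the syscall wired up (assumes installed headers define __NR_bpf; otherwise use
the raw numbers from the tables below; the helper name is illustrative):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	static int have_bpf_syscall(void)
	{
		/* an invalid cmd draws -EINVAL (or -EPERM for non-root)
		 * from a patched kernel; kernels without the syscall
		 * return -ENOSYS instead
		 */
		syscall(__NR_bpf, -1, 0, 0, 0, 0);
		return errno != ENOSYS;
	}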

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/x86/syscalls/syscall_32.tbl  |    1 +
 arch/x86/syscalls/syscall_64.tbl  |    1 +
 include/linux/syscalls.h          |    3 ++-
 include/uapi/asm-generic/unistd.h |    4 +++-
 kernel/sys_ni.c                   |    3 +++
 5 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b78168d85..9fe1b5d002f0 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	bpf			sys_bpf
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922727b9..281150b539a2 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	common	bpf			sys_bpf
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85a9ce4..61bc112b9fa5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -875,5 +875,6 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 			    const char __user *uargs);
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
-
+asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
+			unsigned long arg4, unsigned long arg5);
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc5c78f..22749c134117 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
 __SYSCALL(__NR_getrandom, sys_getrandom)
 #define __NR_memfd_create 279
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
+#define __NR_bpf 280
+__SYSCALL(__NR_bpf, sys_bpf)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 281
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4ddb6f4b..b4b5083f5f5e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
 
 /* operate on Secure Computing state */
 cond_syscall(sys_seccomp);
+
+/* access BPF programs and maps */
+cond_syscall(sys_bpf);
-- 
1.7.9.5


* [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

'maps' is a generic storage facility of different types for sharing data
between the kernel and user space.

The maps are accessed from user space via the BPF syscall, which has the
following commands (a usage sketch follows the list):

- create a map with a given type and attributes
  fd = bpf_map_create(map_type, struct nlattr *attr, int len)
  returns fd or negative error

- lookup key in a given map referenced by fd
  err = bpf_map_lookup_elem(int fd, void *key, void *value)
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  err = bpf_map_update_elem(int fd, void *key, void *value)
  returns zero or negative error

- find and delete element by key in a given map
  err = bpf_map_delete_elem(int fd, void *key)

- iterate map elements (based on the input key, return next_key)
  err = bpf_map_get_next_key(int fd, void *key, void *next_key)

- close(fd) deletes the map
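
Taken together, these commands let user space walk an entire map. A sketch
for a map with 4-byte keys and values; it assumes get_next_key on a missing
key starts from the first element and that the initial key below is absent,
and elides error handling:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/bpf.h>

	/* dump a map with 4-byte keys and 4-byte values */
	static void dump_map(int fd)
	{
		__u32 key = ~0u, next_key, value;	/* ~0u assumed absent */

		/* get_next_key returns a negative error once no key follows */
		while (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, fd,
			       (unsigned long) &key,
			       (unsigned long) &next_key, 0) == 0) {
			if (syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, fd,
				    (unsigned long) &next_key,
				    (unsigned long) &value, 0) == 0)
				printf("%u -> %u\n", next_key, value);
			key = next_key;
		}
	}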

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h      |    8 ++
 include/uapi/linux/bpf.h |   25 ++++++
 kernel/bpf/syscall.c     |  198 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 607ca53fe2af..fd1ac4b5ba8b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -9,6 +9,7 @@
 
 #include <uapi/linux/bpf.h>
 #include <linux/workqueue.h>
+#include <linux/file.h>
 
 struct bpf_map;
 struct nlattr;
@@ -18,6 +19,12 @@ struct bpf_map_ops {
 	/* funcs callable from userspace (via syscall) */
 	struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
 	void (*map_free)(struct bpf_map *);
+	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+
+	/* funcs callable from userspace and from eBPF programs */
+	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
+	int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+	int (*map_delete_elem)(struct bpf_map *map, void *key);
 };
 
 struct bpf_map {
@@ -38,5 +45,6 @@ struct bpf_map_type_list {
 
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 void bpf_map_put(struct bpf_map *map);
+struct bpf_map *bpf_map_get(struct fd f);
 
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index de21c8ecf0bb..f68edb2681f8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -326,6 +326,31 @@ enum bpf_cmd {
 	 * map is deleted when fd is closed
 	 */
 	BPF_MAP_CREATE,
+
+	/* lookup key in a given map
+	 * err = bpf_map_lookup_elem(int fd, void *key, void *value)
+	 * returns zero and stores found elem into value
+	 * or negative error
+	 */
+	BPF_MAP_LOOKUP_ELEM,
+
+	/* create or update key/value pair in a given map
+	 * err = bpf_map_update_elem(int fd, void *key, void *value)
+	 * returns zero or negative error
+	 */
+	BPF_MAP_UPDATE_ELEM,
+
+	/* find and delete elem by key in a given map
+	 * err = bpf_map_delete_elem(int fd, void *key)
+	 * returns zero or negative error
+	 */
+	BPF_MAP_DELETE_ELEM,
+
+	/* lookup key in a given map and return next key
+	 * err = bpf_map_get_next_key(int fd, void *key, void *next_key)
+	 * returns zero and stores next key or negative error
+	 */
+	BPF_MAP_GET_NEXT_KEY,
 };
 
 enum bpf_map_attributes {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b929ba7fc48e..c16bf2052c64 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -13,6 +13,7 @@
 #include <linux/syscalls.h>
 #include <net/netlink.h>
 #include <linux/anon_inodes.h>
+#include <linux/file.h>
 
 static LIST_HEAD(bpf_map_types);
 
@@ -131,6 +132,189 @@ free_attr:
 	return err;
 }
 
+/* if error is returned, fd is released.
+ * On success caller should complete fd access with matching fdput()
+ */
+struct bpf_map *bpf_map_get(struct fd f)
+{
+	struct bpf_map *map;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	if (f.file->f_op != &bpf_map_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	map = f.file->private_data;
+
+	return map;
+}
+
+static int map_lookup_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *value;
+	int err;
+
+	map = bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_USER);
+	if (!key)
+		goto err_put;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ESRCH;
+	rcu_read_lock();
+	value = map->ops->map_lookup_elem(map, key);
+	if (!value)
+		goto err_unlock;
+
+	err = -EFAULT;
+	if (copy_to_user(uvalue, value, map->value_size) != 0)
+		goto err_unlock;
+
+	err = 0;
+
+err_unlock:
+	rcu_read_unlock();
+free_key:
+	kfree(key);
+err_put:
+	fdput(f);
+	return err;
+}
+
+static int map_update_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *value;
+	int err;
+
+	map = bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_USER);
+	if (!key)
+		goto err_put;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ENOMEM;
+	value = kmalloc(map->value_size, GFP_USER);
+	if (!value)
+		goto free_key;
+
+	err = -EFAULT;
+	if (copy_from_user(value, uvalue, map->value_size) != 0)
+		goto free_value;
+
+	/* eBPF programs that use maps run under rcu_read_lock(),
+	 * and all map accessors rely on this fact, so do the same here
+	 */
+	rcu_read_lock();
+	err = map->ops->map_update_elem(map, key, value);
+	rcu_read_unlock();
+
+free_value:
+	kfree(value);
+free_key:
+	kfree(key);
+err_put:
+	fdput(f);
+	return err;
+}
+
+static int map_delete_elem(int ufd, void __user *ukey)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key;
+	int err;
+
+	map = bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_USER);
+	if (!key)
+		goto err_put;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	rcu_read_lock();
+	err = map->ops->map_delete_elem(map, key);
+	rcu_read_unlock();
+
+free_key:
+	kfree(key);
+err_put:
+	fdput(f);
+	return err;
+}
+
+static int map_get_next_key(int ufd, void __user *ukey, void __user *unext_key)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *next_key;
+	int err;
+
+	map = bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_USER);
+	if (!key)
+		goto err_put;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ENOMEM;
+	next_key = kmalloc(map->key_size, GFP_USER);
+	if (!next_key)
+		goto free_key;
+
+	rcu_read_lock();
+	err = map->ops->map_get_next_key(map, key, next_key);
+	rcu_read_unlock();
+	if (err)
+		goto free_next_key;
+
+	err = -EFAULT;
+	if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+		goto free_next_key;
+
+	err = 0;
+
+free_next_key:
+	kfree(next_key);
+free_key:
+	kfree(key);
+err_put:
+	fdput(f);
+	return err;
+}
+
 SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -150,6 +334,20 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 	case BPF_MAP_CREATE:
 		return map_create((enum bpf_map_type) arg2,
 				  (struct nlattr __user *) arg3, (int) arg4);
+	case BPF_MAP_LOOKUP_ELEM:
+		return map_lookup_elem((int) arg2, (void __user *) arg3,
+				       (void __user *) arg4);
+	case BPF_MAP_UPDATE_ELEM:
+		return map_update_elem((int) arg2, (void __user *) arg3,
+				       (void __user *) arg4);
+	case BPF_MAP_DELETE_ELEM:
+		if (arg4 != 0)
+			return -EINVAL;
+		return map_delete_elem((int) arg2, (void __user *) arg3);
+
+	case BPF_MAP_GET_NEXT_KEY:
+		return map_get_next_key((int) arg2, (void __user *) arg3,
+					(void __user *) arg4);
 	default:
 		return -EINVAL;
 	}
-- 
1.7.9.5


* [PATCH v6 net-next 6/6] bpf: add hashtable type of BPF maps
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api, netdev, linux-kernel

Add a new map type, BPF_MAP_TYPE_HASH, and its implementation:

- key/value are opaque range of bytes

- user space provides 3 configuration attributes via BPF syscall:
  key_size, value_size, max_entries

- if value_size == 0, the map is used as a set

- map_update_elem() must fail to insert a new element when the max_entries
  limit is reached

- the map takes care of allocating/freeing key/value pairs

- the update/lookup/delete methods may be called from an eBPF program attached
  to kprobes, so the spin_lock_irqsave() mechanism is used for concurrent
  updates

- optimized for the speed of lookup(), which can be called multiple times from
  an eBPF program that is itself triggered by a high volume of events

- in the future the JIT compiler may recognize lookup() calls and optimize
  them further, since key_size is constant for the life of the eBPF program

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

Note: lib/rhashtable.c couldn't be reused here, since it doesn't manage the
objects, shrink/expand is not supported outside of irq context, and the
overhead of its generic key/head_offsets and hashfn is too high.
In the future, if/when rhashtable gains shrink/expand in a separate thread,
we may add another BPF_MAP_TYPE_ for it, but the security implications of
dynamically resizable maps would need to be analyzed.
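
For reference, a sketch of the element layout and bucket selection this
sizing implies, consistent with the htab_elem definition in the diff; the
helper names are illustrative, and the masking relies on the power-of-2
n_buckets noted in the code:

	/* value lives right after the key, which is rounded up to 8 bytes */
	static void *htab_elem_value(const struct bpf_htab *htab,
				     struct htab_elem *l)
	{
		return l->key + round_up(htab->map.key_size, 8);
	}

	/* power-of-2 bucket count lets a mask replace the modulo */
	static struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
	{
		return &htab->buckets[hash & (htab->n_buckets - 1)];
	}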

 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/Makefile      |    2 +-
 kernel/bpf/hashtab.c     |  373 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 375 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/hashtab.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f68edb2681f8..8069ab7b64cf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,6 +365,7 @@ enum bpf_map_attributes {
 
 enum bpf_map_type {
 	BPF_MAP_TYPE_UNSPEC,
+	BPF_MAP_TYPE_HASH,
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e9f7334ed07a..558e12712ebc 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o
+obj-y := core.o syscall.o hashtab.o
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index 000000000000..3178a6746b92
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,373 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <linux/jhash.h>
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct hlist_head *buckets;
+	struct kmem_cache *elem_cache;
+	spinlock_t lock;
+	u32 count; /* number of elements in this hashtable */
+	u32 n_buckets; /* number of hash buckets */
+	u32 elem_size; /* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+	struct hlist_node hash_node;
+	struct rcu_head rcu;
+	struct bpf_htab *htab;
+	u32 hash;
+	u32 pad;
+	char key[0];
+};
+
+#define BPF_MAP_MAX_KEY_SIZE 256
+static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
+{
+	struct bpf_htab *htab;
+	int err, i;
+
+	htab = kzalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	/* look for mandatory map attributes */
+	err = -EINVAL;
+	if (!attr[BPF_MAP_KEY_SIZE])
+		goto free_htab;
+	htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
+
+	if (!attr[BPF_MAP_VALUE_SIZE])
+		goto free_htab;
+	htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
+
+	if (!attr[BPF_MAP_MAX_ENTRIES])
+		goto free_htab;
+	htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
+
+	/* check sanity of attributes.
+	 * value_size == 0 is allowed, in this case map is used as a set
+	 */
+	if (htab->map.max_entries == 0 || htab->map.key_size == 0)
+		goto free_htab;
+
+	/* hash table size must be power of 2 */
+	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
+
+	err = -E2BIG;
+	if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = kmalloc_array(htab->n_buckets,
+				      sizeof(struct hlist_head), GFP_USER);
+
+	if (!htab->buckets)
+		goto free_htab;
+
+	for (i = 0; i < htab->n_buckets; i++)
+		INIT_HLIST_HEAD(&htab->buckets[i]);
+
+	spin_lock_init(&htab->lock);
+	htab->count = 0;
+
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8) +
+			  htab->map.value_size;
+
+	htab->elem_cache = kmem_cache_create("bpf_htab", htab->elem_size, 0, 0,
+					     NULL);
+	if (!htab->elem_cache)
+		goto free_buckets;
+
+	return &htab->map;
+
+free_buckets:
+	kfree(htab->buckets);
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+					 void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+	return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	u32 hash, key_size;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l)
+		return l->key + round_up(map->key_size, 8);
+
+	return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l, *next_l;
+	u32 hash, key_size;
+	int i;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	/* lookup the key */
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (!l) {
+		i = 0;
+		goto find_first_elem;
+	}
+
+	/* key was found, get next key in the same bucket */
+	next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+				  struct htab_elem, hash_node);
+
+	if (next_l) {
+		/* if next elem in this hash list is non-NULL, return its key */
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	}
+
+	/* no more elements in this hash list, go to the next bucket */
+	i = hash & (htab->n_buckets - 1);
+	i++;
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		head = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
+					  struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
+{
+	void *l;
+
+	l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
+	if (!l)
+		return ERR_PTR(-ENOMEM);
+	return l;
+}
+
+static void free_htab_elem_rcu(struct rcu_head *rcu)
+{
+	struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
+
+	kmem_cache_free(l->htab->elem_cache, l);
+}
+
+static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	l->htab = htab;
+	call_rcu(&l->rcu, free_htab_elem_rcu);
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l_new, *l_old;
+	struct hlist_head *head;
+	unsigned long flags;
+	u32 key_size;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	l_new = htab_alloc_elem(htab);
+	if (IS_ERR(l_new))
+		return -ENOMEM;
+
+	key_size = map->key_size;
+
+	memcpy(l_new->key, key, key_size);
+	memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
+
+	l_new->hash = htab_map_hash(l_new->key, key_size);
+
+	/* bpf_map_update_elem() can be called in_irq() as well, so
+	 * spin_lock() or spin_lock_bh() cannot be used
+	 */
+	spin_lock_irqsave(&htab->lock, flags);
+
+	head = select_bucket(htab, l_new->hash);
+
+	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+
+	if (!l_old && unlikely(htab->count >= map->max_entries)) {
+		/* if elem with this 'key' doesn't exist and we've reached
+		 * max_entries limit, fail insertion of new elem
+		 */
+		spin_unlock_irqrestore(&htab->lock, flags);
+		kmem_cache_free(htab->elem_cache, l_new);
+		return -EFBIG;
+	}
+
+	/* add new element to the head of the list, so that concurrent
+	 * search will find it before old elem
+	 */
+	hlist_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		hlist_del_rcu(&l_old->hash_node);
+		release_htab_elem(htab, l_old);
+	} else {
+		htab->count++;
+	}
+	spin_unlock_irqrestore(&htab->lock, flags);
+
+	return 0;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	unsigned long flags;
+	u32 hash, key_size;
+	int ret = -ESRCH;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	spin_lock_irqsave(&htab->lock, flags);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l) {
+		hlist_del_rcu(&l->hash_node);
+		htab->count--;
+		release_htab_elem(htab, l);
+		ret = 0;
+	}
+
+	spin_unlock_irqrestore(&htab->lock, flags);
+	return ret;
+}
+
+static void delete_all_elements(struct bpf_htab *htab)
+{
+	int i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_node *n;
+		struct htab_elem *l;
+
+		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			hlist_del_rcu(&l->hash_node);
+			htab->count--;
+			kmem_cache_free(htab->elem_cache, l);
+		}
+	}
+}
+
+/* called when map->refcnt goes to zero */
+static void htab_map_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	/* wait for all outstanding updates to complete */
+	synchronize_rcu();
+
+	/* kmem_cache_free all htab elements */
+	delete_all_elements(htab);
+
+	/* and destroy cache, which might sleep */
+	kmem_cache_destroy(htab->elem_cache);
+
+	kfree(htab->buckets);
+	kfree(htab);
+}
+
+static struct bpf_map_ops htab_ops = {
+	.map_alloc = htab_map_alloc,
+	.map_free = htab_map_free,
+	.map_get_next_key = htab_map_get_next_key,
+	.map_lookup_elem = htab_map_lookup_elem,
+	.map_update_elem = htab_map_update_elem,
+	.map_delete_elem = htab_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+	.ops = &htab_ops,
+	.type = BPF_MAP_TYPE_HASH,
+};
+
+static int __init register_htab_map(void)
+{
+	bpf_register_map_type(&tl);
+	return 0;
+}
+late_initcall(register_htab_map);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
  2014-08-26  1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov
@ 2014-08-26  1:06   ` David Miller
  2014-08-26  1:35     ` Alexei Starovoitov
  2014-08-26  4:12     ` Alexei Starovoitov
  0 siblings, 2 replies; 24+ messages in thread
From: David Miller @ 2014-08-26  1:06 UTC (permalink / raw)
  To: ast
  Cc: mingo, torvalds, luto, rostedt, dborkman, chema, edumazet,
	a.p.zijlstra, brendan.d.gregg, namhyung, hpa, akpm, keescook,
	linux-api, netdev, linux-kernel

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Mon, 25 Aug 2014 18:00:53 -0700

> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.

I think you need to rethink this.

I understand that you want to be able to compile arbitrary C code into
eBPF, but you have to restrict strongly what data the eBPF code can get
to.

Arbitrary pointer loads is asking for trouble.

Instead I would rather you look into a model like what the quake
engine uses for its VM.

Namely, the program can do loads and stores from/to a data section,
but all of them are validated to be in the range of the VM program's
image.
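
Something along these lines (a sketch only; the struct and helper are
illustrative, not from any real engine):

#include <errno.h>
#include <stdint.h>
#include <string.h>

struct vm {
	unsigned char *data;	/* the program's private data segment */
	size_t data_size;
};

/* every load is validated against the VM's own segment bounds */
static int vm_load32(const struct vm *vm, uint32_t addr, uint32_t *out)
{
	if ((size_t)addr + sizeof(*out) > vm->data_size)
		return -EFAULT;
	memcpy(out, vm->data + addr, sizeof(*out));
	return 0;
}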

eBPF programs should only be able to access things using either:

1) Well defined entry/exit points for control transfer

2) Load/Store within a private limited data segment range used
   only by the eBPF program

I don't want the eBPF program to be able to get "out of its box"
in any way shape or form.

And besides, you're only making this thing as an optimization right?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]   ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
@ 2014-08-26  1:07     ` David Miller
       [not found]       ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  2014-08-26  3:52     ` Stephen Hemminger
  1 sibling, 1 reply; 24+ messages in thread
From: David Miller @ 2014-08-26  1:07 UTC (permalink / raw)
  To: ast-uqk4Ao+rVK5Wk0Htik3J/w
  Cc: mingo-DgEjT+Ai2ygdnm+yROfE0A,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, rostedt-nx8X9YLhiw1AfugRpC6u6w,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA, chema-hpIqsD4AKlfQT0dZR+AlfA,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw,
	brendan.d.gregg-Re5JQEeQqe8AvxtiuMwx3w,
	namhyung-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	keescook-F7+t8E8rja9g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
Date: Mon, 25 Aug 2014 18:00:56 -0700

> -
> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
> +			unsigned long arg4, unsigned long arg5);

Please do not add interfaces with opaque types as arguments.

It is impossible for the compiler to type check the args at
compile time when userspace tries to use this stuff.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
  2014-08-26  1:06   ` David Miller
@ 2014-08-26  1:35     ` Alexei Starovoitov
       [not found]       ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-08-26  4:12     ` Alexei Starovoitov
  1 sibling, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  1:35 UTC (permalink / raw)
  To: David Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Mon, 25 Aug 2014 18:00:53 -0700
>
>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
>
> I think you need to rethink this.
>
> I understand that you want to be able to compile arbitrary C code into
> eBPF, but you have to restrict strongly what data the eBPF code can get
> to.

I believe verifier already does restrict it. I don't see any holes in
the architecture. I'm probably not explaining it clearly though :(

> Arbitrary pointer loads is asking for trouble.

Of course.
There is no arbitrary pointer from user space.
Verifier checks all pointers.
I guess this commit log description is confusing.
It says:
BPF_LD_IMM64(R1, const_imm_map_ptr)
that's what appears in the program _after_ it goes through verifier.
User space cannot pass a pointer into the kernel.

> Instead I would rather you look into a model like what the quake
> engine uses for its VM.
>
> Namely, the program can do loads and stores from/to a data section,
> but all of them are validated to be in the range of the VM program's
> image.

That's exactly what eBPF does as well. Loads and stores can only
be from three 'sections': pointer to stack, pointer to context, and pointer
to map value.
All verifier logic, including these pointer checks, is described in the
extensive verifier documentation. Please take a look at it.
Andy said that it's a good doc :)

Programs like:
BPF_LD_IMM64(R1, some_64bit_constant)
*(u8*)(R1 + off) = imm;
will be rejected by the verifier.
I even have a test case for that later in the patches.

> eBPF programs should only be able to access things using either:
>
> 1) Well defined entry/exit points for control transfer
>
> 2) Load/Store within a private limited data segment range used
>    only by the eBPF program
>
> I don't want the eBPF program to be able to get "out of its box"
> in any way shape or form.

Agree 100%.
And I believe I already achieved that with verifier.

> And besides, you're only making this thing as an optimization right?

Not quite. This instruction is needed to get rid of IDR for maps;
it speeds up lookup, makes the verifier much simpler and, most
important, cleans up the program<->maps interaction.
I've described it better in V4 set:
https://lkml.org/lkml/2014/8/13/111

Should I resend this patch together with all of the verifier patches?
I included it here as #1, since patch #2 moves all eBPF macros
into uapi/linux/bpf.h and this instruction by itself is completely
harmless. It's the verifier that needs to do the proper checking.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
       [not found]       ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-26  1:38         ` Andy Lutomirski
  2014-08-26  1:53           ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2014-08-26  1:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
>> Date: Mon, 25 Aug 2014 18:00:53 -0700
>>
>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
>>
>> I think you need to rethink this.
>>
>> I understand that you want to be able to compile arbitrary C code into
>> eBPF, but you have to restrict strongly what data the eBPF code can get
>> to.
>
> I believe verifier already does restrict it. I don't see any holes in
> the architecture. I'm probably not explaining it clearly though :(
>
>> Arbitrary pointer loads is asking for trouble.
>
> Of course.
> There is no arbitrary pointer from user space.
> Verifier checks all pointers.
> I guess this commit log description is confusing.
> It says:
> BPF_LD_IMM64(R1, const_imm_map_ptr)
> that's what appears in the program _after_ it goes through verifier.
> User space cannot pass a pointer into the kernel.

If you don't intend for userspace to load a program that contains this
instruction, then why does it need to be an instruction that the
verifier rewrites?  Why not have an instruction "load immediate
relocated pointer" that contains a reference to a relocation table and
have the JIT do it?  That might be easier to understand than having
the verifier do it, and it'll avoid committing to ABIs before we need
them.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]       ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2014-08-26  1:43         ` Alexei Starovoitov
  2014-08-26  7:45           ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  1:43 UTC (permalink / raw)
  To: David Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> Date: Mon, 25 Aug 2014 18:00:56 -0700
>
>> -
>> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
>> +                     unsigned long arg4, unsigned long arg5);
>
> Please do not add interfaces with opaque types as arguments.
>
> It is impossible for the compiler to type check the args at
> compile time when userspace tries to use this stuff.

I share this concern. I went with a single BPF syscall, because the
alternative is six syscalls, one for every command, and more
syscalls in the future when we'd need to add another command.
I think type casting is a much lesser evil.
We already have similar muxing syscalls.
It feels to me that a single mux/demux syscall is easier to support,
document, and extend with new commands. Type casting, yeah, not pretty.
Most users will be using wrappers similar to those I've defined in libbpf.h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
  2014-08-26  1:38         ` Andy Lutomirski
@ 2014-08-26  1:53           ` Alexei Starovoitov
       [not found]             ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  1:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Alexei Starovoitov <ast@plumgrid.com>
>>> Date: Mon, 25 Aug 2014 18:00:53 -0700
>>>
>>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
>>>
>>> I think you need to rethink this.
>>>
>>> I understand that you want to be able to compile arbitrary C code into
>>> eBPF, but you have to restrict strongly what data the eBPF code can get
>>> to.
>>
>> I believe verifier already does restrict it. I don't see any holes in
>> the architecture. I'm probably not explaining it clearly though :(
>>
>>> Arbitrary pointer loads is asking for trouble.
>>
>> Of course.
>> There is no arbitrary pointer from user space.
>> Verifier checks all pointers.
>> I guess this commit log description is confusing.
>> It says:
>> BPF_LD_IMM64(R1, const_imm_map_ptr)
>> that's what appears in the program _after_ it goes through verifier.
>> User space cannot pass a pointer into the kernel.
>
> If you don't intend for userspace to load a program that contains this
> instruction, then why does it need to be an instruction that the
> verifier rewrites?  Why not have an instruction "load immediate

user space uses the _pseudo_ bpf_ld_imm64 instruction.
_pseudo_ stands for using 'map_fd' as the imm instead of a pointer.

> relocated pointer" that contains a reference to a relocation table and

Andy, I guess you missed explanation in:
https://lkml.org/lkml/2014/8/13/111
"
Obviously user space doesn't know what kernel map pointer is associated
with process-local map-FD.
So it's using pseudo BPF_LD_IMM64 instruction.
BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg
BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg
Other values are reserved for now. (They will be used to implement
global variables, strings and other constants and per-cpu areas in the future)
So the programs look like:
  BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
  BPF_CALL(BPF_FUNC_map_lookup_elem),
eBPF verifier scans the program for such pseudo instructions, converts
process_local_map_fd -> in-kernel map pointer
and drops 'pseudo' flag of BPF_LD_IMM64 instruction.
"

To rephrase it differently:
  BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
is very much what you're suggesting by "load immediate relocated pointer".
Right?

> have the JIT do it?  That might be easier to understand than having
> the verifier do it, and it'll avoid committing to ABIs before we need
> them.

That part I don't understand.
The patch that handles the pseudo_with_map_fd ->
normal_with_kernel_pointer conversion is only 147 lines:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=d82d3daa20465dfdc6b2a0094ad27de9edbb328b
I cannot think of a shorter version.
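
For the record, here is the idea of that conversion as a heavily
simplified sketch; the real code is in the commit above, and
lookup_map_ptr_by_fd() is a stand-in for the actual fdget() plus
map type check:

#include <errno.h>

struct bpf_insn {		/* simplified equivalent of the uapi struct */
	unsigned char code;
	unsigned char dst_reg:4, src_reg:4;
	short off;
	int imm;
};

#define BPF_LD_IMM64_OP   0x18		/* BPF_LD | BPF_DW | BPF_IMM */
#define BPF_PSEUDO_MAP_FD 1

/* stand-in for the real fd -> in-kernel map pointer lookup */
extern unsigned long lookup_map_ptr_by_fd(int ufd);

static int fixup_map_fds(struct bpf_insn *insns, int cnt)
{
	int i;

	for (i = 0; i < cnt; i++) {
		struct bpf_insn *insn = &insns[i];
		unsigned long addr;

		if (insn->code != BPF_LD_IMM64_OP)
			continue;

		if (insn->src_reg == BPF_PSEUDO_MAP_FD) {
			addr = lookup_map_ptr_by_fd(insn->imm);
			if (!addr)
				return -EBADF;
			/* a 64-bit immediate spans two insns: low 32 bits
			 * here, high 32 bits in the next insn's imm
			 */
			insn[0].imm = (int)addr;
			insn[1].imm = (int)(addr >> 32);
			insn->src_reg = 0;	/* drop the 'pseudo' flag */
		}
		i++;	/* ld_imm64 always occupies two insn slots */
	}
	return 0;
}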

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
       [not found]             ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-26  1:54               ` Andy Lutomirski
       [not found]                 ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2014-08-26  1:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:53 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>>>> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
>>>> Date: Mon, 25 Aug 2014 18:00:53 -0700
>>>>
>>>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
>>>>
>>>> I think you need to rethink this.
>>>>
>>>> I understand that you want to be able to compile arbitrary C code into
>>>> eBPF, but you have to restrict strongly what data the eBPF code can get
>>>> to.
>>>
>>> I believe verifier already does restrict it. I don't see any holes in
>>> the architecture. I'm probably not explaining it clearly though :(
>>>
>>>> Arbitrary pointer loads is asking for trouble.
>>>
>>> Of course.
>>> There is no arbitrary pointer from user space.
>>> Verifier checks all pointers.
>>> I guess this commit log description is confusing.
>>> It says:
>>> BPF_LD_IMM64(R1, const_imm_map_ptr)
>>> that's what appears in the program _after_ it goes through verifier.
>>> User space cannot pass a pointer into the kernel.
>>
>> If you don't intend for userspace to load a program that contains this
>> instruction, then why does it need to be an instruction that the
>> verifier rewrites?  Why not have an instruction "load immediate
>
> user space uses the _pseudo_ bpf_ld_imm64 instruction.
> _pseudo_ stands for using 'map_fd' as the imm instead of a pointer.
>
>> relocated pointer" that contains a reference to a relocation table and
>
> Andy, I guess you missed explanation in:
> https://lkml.org/lkml/2014/8/13/111
> "
> Obviously user space doesn't know what kernel map pointer is associated
> with process-local map-FD.
> So it's using pseudo BPF_LD_IMM64 instruction.
> BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg
> BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg
> Other values are reserved for now. (They will be used to implement
> global variables, strings and other constants and per-cpu areas in the future)
> So the programs look like:
>   BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
>   BPF_CALL(BPF_FUNC_map_lookup_elem),
> eBPF verifier scans the program for such pseudo instructions, converts
> process_local_map_fd -> in-kernel map pointer
> and drops 'pseudo' flag of BPF_LD_IMM64 instruction.
> "

Will a program that uses BPF_LD_IMM64 w/o the BPF_PSEUDO_MAP_FD thing be accepted?

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
       [not found]                 ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-26  2:02                   ` Alexei Starovoitov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  2:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:54 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Aug 25, 2014 at 6:53 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>>>> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
>>>>> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
>>>>> Date: Mon, 25 Aug 2014 18:00:53 -0700
>>>>>
>>>>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
>>>>>
>>>>> I think you need to rethink this.
>>>>>
>>>>> I understand that you want to be able to compile arbitrary C code into
>>>>> eBPF, but you have to restrict strongly what data the eBPF code can get
>>>>> to.
>>>>
>>>> I believe verifier already does restrict it. I don't see any holes in
>>>> the architecture. I'm probably not explaining it clearly though :(
>>>>
>>>>> Arbitrary pointer loads is asking for trouble.
>>>>
>>>> Of course.
>>>> There is no arbitrary pointer from user space.
>>>> Verifier checks all pointers.
>>>> I guess this commit log description is confusing.
>>>> It says:
>>>> BPF_LD_IMM64(R1, const_imm_map_ptr)
>>>> that's what appears in the program _after_ it goes through verifier.
>>>> User space cannot pass a pointer into the kernel.
>>>
>>> If you don't intend for userspace to load a program that contains this
>>> instruction, then why does it need to be an instruction that the
>>> verifier rewrites?  Why not have an instruction "load immediate
>>
>> user space uses the _pseudo_ bpf_ld_imm64 instruction.
>> _pseudo_ stands for using 'map_fd' as the imm instead of a pointer.
>>
>>> relocated pointer" that contains a reference to a relocation table and
>>
>> Andy, I guess you missed explanation in:
>> https://lkml.org/lkml/2014/8/13/111
>> "
>> Obviously user space doesn't know what kernel map pointer is associated
>> with process-local map-FD.
>> So it's using pseudo BPF_LD_IMM64 instruction.
>> BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg
>> BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg
>> Other values are reserved for now. (They will be used to implement
>> global variables, strings and other constants and per-cpu areas in the future)
>> So the programs look like:
>>   BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
>>   BPF_CALL(BPF_FUNC_map_lookup_elem),
>> eBPF verifier scans the program for such pseudo instructions, converts
>> process_local_map_fd -> in-kernel map pointer
>> and drops 'pseudo' flag of BPF_LD_IMM64 instruction.
>> "
>
> Will a program that uses BPF_LD_IMM64 w/o the BPF_PSEUDO_MAP_FD thing be accepted?

If you mean a program like:
BPF_LD_IMM64(BPF_REG_1, 0xdead),
BPF_CALL(BPF_FUNC_map_lookup_elem),
yes, it will be rejected, because the type of R1 will not match the
map_lookup() argument constraints.
See check_ld_imm() in verifier.c where it assigns the type during verification.
There are 5 tests in verifier testsuite that test things around bpf_ld_imm64
and 2 tests around _pseudo_ bpf_ld_imm64.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]   ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  2014-08-26  1:07     ` David Miller
@ 2014-08-26  3:52     ` Stephen Hemminger
  2014-08-26  4:24       ` Alexei Starovoitov
  1 sibling, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2014-08-26  3:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin,
	Andrew Morton, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Per discussion at Kernel Summit. Every new syscall requires
a manual page and test programs. We have had too many new syscalls
that are DOA.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction
  2014-08-26  1:06   ` David Miller
  2014-08-26  1:35     ` Alexei Starovoitov
@ 2014-08-26  4:12     ` Alexei Starovoitov
  1 sibling, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  4:12 UTC (permalink / raw)
  To: David Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote:
>
> Instead I would rather you look into a model like what the quake
> engine uses for its VM.

Thanks for the tip! I wasn't aware of quake vm.
I've looked through several papers and slides.
I'm surely missing something in what they're doing, but
here is my comparison of eBPF vs QVM:
- QVM ISA is stack based vs eBPF registers
- pointer types are predefined by QVM ISA whereas eBPF relies
  on static verifier which is more extensible, since verifier can get
  progressively smarter with time without need to change interpreter,
  llvm and JITs, whereas QVM would need changes through the
  toolchain, interpreter, JITs to support new pointer type
- QVM calls with negative values invoke helper functions, which is
  similar to eBPF calls. The difference is QVM keeps negative values
  while interpreting and doing run-time checking of arguments whereas
  eBPF is statically verifying all before interpreting
- access to QVM 'local' memory is bounds checked at run-time,
  whereas eBPF does load/store bounds checking by static analysis

I may be wrong, but it seems possible to sidestep QVM's run-time
checking, since its 'top of stack' is typeless and a constant can
seemingly be pushed there as a pointer.

I'm biased, but eBPF seems like the better architecture: more flexible,
likely faster to interpret, with simpler JITs and a more powerful
compiler. The downside, of course, is that the eBPF verifier is more
complex than QVM's, which mainly relies on run-time checks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
  2014-08-26  3:52     ` Stephen Hemminger
@ 2014-08-26  4:24       ` Alexei Starovoitov
  2014-08-26  7:46         ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  4:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin,
	Andrew Morton, Kees Cook, Linux API, Network Development, LKML

On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> Per discussion at Kernel Summit. Every new syscall requires
> a manual page and test programs. We have had too many new syscalls
> that are DOA.

There is verifier testsuite that is testing eBPF verifier from userspace
via bpf syscall. Also there are multiple examples and libbpf.
I think test coverage for bpf syscall is quite substantial already.

As for the manpage, I already mentioned my plan in the other thread.
I will do it when the stuff lands, since writing docs in English is the most
difficult part of these patches. I already rewrote Documentation/../filter.txt
several times. Kees once noticed discrepancies there when the
interface changed from map_id to map_fd.
For the manpage I'll get help from a tech writer, since I want it to
be better than the docs I put in Documentation/...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
  2014-08-26  1:43         ` Alexei Starovoitov
@ 2014-08-26  7:45           ` Ingo Molnar
  2014-08-26 16:29             ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2014-08-26  7:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem@davemloft.net> wrote:
> > From: Alexei Starovoitov <ast@plumgrid.com>
> > Date: Mon, 25 Aug 2014 18:00:56 -0700
> >
> >> -
> >> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
> >> +                     unsigned long arg4, unsigned long arg5);
> >
> > Please do not add interfaces with opaque types as arguments.
> >
> > It is impossible for the compiler to type check the args at
> > compile time when userspace tries to use this stuff.
> 
> I share this concern. I went with a single BPF syscall, because the
> alternative is six syscalls, one for every command, and more
> syscalls in the future when we'd need to add another command.

We had a similar problem growing the perf syscall - and we were 
able to hold to a single syscall, which I think has served us 
well. Had we gone with a per functionality syscall we'd have 
something like a dozen syscalls today, scattered all around 
non-continuously in the syscall space on most platforms.

But note that 'opaque or non-opaque' is a false dichotomy, as 
there are 3 options in reality: what we used instead of an opaque 
type was an extensible data type, an extensible C structure, 
with structure size expectations part of the structure.

See 'struct perf_event_attr':

SYSCALL_DEFINE5(perf_event_open,
                struct perf_event_attr __user *, attr_uptr,
                pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)

That way new versions of the data type are immediately obvious to 
the kernel, and compatibility can be handled well. Smaller, 
previous versions received from old user-space are padded out 
transparently to the kernel's value of the structure, with zeroes 
filled in.

See perf_copy_attr() in kernel/events/core.c. Instead of 
versioning the structure, we use its size as a finegrained and 
robust version indicator in essence.

That way it's both forwards and backwards compatible, as much as 
possible technically: old kernel can run new user-space, and new 
user-space will be able to take advantage of as much of an old 
kernel's capabilities as possible, and in the typical case of 
version match there's no extra overhead worth speaking of.

This way we were able to gradually grow to the sophisticated ABI 
you can find in include/uapi/linux/perf_event.h, without having 
to touch the syscall interface. (It's not the only method: we 
also have a handful of ioctls, where that's the most natural 
interface for a perf event fd.)
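
In simplified form the copy looks like this (a model only, using plain
memory instead of copy_from_user(); the real perf_copy_attr() has a
few more checks):

#include <errno.h>
#include <string.h>

/* 'usize' is what (an older or newer) user space passed in,
 * 'ksize' is sizeof() the kernel's current structure
 */
static int copy_versioned_attr(void *kattr, size_t ksize,
			       const void *uattr, size_t usize)
{
	if (usize > ksize) {
		/* newer user space: accept only if every byte the
		 * kernel doesn't know about is zero
		 */
		const unsigned char *tail = (const unsigned char *)uattr + ksize;
		size_t i;

		for (i = 0; i < usize - ksize; i++)
			if (tail[i])
				return -E2BIG;
		usize = ksize;
	}

	/* older user space: the fields it didn't know about read as zero */
	memset(kattr, 0, ksize);
	memcpy(kattr, uattr, usize);
	return 0;
}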

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
  2014-08-26  4:24       ` Alexei Starovoitov
@ 2014-08-26  7:46         ` Ingo Molnar
       [not found]           ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2014-08-26  7:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stephen Hemminger, David S. Miller, Linus Torvalds,
	Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, LKML


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > Per discussion at Kernel Summit. Every new syscall requires
> > a manual page and test programs. We have had too many new syscalls
> > that are DOA.
> 
> There is verifier testsuite that is testing eBPF verifier from userspace
> via bpf syscall. Also there are multiple examples and libbpf.
> I think test coverage for bpf syscall is quite substantial already.

This is in tools/bpf/, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]           ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-26  8:00             ` Daniel Borkmann
       [not found]               ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Borkmann @ 2014-08-26  8:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexei Starovoitov, Stephen Hemminger, David S. Miller,
	Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, LKML

On 08/26/2014 09:46 AM, Ingo Molnar wrote:
> * Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>> On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger
>> <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org> wrote:
>>> Per discussion at Kernel Summit. Every new syscall requires
>>> a manual page and test programs. We have had too many new syscalls
>>> that are DOA.
>>
>> There is verifier testsuite that is testing eBPF verifier from userspace
>> via bpf syscall. Also there are multiple examples and libbpf.
>> I think test coverage for bpf syscall is quite substantial already.
>
> This is in tools/bpf/, right?

No, it contains a BPF JIT disasm, bpf assembler and a debugger, but the
last two are for the 'classic' BPF interface only. There's a test suite
for BPF/eBPF in general under lib/test_bpf.c, but so far it tests only
the current code w/o eBPF verifier.

That said, I think Alexei is referring to the examples et al from the
bigger previous proposed patch set.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]               ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-26  8:02                 ` Ingo Molnar
       [not found]                   ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2014-08-26  8:02 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Stephen Hemminger, David S. Miller,
	Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, LKML


* Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On 08/26/2014 09:46 AM, Ingo Molnar wrote:
> >* Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> >>On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger
> >><stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org> wrote:
> >>>Per discussion at Kernel Summit. Every new syscall requires
> >>>a manual page and test programs. We have had too many new syscalls
> >>>that are DOA.
> >>
> >>There is verifier testsuite that is testing eBPF verifier from userspace
> >>via bpf syscall. Also there are multiple examples and libbpf.
> >>I think test coverage for bpf syscall is quite substantial already.
> >
> >This is in tools/bpf/, right?
> 
> No, it contains a BPF JIT disasm, bpf assembler and a debugger, 
> but the last two are for the 'classic' BPF interface only. 
> There's a test suite for BPF/eBPF in general under 
> lib/test_bpf.c, but so far it tests only the current code w/o 
> eBPF verifier.
> 
> That said, I think Alexei is referring to the examples et al 
> from the bigger previous proposed patch set.

I mean, if all the testing already exists, it should be part of 
an initial submission and such.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
  2014-08-26  7:45           ` Ingo Molnar
@ 2014-08-26 16:29             ` Alexei Starovoitov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, Linux API, Network Development, LKML

On Tue, Aug 26, 2014 at 12:45 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Alexei Starovoitov <ast@plumgrid.com>
>> > Date: Mon, 25 Aug 2014 18:00:56 -0700
>> >
>> >> -
>> >> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
>> >> +                     unsigned long arg4, unsigned long arg5);
>> >
>> > Please do not add interfaces with opaque types as arguments.
>> >
>> > It is impossible for the compiler to type check the args at
>> > compile time when userspace tries to use this stuff.
>>
>> I share this concern. I went with a single BPF syscall, because the
>> alternative is six syscalls, one for every command, and more
>> syscalls in the future when we'd need to add another command.
>
> See 'struct perf_event_attr':
>
> SYSCALL_DEFINE5(perf_event_open,
>                 struct perf_event_attr __user *, attr_uptr,
>                 pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
>
> This way we were able to gradually grow to the sophisticated ABI
> you can find in include/uapi/linux/perf_event.h, without having
> to touch the syscall interface. (It's not the only method: we
> also have a handful of ioctls, where that's the most natural
> interface for a perf event fd.)

Thanks! That's a good alternative.
One approach would be to have two bpf syscalls:
map_fd = bpf_map_create(map_type, nlattr)
+ a bunch of ioctl()s that do lookup/update/delete/...
prog_fd = bpf_prog_load(prog_type, nlattr)
but the ioctl()s would still do a lot of type casting.

so how about single syscall:
fd = bpf(int cmd, union bpf_attr *attrs, int size)
union bpf_attr {
  struct bpf_map_create_attr { /* used by bpf_map_create cmd */
    __u32 key_size, value_size, ...
  };
  struct bpf_map_access_attr { /* by lookup/update/delete/get_next */
    int map_fd;
    void __user *key; /* used by all */
    void __user *value; /* used by lookup/update */
    void __user *next_key; /* used by get_next */
  };
  struct nlattr prog_attr[0]; /* used by bpf_prog_load cmd */
};

If you prefer, I can make the prog_load attrs an explicit struct
as well, but nlattr feels cleaner, since there can be optional
sections that are quite large, like one with 'readonly constants'
that I'm still working on as part of cleaner strings support.
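
A user-space call against that union would then look roughly like this
(sketch only; the map_access member name, the command value and the
bpf() prototype are made up to match the pseudo-code above):

union bpf_attr {			/* pseudo-code above, fleshed out */
	struct {
		int map_fd;
		void *key;
		void *value;
		void *next_key;
	} map_access;
	/* map_create and prog_load parts elided */
};

#define BPF_MAP_LOOKUP_ELEM 1		/* illustrative value */

extern int bpf(int cmd, union bpf_attr *attr, int size);

static int lookup(int map_fd, void *key, void *value)
{
	union bpf_attr attr = {
		.map_access = {
			.map_fd = map_fd,
			.key    = key,
			.value  = value,
		},
	};

	/* passing sizeof(attr) lets the kernel apply the same
	 * zero-padding versioning trick as perf_event_open()
	 */
	return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}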

Also I think I will change the tracing attachment interface.
Currently write("prog_fd_as_int") -> "/sys/../tracing/events/..."
does the attachment, but as Andy noticed that may not be
secure enough if there is a fork() somewhere in the process
that does the open() of debugfs and the write().
Looks like another ioctl() like PERF_EVENT_IOC_SET_FILTER_BPF
on top of perf_event_open() would be cleaner, and the event->owner
check can be done easily.
Can I create a kprobe event through perf_event_open()?
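
On the user-space side the ioctl() variant would be something like the
sketch below; note that the PERF_EVENT_IOC_SET_FILTER_BPF name is
hypothetical at this point and its ioctl number here is made up:

#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc has no wrapper for perf_event_open */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* hypothetical ioctl; number chosen arbitrarily for the sketch */
#define PERF_EVENT_IOC_SET_FILTER_BPF _IOW('$', 99, int)

static int attach_bpf_prog(struct perf_event_attr *kprobe_attr, int prog_fd)
{
	int event_fd = perf_event_open(kprobe_attr, -1, 0, -1, 0);

	if (event_fd < 0)
		return -1;
	return ioctl(event_fd, PERF_EVENT_IOC_SET_FILTER_BPF, prog_fd);
}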

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386
       [not found]                   ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-08-26 16:40                     ` Alexei Starovoitov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26 16:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Borkmann, Stephen Hemminger, David S. Miller,
	Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez,
	Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, LKML

On Tue, Aug 26, 2014 at 1:02 AM, Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
>> That said, I think Alexei is referring to the examples et al
>> from the bigger previous proposed patch set.
>
> I mean, if all the testing already exists, it should be part of
> an initial submission and such.

That's what I did. The V5 set contains all examples and the verifier testsuite:
https://lkml.org/lkml/2014/8/24/107
Dave asked to split it up into subsets, since it's too big to review
at once, which makes sense. So this V6 is only the first part, and
the plan was:
1st(this) set - introduces uapi/linux/bpf.h and BPF syscall for maps only
2nd set will extend BPF syscall with programs and verifier
3rd set will use eBPF in tracing, add samples and verifier tests
4th set will have llvm and C examples
Each set is around 6-8 patches instead of 29 in V5

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-08-26 16:40 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-26  1:00 [PATCH v6 net-next 0/6] introduce BPF syscall Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov
2014-08-26  1:06   ` David Miller
2014-08-26  1:35     ` Alexei Starovoitov
     [not found]       ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  1:38         ` Andy Lutomirski
2014-08-26  1:53           ` Alexei Starovoitov
     [not found]             ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  1:54               ` Andy Lutomirski
     [not found]                 ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  2:02                   ` Alexei Starovoitov
2014-08-26  4:12     ` Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 Alexei Starovoitov
     [not found]   ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2014-08-26  1:07     ` David Miller
     [not found]       ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2014-08-26  1:43         ` Alexei Starovoitov
2014-08-26  7:45           ` Ingo Molnar
2014-08-26 16:29             ` Alexei Starovoitov
2014-08-26  3:52     ` Stephen Hemminger
2014-08-26  4:24       ` Alexei Starovoitov
2014-08-26  7:46         ` Ingo Molnar
     [not found]           ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-26  8:00             ` Daniel Borkmann
     [not found]               ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-26  8:02                 ` Ingo Molnar
     [not found]                   ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-26 16:40                     ` Alexei Starovoitov
     [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2014-08-26  1:00   ` [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space Alexei Starovoitov
2014-08-26  1:00   ` [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov
2014-08-26  1:00   ` [PATCH v6 net-next 6/6] bpf: add hashtable type of " Alexei Starovoitov
