* [PATCH v6 net-next 0/6] introduce BPF syscall
@ 2014-08-26  1:00 Alexei Starovoitov
  2014-08-26  1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2014-08-26  1:00 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton,
	Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hi All,
splitting big set of patches into smaller sets:
1st(this) set - introduces uapi/linux/bpf.h and BPF syscall for maps only
2nd set will extend BPF syscall with programs and verifier
3rd set will use eBPF in tracing, add samples and verifier tests
4th set will have llvm and C examples
Tested on x64 and i386.
BPF syscall with maps only is usable, but not very useful until the rest comes
Build/boot tested on arm and sparc.
They got new warning 'syscall bpf not implemented'
V5->V6:
- fixed few checkpatch warnings.
  Few lines are still over 80 char, but fixing them will reduce readability
- rebased
V5 thread:
https://lkml.org/lkml/2014/8/24/107
All patches:
git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master
Alexei Starovoitov (6):
  net: filter: add "load 64-bit immediate" eBPF instruction
  net: filter: split filter.h and expose eBPF to user space
  bpf: introduce syscall(BPF, ...) and BPF maps
  bpf: enable bpf syscall on x64 and i386
  bpf: add lookup/update/delete/iterate methods to BPF maps
  bpf: add hashtable type of BPF maps
 Documentation/networking/filter.txt |   79 +++++++-
 arch/x86/net/bpf_jit_comp.c         |   17 ++
 arch/x86/syscalls/syscall_32.tbl    |    1 +
 arch/x86/syscalls/syscall_64.tbl    |    1 +
 include/linux/bpf.h                 |   50 +++++
 include/linux/filter.h              |  294 +--------------------------
 include/linux/syscalls.h            |    3 +-
 include/uapi/asm-generic/unistd.h   |    4 +-
 include/uapi/linux/Kbuild           |    1 +
 include/uapi/linux/bpf.h            |  371 ++++++++++++++++++++++++++++++++++
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/core.c                   |    5 +
 kernel/bpf/hashtab.c                |  373 +++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c                |  354 +++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                     |    3 +
 lib/test_bpf.c                      |   21 ++
 16 files changed, 1282 insertions(+), 297 deletions(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 include/uapi/linux/bpf.h
 create mode 100644 kernel/bpf/hashtab.c
 create mode 100644 kernel/bpf/syscall.c
-- 
1.7.9.5
^ permalink raw reply	[flat|nested] 24+ messages in thread* [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction 2014-08-26 1:00 [PATCH v6 net-next 0/6] introduce BPF syscall Alexei Starovoitov @ 2014-08-26 1:00 ` Alexei Starovoitov 2014-08-26 1:06 ` David Miller 2014-08-26 1:00 ` [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps Alexei Starovoitov ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev, linux-kernel add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction: insn[0].code = BPF_LD | BPF_DW | BPF_IMM insn[0].dst_reg = destination register insn[0].imm = lower 32-bit insn[1].code = 0 insn[1].imm = upper 32-bit All unused fields must be zero. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. x64 JITs it as single 'movabsq %rax, imm64' arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn Note that old eBPF programs are binary compatible with new interpreter. In the following patches this instruction is used to store eBPF map pointers: BPF_LD_IMM64(R1, const_imm_map_ptr) BPF_CALL(BPF_FUNC_map_lookup_elem) and verifier is introduced to check validity of the programs. Later LLVM compiler is using this insn as generic load of 64-bit immediate constant and as a load of map pointer with relocation. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> --- Documentation/networking/filter.txt | 8 +++++++- arch/x86/net/bpf_jit_comp.c | 17 +++++++++++++++++ include/linux/filter.h | 18 ++++++++++++++++++ kernel/bpf/core.c | 5 +++++ lib/test_bpf.c | 21 +++++++++++++++++++++ 5 files changed, 68 insertions(+), 1 deletion(-) diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c48a9704bda8..81916ab5d96f 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -951,7 +951,7 @@ Size modifier is one of ... Mode modifier is one of: - BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ BPF_ABS 0x20 BPF_IND 0x40 BPF_MEM 0x60 @@ -995,6 +995,12 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and 2 byte atomic increments are not supported. +eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + Testing ------- diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index b08a98c59530..98837147ee57 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -393,6 +393,23 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT1_off32(add_1reg(0xB8, dst_reg), imm32); break; + case BPF_LD | BPF_IMM | BPF_DW: + if (insn[1].code != 0 || insn[1].src_reg != 0 || + insn[1].dst_reg != 0 || insn[1].off != 0) { + /* verifier must catch invalid insns */ + pr_err("invalid BPF_LD_IMM64 insn\n"); + return -EINVAL; + } + + /* movabsq %rax, imm64 */ + EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg)); + EMIT(insn[0].imm, 4); + EMIT(insn[1].imm, 4); + + insn++; + i++; + break; + /* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */ case BPF_ALU | BPF_MOD | BPF_X: case BPF_ALU | BPF_DIV | BPF_X: diff --git a/include/linux/filter.h b/include/linux/filter.h index a5227ab8ccb1..f3262b598262 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -161,6 +161,24 @@ enum { .off = 0, \ .imm = IMM }) +/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ +#define BPF_LD_IMM64(DST, IMM) \ + BPF_LD_IMM64_RAW(DST, 0, IMM) + +#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_LD | BPF_DW | BPF_IMM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = (__u32) (IMM) }), \ + ((struct bpf_insn) { \ + .code = 0, /* zero is reserved opcode */ \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = ((__u64) (IMM)) >> 32 }) + /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 7f0dbcbb34af..0434c2170f2b 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -180,6 +180,7 @@ static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn) [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W, [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H, [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B, + [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW, }; void *ptr; int off; @@ -239,6 +240,10 @@ select_insn: ALU64_MOV_K: DST = IMM; CONT; + LD_IMM_DW: + DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32; + insn++; + CONT; ALU64_ARSH_X: (*(s64 *) &DST) >>= SRC; CONT; diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 8c66c6aace04..46ab1a7ef135 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -1735,6 +1735,27 @@ static struct bpf_test tests[] = { { }, { { 1, 0 } }, }, + { + "load 64-bit immediate", + .u.insns_int = { + BPF_LD_IMM64(R1, 0x567800001234L), + BPF_MOV64_REG(R2, R1), + BPF_MOV64_REG(R3, R2), + BPF_ALU64_IMM(BPF_RSH, R2, 32), + BPF_ALU64_IMM(BPF_LSH, R3, 32), + BPF_ALU64_IMM(BPF_RSH, R3, 32), + BPF_ALU64_IMM(BPF_MOV, R0, 0), + BPF_JMP_IMM(BPF_JEQ, R2, 0x5678, 1), + BPF_EXIT_INSN(), + BPF_JMP_IMM(BPF_JEQ, R3, 0x1234, 1), + BPF_EXIT_INSN(), + BPF_ALU64_IMM(BPF_MOV, R0, 1), + BPF_EXIT_INSN(), + }, + INTERNAL, + { }, + { { 0, 1 } } + }, }; static struct net_device dev; -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction 2014-08-26 1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov @ 2014-08-26 1:06 ` David Miller 2014-08-26 1:35 ` Alexei Starovoitov 2014-08-26 4:12 ` Alexei Starovoitov 0 siblings, 2 replies; 24+ messages in thread From: David Miller @ 2014-08-26 1:06 UTC (permalink / raw) To: ast Cc: mingo, torvalds, luto, rostedt, dborkman, chema, edumazet, a.p.zijlstra, brendan.d.gregg, namhyung, hpa, akpm, keescook, linux-api, netdev, linux-kernel From: Alexei Starovoitov <ast@plumgrid.com> Date: Mon, 25 Aug 2014 18:00:53 -0700 > add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. I think you need to rethink this. I understand that you want to be able to compile arbitrary C code into eBPF, but you have to restrict strongly what data the eBPF code can get to. Arbitrary pointer loads is asking for trouble. Instead I would rather you look into a model like what the quake engine uses for it's VM. Namely, the program can do loads and stores from/to a data section, but all of them are validated to be in the range of the VM program's image. eBPF programs should only be able to access things using either: 1) Well defined entry/exit points for control transfer 2) Load/Store within a private limited data segment range used only by the eBPF program I don't want the eBPF program to be able to get "out of it's box" in any way shape or form. And besides, you're only making this thing as an optimization right? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction 2014-08-26 1:06 ` David Miller @ 2014-08-26 1:35 ` Alexei Starovoitov [not found] ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2014-08-26 4:12 ` Alexei Starovoitov 1 sibling, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:35 UTC (permalink / raw) To: David Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote: > From: Alexei Starovoitov <ast@plumgrid.com> > Date: Mon, 25 Aug 2014 18:00:53 -0700 > >> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. > > I think you need to rethink this. > > I understand that you want to be able to compile arbitrary C code into > eBPF, but you have to restrict strongly what data the eBPF code can get > to. I believe verifier already does restrict it. I don't see any holes in the architecture. I'm probably not explaining it clearly though :( > Arbitrary pointer loads is asking for trouble. Of course. There is no arbitrary pointer from user space. Verifier checks all pointers. I guess this commit log description is confusing. It says: BPF_LD_IMM64(R1, const_imm_map_ptr) that's what appears in the program _after_ it goes through verifier. User space cannot pass a pointer into the kernel. > Instead I would rather you look into a model like what the quake > engine uses for it's VM. > > Namely, the program can do loads and stores from/to a data section, > but all of them are validated to be in the range of the VM program's > image. That's exactly what eBPF does as well. load and stores can only be from three 'sections': pointer to stack, pointer to context, pointer to map value. All verifier logic including these pointer checks is described in extensive verifier documentation. Please take a look at it. Andy said that it's good doc :) Programs like: BPF_LD_IMM64(R1, some_64bit_constant) *(u8*)(R1+ off) = imm; will be rejected by verifier. I even have a test case for that later in the patches. > eBPF programs should only be able to access things using either: > > 1) Well defined entry/exit points for control transfer > > 2) Load/Store within a private limited data segment range used > only by the eBPF program > > I don't want the eBPF program to be able to get "out of it's box" > in any way shape or form. Agree 100%. And I believe I already achieved that with verifier. > And besides, you're only making this thing as an optimization right? not quite. This instruction is needed to get rid of IDR for maps, speeded up lookup, made verifier much simpler and most important it cleaned up program<->maps interaction. I've described it better in V4 set: https://lkml.org/lkml/2014/8/13/111 Should I resend this patch together with all of the verifier patches? I included it here as #1, since patch #2 moves all eBPF macros into uapi/linux/bpf.h and this instruction by itself is completely harmless. It's verifier that needs to do proper checking. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction [not found] ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-26 1:38 ` Andy Lutomirski 2014-08-26 1:53 ` Alexei Starovoitov 0 siblings, 1 reply; 24+ messages in thread From: Andy Lutomirski @ 2014-08-26 1:38 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: > On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote: >> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> >> Date: Mon, 25 Aug 2014 18:00:53 -0700 >> >>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. >> >> I think you need to rethink this. >> >> I understand that you want to be able to compile arbitrary C code into >> eBPF, but you have to restrict strongly what data the eBPF code can get >> to. > > I believe verifier already does restrict it. I don't see any holes in > the architecture. I'm probably not explaining it clearly though :( > >> Arbitrary pointer loads is asking for trouble. > > Of course. > There is no arbitrary pointer from user space. > Verifier checks all pointers. > I guess this commit log description is confusing. > It says: > BPF_LD_IMM64(R1, const_imm_map_ptr) > that's what appears in the program _after_ it goes through verifier. > User space cannot pass a pointer into the kernel. If you don't intend for userspace to load a program that contains this instruction, then why does it need to be an instruction that the verifier rewrites? Why not have an instruction "load immediate relocated pointer" that contains a reference to a relocation table and have the JIT do it? That might be easier to understand than having the verifier do it, and it'll avoid committing to ABIs before we need them. --Andy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction 2014-08-26 1:38 ` Andy Lutomirski @ 2014-08-26 1:53 ` Alexei Starovoitov [not found] ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:53 UTC (permalink / raw) To: Andy Lutomirski Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast@plumgrid.com> wrote: >> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote: >>> From: Alexei Starovoitov <ast@plumgrid.com> >>> Date: Mon, 25 Aug 2014 18:00:53 -0700 >>> >>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. >>> >>> I think you need to rethink this. >>> >>> I understand that you want to be able to compile arbitrary C code into >>> eBPF, but you have to restrict strongly what data the eBPF code can get >>> to. >> >> I believe verifier already does restrict it. I don't see any holes in >> the architecture. I'm probably not explaining it clearly though :( >> >>> Arbitrary pointer loads is asking for trouble. >> >> Of course. >> There is no arbitrary pointer from user space. >> Verifier checks all pointers. >> I guess this commit log description is confusing. >> It says: >> BPF_LD_IMM64(R1, const_imm_map_ptr) >> that's what appears in the program _after_ it goes through verifier. >> User space cannot pass a pointer into the kernel. > > If you don't intend for userspace to load a program that contains this > instruction, then why does it need to be an instruction that the > verifier rewrites? Why not have an instruction "load immediate user space use _pseudo_ bpf_ld_imm64 instruction. _pseudo_ stands for using 'map_fd' as imm instead of pointer. > relocated pointer" that contains a reference to a relocation table and Andy, I guess you missed explanation in: https://lkml.org/lkml/2014/8/13/111 " Obviously user space doesn't know what kernel map pointer is associated with process-local map-FD. So it's using pseudo BPF_LD_IMM64 instruction. BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg Other values are reserved for now. (They will be used to implement global variables, strings and other constants and per-cpu areas in the future) So the programs look like: BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd), BPF_CALL(BPF_FUNC_map_lookup_elem), eBPF verifier scans the program for such pseudo instructions, converts process_local_map_fd -> in-kernel map pointer and drops 'pseudo' flag of BPF_LD_IMM64 instruction. " To rephrase it differently. BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd), is very much what you suggesting by "load immediate relocated pointer" Right? > have the JIT do it? That might be easier to understand than having > the verifier do it, and it'll avoid committing to ABIs before we need > them. that part I don't understand. The patch that handles pseudo_with_map_fd -> -> normal_with_kernel_pointer conversion is only 147 lines: https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/commit/?id=d82d3daa20465dfdc6b2a0094ad27de9edbb328b Cannot think of shorter version. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction [not found] ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-26 1:54 ` Andy Lutomirski [not found] ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Andy Lutomirski @ 2014-08-26 1:54 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:53 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: > On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: >> On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: >>> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote: >>>> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> >>>> Date: Mon, 25 Aug 2014 18:00:53 -0700 >>>> >>>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. >>>> >>>> I think you need to rethink this. >>>> >>>> I understand that you want to be able to compile arbitrary C code into >>>> eBPF, but you have to restrict strongly what data the eBPF code can get >>>> to. >>> >>> I believe verifier already does restrict it. I don't see any holes in >>> the architecture. I'm probably not explaining it clearly though :( >>> >>>> Arbitrary pointer loads is asking for trouble. >>> >>> Of course. >>> There is no arbitrary pointer from user space. >>> Verifier checks all pointers. >>> I guess this commit log description is confusing. >>> It says: >>> BPF_LD_IMM64(R1, const_imm_map_ptr) >>> that's what appears in the program _after_ it goes through verifier. >>> User space cannot pass a pointer into the kernel. >> >> If you don't intend for userspace to load a program that contains this >> instruction, then why does it need to be an instruction that the >> verifier rewrites? Why not have an instruction "load immediate > > user space use _pseudo_ bpf_ld_imm64 instruction. > _pseudo_ stands for using 'map_fd' as imm instead of pointer. > >> relocated pointer" that contains a reference to a relocation table and > > Andy, I guess you missed explanation in: > https://lkml.org/lkml/2014/8/13/111 > " > Obviously user space doesn't know what kernel map pointer is associated > with process-local map-FD. > So it's using pseudo BPF_LD_IMM64 instruction. > BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg > BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg > Other values are reserved for now. (They will be used to implement > global variables, strings and other constants and per-cpu areas in the future) > So the programs look like: > BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd), > BPF_CALL(BPF_FUNC_map_lookup_elem), > eBPF verifier scans the program for such pseudo instructions, converts > process_local_map_fd -> in-kernel map pointer > and drops 'pseudo' flag of BPF_LD_IMM64 instruction. > " Will a program that uses BPF_LD_IMM64 w/o the FPG_REG_1 thing be accepted? --Andy ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction [not found] ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2014-08-26 2:02 ` Alexei Starovoitov 0 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 2:02 UTC (permalink / raw) To: Andy Lutomirski Cc: David Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:54 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: > On Mon, Aug 25, 2014 at 6:53 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: >> On Mon, Aug 25, 2014 at 6:38 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote: >>> On Mon, Aug 25, 2014 at 6:35 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: >>>> On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote: >>>>> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> >>>>> Date: Mon, 25 Aug 2014 18:00:53 -0700 >>>>> >>>>>> add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. >>>>> >>>>> I think you need to rethink this. >>>>> >>>>> I understand that you want to be able to compile arbitrary C code into >>>>> eBPF, but you have to restrict strongly what data the eBPF code can get >>>>> to. >>>> >>>> I believe verifier already does restrict it. I don't see any holes in >>>> the architecture. I'm probably not explaining it clearly though :( >>>> >>>>> Arbitrary pointer loads is asking for trouble. >>>> >>>> Of course. >>>> There is no arbitrary pointer from user space. >>>> Verifier checks all pointers. >>>> I guess this commit log description is confusing. >>>> It says: >>>> BPF_LD_IMM64(R1, const_imm_map_ptr) >>>> that's what appears in the program _after_ it goes through verifier. >>>> User space cannot pass a pointer into the kernel. >>> >>> If you don't intend for userspace to load a program that contains this >>> instruction, then why does it need to be an instruction that the >>> verifier rewrites? Why not have an instruction "load immediate >> >> user space use _pseudo_ bpf_ld_imm64 instruction. >> _pseudo_ stands for using 'map_fd' as imm instead of pointer. >> >>> relocated pointer" that contains a reference to a relocation table and >> >> Andy, I guess you missed explanation in: >> https://lkml.org/lkml/2014/8/13/111 >> " >> Obviously user space doesn't know what kernel map pointer is associated >> with process-local map-FD. >> So it's using pseudo BPF_LD_IMM64 instruction. >> BPF_LD_IMM64 with src_reg == 0 -> generic move 64-bit immediate into dst_reg >> BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> mov map_fd into dst_reg >> Other values are reserved for now. (They will be used to implement >> global variables, strings and other constants and per-cpu areas in the future) >> So the programs look like: >> BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd), >> BPF_CALL(BPF_FUNC_map_lookup_elem), >> eBPF verifier scans the program for such pseudo instructions, converts >> process_local_map_fd -> in-kernel map pointer >> and drops 'pseudo' flag of BPF_LD_IMM64 instruction. >> " > > Will a program that uses BPF_LD_IMM64 w/o the FPG_REG_1 thing be accepted? If you mean the program like: BPF_LD_IMM64(BPF_REG_1, 0xdead), BPF_CALL(BPF_FUNC_map_lookup_elem), yes, it will be rejected, because type of R1 will not match map_lookup() argument constraints. See check_ld_imm() in verifier.c where it assigns the type during verification. There are 5 tests in verifier testsuite that test things around bpf_ld_imm64 and 2 tests around _pseudo_ bpf_ld_imm64. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction 2014-08-26 1:06 ` David Miller 2014-08-26 1:35 ` Alexei Starovoitov @ 2014-08-26 4:12 ` Alexei Starovoitov 1 sibling, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 4:12 UTC (permalink / raw) To: David Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:06 PM, David Miller <davem@davemloft.net> wrote: > > Instead I would rather you look into a model like what the quake > engine uses for it's VM. Thanks for the tip! I wasn't aware of quake vm. I've looked through several papers and slides. I'm surely missing something in what they're doing, but here is my comparison of eBPF vs QVM: - QVM ISA is stack based vs eBPF registers - pointer types are predefined by QVM ISA whereas eBPF relies on static verifier which is more extensible, since verifier can get progressively smarter with time without need to change interpreter, llvm and JITs, whereas QVM would need changes through the toolchain, interpreter, JITs to support new pointer type - QVM calls with negative values invoke helper functions, which is similar to eBPF calls. The difference is QVM keeps negative values while interpreting and doing run-time checking of arguments whereas eBPF is statically verifying all before interpreting - access to QVM 'local' memory is bounds checked at run-time, whereas eBPF does load/store bounds checking by static analysis I may be wrong, but it seems possible to side step QVM run-time checking, since their 'top of stack' is typeless and it seems possible to push constant as a pointer there. I'm biased, but eBPF seems like better architecture, more flexible, likely faster to interpret, simple JITs, more powerful compiler. The downside, of course, eBPF verifier is more complex than QVM which is mainly relying on run-time checks. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps 2014-08-26 1:00 [PATCH v6 net-next 0/6] introduce BPF syscall Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov @ 2014-08-26 1:00 ` Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 Alexei Starovoitov [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 3 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev, linux-kernel BPF syscall is a demux for different BPF releated commands. 'maps' is a generic storage of different types for sharing data between kernel and userspace. The maps can be created from user space via BPF syscall: - create a map with given type and attributes fd = bpf_map_create(map_type, struct nlattr *attr, int len) returns fd or negative error - close(fd) deletes the map Next patch allows userspace programs to populate/read maps that eBPF programs are concurrently updating. maps can have different types: hash, bloom filter, radix-tree, etc. The map is defined by: . type . max number of elements . key size in bytes . value size in bytes This patch establishes core infrastructure for BPF maps. Next patches implement lookup/update and hashtable type. More map types can be added in the future. syscall is using type-length-value style of passing arguments to be backwards compatible with future extensions to map attributes. Different map types may use different attributes as well. The concept of type-lenght-value is borrowed from netlink, but netlink itself is not applicable here, since BPF programs and maps can be used in NET-less configurations. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> --- Documentation/networking/filter.txt | 71 ++++++++++++++++ include/linux/bpf.h | 42 ++++++++++ include/uapi/linux/bpf.h | 24 ++++++ kernel/bpf/Makefile | 2 +- kernel/bpf/syscall.c | 156 +++++++++++++++++++++++++++++++++++ 5 files changed, 294 insertions(+), 1 deletion(-) create mode 100644 include/linux/bpf.h create mode 100644 kernel/bpf/syscall.c diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 81916ab5d96f..27a0a6c6acb4 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1001,6 +1001,77 @@ instruction that loads 64-bit immediate value into a dst_reg. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. +eBPF maps +--------- +'maps' is a generic storage of different types for sharing data between kernel +and userspace. + +The maps are accessed from user space via BPF syscall, which has commands: +- create a map with given type and attributes + map_fd = bpf_map_create(map_type, struct nlattr *attr, int len) + returns process-local file descriptor or negative error + +- lookup key in a given map + err = bpf_map_lookup_elem(int fd, void *key, void *value) + returns zero and stores found elem into value or negative error + +- create or update key/value pair in a given map + err = bpf_map_update_elem(int fd, void *key, void *value) + returns zero or negative error + +- find and delete element by key in a given map + err = bpf_map_delete_elem(int fd, void *key) + +- to delete map: close(fd) + Exiting process will delete maps automatically + +userspace programs uses this API to create/populate/read maps that eBPF programs +are concurrently updating. + +maps can have different types: hash, array, bloom filter, radix-tree, etc. + +The map is defined by: + . type + . max number of elements + . key size in bytes + . value size in bytes + +The maps are accesible from eBPF program with API: + void * bpf_map_lookup_elem(u32 map_fd, void *key); + int bpf_map_update_elem(u32 map_fd, void *key, void *value); + int bpf_map_delete_elem(u32 map_fd, void *key); + +The kernel replaces process-local map_fd with kernel internal map pointer, +while loading eBPF program. + +If eBPF verifier is configured to recognize extra calls in the program +bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like: + ... + ptr_to_value = bpf_map_lookup_elem(map_fd, key) + access memory range [ptr_to_value, ptr_to_value + value_size_in_bytes) + ... + prepare key2 and value2 on stack of key_size and value_size + err = bpf_map_update_elem(map_fd, key2, value2) + ... + +eBPF program cannot create or delete maps +(such calls will be unknown to verifier) + +During program loading the refcnt of used maps is incremented, so they don't get +deleted while program is running + +bpf_map_update_elem() can fail if maximum number of elements reached. +if key2 already exists, bpf_map_update_elem() replaces it with value2 atomically + +bpf_map_lookup_elem() returns NULL or ptr_to_value, so program must do +if (ptr_to_value != NULL) check before accessing it. +NULL means that element with given 'key' was not found. + +The verifier will check that the program accesses map elements within specified +size. It will not let programs pass junk values to bpf_map_*_elem() functions, +so these functions (implemented in C inside kernel) can safely access +the pointers in all cases. + Testing ------- diff --git a/include/linux/bpf.h b/include/linux/bpf.h new file mode 100644 index 000000000000..607ca53fe2af --- /dev/null +++ b/include/linux/bpf.h @@ -0,0 +1,42 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#ifndef _LINUX_BPF_H +#define _LINUX_BPF_H 1 + +#include <uapi/linux/bpf.h> +#include <linux/workqueue.h> + +struct bpf_map; +struct nlattr; + +/* map is generic key/value storage optionally accesible by eBPF programs */ +struct bpf_map_ops { + /* funcs callable from userspace (via syscall) */ + struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]); + void (*map_free)(struct bpf_map *); +}; + +struct bpf_map { + atomic_t refcnt; + enum bpf_map_type map_type; + u32 key_size; + u32 value_size; + u32 max_entries; + struct bpf_map_ops *ops; + struct work_struct work; +}; + +struct bpf_map_type_list { + struct list_head list_node; + struct bpf_map_ops *ops; + enum bpf_map_type type; +}; + +void bpf_register_map_type(struct bpf_map_type_list *tl); +void bpf_map_put(struct bpf_map *map); + +#endif /* _LINUX_BPF_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 45a09b46c578..de21c8ecf0bb 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -318,4 +318,28 @@ struct bpf_insn { __s32 imm; /* signed immediate constant */ }; +/* BPF syscall commands */ +enum bpf_cmd { + /* create a map with given type and attributes + * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len) + * returns fd or negative error + * map is deleted when fd is closed + */ + BPF_MAP_CREATE, +}; + +enum bpf_map_attributes { + BPF_MAP_UNSPEC, + BPF_MAP_KEY_SIZE, /* size of key in bytes */ + BPF_MAP_VALUE_SIZE, /* size of value in bytes */ + BPF_MAP_MAX_ENTRIES, /* maximum number of entries in a map */ + __BPF_MAP_ATTR_MAX, +}; +#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1) +#define BPF_MAP_MAX_ATTR_SIZE 65535 + +enum bpf_map_type { + BPF_MAP_TYPE_UNSPEC, +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 6a71145e2769..e9f7334ed07a 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1 +1 @@ -obj-y := core.o +obj-y := core.o syscall.o diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c new file mode 100644 index 000000000000..b929ba7fc48e --- /dev/null +++ b/kernel/bpf/syscall.c @@ -0,0 +1,156 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include <linux/bpf.h> +#include <linux/syscalls.h> +#include <net/netlink.h> +#include <linux/anon_inodes.h> + +static LIST_HEAD(bpf_map_types); + +static struct bpf_map *find_and_alloc_map(enum bpf_map_type type, + struct nlattr *tb[BPF_MAP_ATTR_MAX + 1]) +{ + struct bpf_map_type_list *tl; + struct bpf_map *map; + + list_for_each_entry(tl, &bpf_map_types, list_node) { + if (tl->type == type) { + map = tl->ops->map_alloc(tb); + if (IS_ERR(map)) + return map; + map->ops = tl->ops; + map->map_type = type; + return map; + } + } + return ERR_PTR(-EINVAL); +} + +/* boot time registration of different map implementations */ +void bpf_register_map_type(struct bpf_map_type_list *tl) +{ + list_add(&tl->list_node, &bpf_map_types); +} + +/* called from workqueue */ +static void bpf_map_free_deferred(struct work_struct *work) +{ + struct bpf_map *map = container_of(work, struct bpf_map, work); + + /* implementation dependent freeing */ + map->ops->map_free(map); +} + +/* decrement map refcnt and schedule it for freeing via workqueue + * (unrelying map implementation ops->map_free() might sleep) + */ +void bpf_map_put(struct bpf_map *map) +{ + if (atomic_dec_and_test(&map->refcnt)) { + INIT_WORK(&map->work, bpf_map_free_deferred); + schedule_work(&map->work); + } +} + +static int bpf_map_release(struct inode *inode, struct file *filp) +{ + struct bpf_map *map = filp->private_data; + + bpf_map_put(map); + return 0; +} + +static const struct file_operations bpf_map_fops = { + .release = bpf_map_release, +}; + +static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = { + [BPF_MAP_KEY_SIZE] = { .type = NLA_U32 }, + [BPF_MAP_VALUE_SIZE] = { .type = NLA_U32 }, + [BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 }, +}; + +/* called via syscall */ +static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len) +{ + struct nlattr *tb[BPF_MAP_ATTR_MAX + 1]; + struct bpf_map *map; + struct nlattr *attr; + int err; + + if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE) + return -EINVAL; + + attr = kmalloc(len, GFP_USER); + if (!attr) + return -ENOMEM; + + /* copy map attributes from user space */ + err = -EFAULT; + if (copy_from_user(attr, uattr, len) != 0) + goto free_attr; + + /* perform basic validation */ + err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy); + if (err < 0) + goto free_attr; + + /* find map type and init map: hashtable vs rbtree vs bloom vs ... */ + map = find_and_alloc_map(type, tb); + if (IS_ERR(map)) { + err = PTR_ERR(map); + goto free_attr; + } + + atomic_set(&map->refcnt, 1); + + err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC); + + if (err < 0) + /* failed to allocate fd */ + goto free_map; + + /* user supplied array of map attributes is no longer needed */ + kfree(attr); + + return err; + +free_map: + map->ops->map_free(map); +free_attr: + kfree(attr); + return err; +} + +SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3, + unsigned long, arg4, unsigned long, arg5) +{ + /* eBPF syscall is limited to root temporarily. This restriction will + * be lifted when verifier has enough mileage and security audit is + * clean. Note that tracing/networking analytics use cases will be + * turning off 'secure' mode of verifier, since they need to pass + * kernel data back to user space + */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (arg5 != 0) + return -EINVAL; + + switch (cmd) { + case BPF_MAP_CREATE: + return map_create((enum bpf_map_type) arg2, + (struct nlattr __user *) arg3, (int) arg4); + default: + return -EINVAL; + } +} -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 2014-08-26 1:00 [PATCH v6 net-next 0/6] introduce BPF syscall Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps Alexei Starovoitov @ 2014-08-26 1:00 ` Alexei Starovoitov [not found] ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 3 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev, linux-kernel done as separate commit to ease conflict resolution Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> --- arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 3 ++- include/uapi/asm-generic/unistd.h | 4 +++- kernel/sys_ni.c | 3 +++ 5 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index 028b78168d85..9fe1b5d002f0 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -363,3 +363,4 @@ 354 i386 seccomp sys_seccomp 355 i386 getrandom sys_getrandom 356 i386 memfd_create sys_memfd_create +357 i386 bpf sys_bpf diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 35dd922727b9..281150b539a2 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -327,6 +327,7 @@ 318 common getrandom sys_getrandom 319 common memfd_create sys_memfd_create 320 common kexec_file_load sys_kexec_file_load +321 common bpf sys_bpf # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 0f86d85a9ce4..61bc112b9fa5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -875,5 +875,6 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags, const char __user *uargs); asmlinkage long sys_getrandom(char __user *buf, size_t count, unsigned int flags); - +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3, + unsigned long arg4, unsigned long arg5); #endif diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 11d11bc5c78f..22749c134117 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp) __SYSCALL(__NR_getrandom, sys_getrandom) #define __NR_memfd_create 279 __SYSCALL(__NR_memfd_create, sys_memfd_create) +#define __NR_bpf 280 +__SYSCALL(__NR_bpf, sys_bpf) #undef __NR_syscalls -#define __NR_syscalls 280 +#define __NR_syscalls 281 /* * All syscalls below here should go away really, diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 391d4ddb6f4b..b4b5083f5f5e 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp); /* operate on Secure Computing state */ cond_syscall(sys_seccomp); + +/* access BPF programs and maps */ +cond_syscall(sys_bpf); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>]
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> @ 2014-08-26 1:07 ` David Miller [not found] ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> 2014-08-26 3:52 ` Stephen Hemminger 1 sibling, 1 reply; 24+ messages in thread From: David Miller @ 2014-08-26 1:07 UTC (permalink / raw) To: ast-uqk4Ao+rVK5Wk0Htik3J/w Cc: mingo-DgEjT+Ai2ygdnm+yROfE0A, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, luto-kltTT9wpgjJwATOyAt5JVQ, rostedt-nx8X9YLhiw1AfugRpC6u6w, dborkman-H+wXaHxf7aLQT0dZR+AlfA, chema-hpIqsD4AKlfQT0dZR+AlfA, edumazet-hpIqsD4AKlfQT0dZR+AlfA, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, brendan.d.gregg-Re5JQEeQqe8AvxtiuMwx3w, namhyung-DgEjT+Ai2ygdnm+yROfE0A, hpa-YMNOUZJC4hwAvxtiuMwx3w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, keescook-F7+t8E8rja9g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> Date: Mon, 25 Aug 2014 18:00:56 -0700 > - > +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3, > + unsigned long arg4, unsigned long arg5); Please do not add interfaces with opaque types as arguments. It is impossible for the compiler to type check the args at compile time when userspace tries to use this stuff. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>]
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> @ 2014-08-26 1:43 ` Alexei Starovoitov 2014-08-26 7:45 ` Ingo Molnar 0 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:43 UTC (permalink / raw) To: David Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote: > From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> > Date: Mon, 25 Aug 2014 18:00:56 -0700 > >> - >> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3, >> + unsigned long arg4, unsigned long arg5); > > Please do not add interfaces with opaque types as arguments. > > It is impossible for the compiler to type check the args at > compile time when userspace tries to use this stuff. I share this concern. I went with single BPF syscall, because alternative is 6 syscalls for every command and more syscalls in the future when we'd need to add another command. I think type casting is much lesser evil. We already have similar muxing syscalls. It feels to me that single mux/demux syscall is easier to support, document, add new commands. Type casting, yeah, not pretty. Most users will be using wrappers similar to those I've defined in libbpf.h ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 2014-08-26 1:43 ` Alexei Starovoitov @ 2014-08-26 7:45 ` Ingo Molnar 2014-08-26 16:29 ` Alexei Starovoitov 0 siblings, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2014-08-26 7:45 UTC (permalink / raw) To: Alexei Starovoitov Cc: David Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML * Alexei Starovoitov <ast@plumgrid.com> wrote: > On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem@davemloft.net> wrote: > > From: Alexei Starovoitov <ast@plumgrid.com> > > Date: Mon, 25 Aug 2014 18:00:56 -0700 > > > >> - > >> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3, > >> + unsigned long arg4, unsigned long arg5); > > > > Please do not add interfaces with opaque types as arguments. > > > > It is impossible for the compiler to type check the args at > > compile time when userspace tries to use this stuff. > > I share this concern. I went with single BPF syscall, because > alternative is 6 syscalls for every command and more > syscalls in the future when we'd need to add another command. We had a similar problem growing the perf syscall - and we were able to hold to a single syscall, which I think has served us well. Had we gone with a per functionality syscall we'd have something like a dozen syscalls today, scattered all around non-continuously in the syscall space on most platforms. But note that 'opaque or non-opaque' is a false dichotomy, as there are 3 options in reality: what we used instead of an opaque type was an extensible data type, and extensible C structure, with structure size expectations part of the structure. See 'struct perf_event_attr': SYSCALL_DEFINE5(perf_event_open, struct perf_event_attr __user *, attr_uptr, pid_t, pid, int, cpu, int, group_fd, unsigned long, flags) That way new versions of the data type are immediately obvious to the kernel, and compatibility can be handled well. Smaller, previous versions received from old user-space are padded out transparently to the kernel's value of the structure, with zeroes filled in. See perf_copy_attr() in kernel/events/core.c. Instead of versioning the structure, we use its size as a finegrained and robust version indicator in essence. That way it's both forwards and backwards compatible, as much as possible technically: old kernel can run new user-space, and new user-space will be able to take advantage of as much of an old kernel's capabilities as possible, and in the typical case of version match there's no extra overhead worth speaking of. This way we were able to gradually grow to the sophisticated ABI you can find in include/uapi/linux/perf_event.h, without having to touch the syscall interface. (It's not the only method: we also have a handful of ioctls, where that's the most natural interface for a perf event fd.) Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 2014-08-26 7:45 ` Ingo Molnar @ 2014-08-26 16:29 ` Alexei Starovoitov 0 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 16:29 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Tue, Aug 26, 2014 at 12:45 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Alexei Starovoitov <ast@plumgrid.com> wrote: > >> On Mon, Aug 25, 2014 at 6:07 PM, David Miller <davem@davemloft.net> wrote: >> > From: Alexei Starovoitov <ast@plumgrid.com> >> > Date: Mon, 25 Aug 2014 18:00:56 -0700 >> > >> >> - >> >> +asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3, >> >> + unsigned long arg4, unsigned long arg5); >> > >> > Please do not add interfaces with opaque types as arguments. >> > >> > It is impossible for the compiler to type check the args at >> > compile time when userspace tries to use this stuff. >> >> I share this concern. I went with single BPF syscall, because >> alternative is 6 syscalls for every command and more >> syscalls in the future when we'd need to add another command. > > See 'struct perf_event_attr': > > SYSCALL_DEFINE5(perf_event_open, > struct perf_event_attr __user *, attr_uptr, > pid_t, pid, int, cpu, int, group_fd, unsigned long, flags) > > This way we were able to gradually grow to the sophisticated ABI > you can find in include/uapi/linux/perf_event.h, without having > to touch the syscall interface. (It's not the only method: we > also have a handful of ioctls, where that's the most natural > interface for a perf event fd.) Thanks! that's a good alternative. One approach would be to have two bpf syscalls: map_fd = bpf_map_create(map_type, nlattr) + bunch of ioctl()s that do lookup/update/delete/... prog_fd = bpf_prog_load(prog_type, nlattr) but ioctl()s would still do a lot of type casting. so how about single syscall: fd = bpf(int cmd, union bpf_attr *attrs, int size) union bpf_attr { struct bpf_map_create_attr { /* used by bpf_map_create cmd */ __u32 key_size, value_size, ... }; struct bpf_map_access_attr { /* by lookup/update/delete/get_next */ int map_fd; void __user *key; /* used by all */ void __user *value; /* used by lookup/update */ void __user *next_key; /* used by get_next */ }; struct nlattr prog_attr[0]; /* used by bpf_prog_load cmd */ }; If you prefer I can make prog_load attrs to be explicit struct as well, but nlattr feels cleaner, since there can be optional sections that are quite large. Like one with 'readonly constants' that I'm still working on as part of cleaner strings support. Also I think I will change tracing attachment interface. Currently write("prog_fd_as_int") -> "/sys/../tracing/events/..." does the attachment, but as Andy noticed that may be not secure enough if there is fork() somewhere in the process that does open() of debugfs and write(). Looks like another ioctl() like PERF_EVENT_IOC_SET_FILTER_BPF on top of perf_event_open() would be cleaner and event->owner check can be done easily. Can I create kprobe event through perf_event_open() ? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 2014-08-26 1:07 ` David Miller @ 2014-08-26 3:52 ` Stephen Hemminger 2014-08-26 4:24 ` Alexei Starovoitov 1 sibling, 1 reply; 24+ messages in thread From: Stephen Hemminger @ 2014-08-26 3:52 UTC (permalink / raw) To: Alexei Starovoitov Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA Per discussion at Kernel Summit. Every new syscall requires a manual page and test programs. We have had too many new syscalls that are DOA. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 2014-08-26 3:52 ` Stephen Hemminger @ 2014-08-26 4:24 ` Alexei Starovoitov 2014-08-26 7:46 ` Ingo Molnar 0 siblings, 1 reply; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 4:24 UTC (permalink / raw) To: Stephen Hemminger Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger <stephen@networkplumber.org> wrote: > Per discussion at Kernel Summit. Every new syscall requires > a manual page and test programs. We have had too many new syscalls > that are DOA. There is verifier testsuite that is testing eBPF verifier from userspace via bpf syscall. Also there are multiple examples and libbpf. I think test coverage for bpf syscall is quite substantial already. As far as manpage I already mentioned my plan in the other thread. I will do it when stuff lands, since writing docs in english is the most difficult part of these patches. I already rewrote Documentation/../filter.txt several times. Even once Kees noticed discrepancies there when interface changed from map_id to map_fd. To do a manpage I'll use a help from tech writer, since I want it to be better than docs I put in Documentation/... ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 2014-08-26 4:24 ` Alexei Starovoitov @ 2014-08-26 7:46 ` Ingo Molnar [not found] ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2014-08-26 7:46 UTC (permalink / raw) To: Alexei Starovoitov Cc: Stephen Hemminger, David S. Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML * Alexei Starovoitov <ast@plumgrid.com> wrote: > On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger > <stephen@networkplumber.org> wrote: > > Per discussion at Kernel Summit. Every new syscall requires > > a manual page and test programs. We have had too many new syscalls > > that are DOA. > > There is verifier testsuite that is testing eBPF verifier from userspace > via bpf syscall. Also there are multiple examples and libbpf. > I think test coverage for bpf syscall is quite substantial already. This is in tools/bpf/, right? Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2014-08-26 8:00 ` Daniel Borkmann [not found] ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Daniel Borkmann @ 2014-08-26 8:00 UTC (permalink / raw) To: Ingo Molnar Cc: Alexei Starovoitov, Stephen Hemminger, David S. Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On 08/26/2014 09:46 AM, Ingo Molnar wrote: > * Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: >> On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger >> <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org> wrote: >>> Per discussion at Kernel Summit. Every new syscall requires >>> a manual page and test programs. We have had too many new syscalls >>> that are DOA. >> >> There is verifier testsuite that is testing eBPF verifier from userspace >> via bpf syscall. Also there are multiple examples and libbpf. >> I think test coverage for bpf syscall is quite substantial already. > > This is in tools/bpf/, right? No, it contains a BPF JIT disasm, bpf assembler and a debugger, but the last two are for the 'classic' BPF interface only. There's a test suite for BPF/eBPF in general under lib/test_bpf.c, but so far it tests only the current code w/o eBPF verifier. That said, I think Alexei is referring to the examples et al from the bigger previous proposed patch set. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2014-08-26 8:02 ` Ingo Molnar [not found] ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Ingo Molnar @ 2014-08-26 8:02 UTC (permalink / raw) To: Daniel Borkmann Cc: Alexei Starovoitov, Stephen Hemminger, David S. Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML * Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On 08/26/2014 09:46 AM, Ingo Molnar wrote: > >* Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: > >>On Mon, Aug 25, 2014 at 8:52 PM, Stephen Hemminger > >><stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org> wrote: > >>>Per discussion at Kernel Summit. Every new syscall requires > >>>a manual page and test programs. We have had too many new syscalls > >>>that are DOA. > >> > >>There is verifier testsuite that is testing eBPF verifier from userspace > >>via bpf syscall. Also there are multiple examples and libbpf. > >>I think test coverage for bpf syscall is quite substantial already. > > > >This is in tools/bpf/, right? > > No, it contains a BPF JIT disasm, bpf assembler and a debugger, > but the last two are for the 'classic' BPF interface only. > There's a test suite for BPF/eBPF in general under > lib/test_bpf.c, but so far it tests only the current code w/o > eBPF verifier. > > That said, I think Alexei is referring to the examples et al > from the bigger previous proposed patch set. I mean, if all the testing already exists, it should be part of an initial submission and such. Thanks, Ingo ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 [not found] ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2014-08-26 16:40 ` Alexei Starovoitov 0 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 16:40 UTC (permalink / raw) To: Ingo Molnar Cc: Daniel Borkmann, Stephen Hemminger, David S. Miller, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, Linux API, Network Development, LKML On Tue, Aug 26, 2014 at 1:02 AM, Ingo Molnar <mingo-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > >> That said, I think Alexei is referring to the examples et al >> from the bigger previous proposed patch set. > > I mean, if all the testing already exists, it should be part of > an initial submission and such. That's what I did. V5 set contains all examples and verifier testsuite: https://lkml.org/lkml/2014/8/24/107 Dave asked to split it up into subsets, since it's too big too review at once. which makes sense, so this V6 is only first part and the plan was: 1st(this) set - introduces uapi/linux/bpf.h and BPF syscall for maps only 2nd set will extend BPF syscall with programs and verifier 3rd set will use eBPF in tracing, add samples and verifier tests 4th set will have llvm and C examples Each set is around 6-8 patches instead of 29 in V5 ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>]
* [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> @ 2014-08-26 1:00 ` Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 6/6] bpf: add hashtable type of " Alexei Starovoitov 2 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA eBPF can be used from user space. uapi/linux/bpf.h: eBPF instruction set definition linux/filter.h: the rest This patch only moves macro definitions, but practically it freezes existing eBPF instruction set, though new instructions can still be added in the future. These eBPF definitions cannot go into uapi/linux/filter.h, since the names may conflict with existing applications. Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> --- include/linux/filter.h | 312 +------------------------------------------ include/uapi/linux/Kbuild | 1 + include/uapi/linux/bpf.h | 321 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 323 insertions(+), 311 deletions(-) create mode 100644 include/uapi/linux/bpf.h diff --git a/include/linux/filter.h b/include/linux/filter.h index f3262b598262..f04793474d16 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -9,322 +9,12 @@ #include <linux/skbuff.h> #include <linux/workqueue.h> #include <uapi/linux/filter.h> - -/* Internally used and optimized filter representation with extended - * instruction set based on top of classic BPF. - */ - -/* instruction classes */ -#define BPF_ALU64 0x07 /* alu mode in double word width */ - -/* ld/ldx fields */ -#define BPF_DW 0x18 /* double word */ -#define BPF_XADD 0xc0 /* exclusive add */ - -/* alu/jmp fields */ -#define BPF_MOV 0xb0 /* mov reg to reg */ -#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */ - -/* change endianness of a register */ -#define BPF_END 0xd0 /* flags for endianness conversion: */ -#define BPF_TO_LE 0x00 /* convert to little-endian */ -#define BPF_TO_BE 0x08 /* convert to big-endian */ -#define BPF_FROM_LE BPF_TO_LE -#define BPF_FROM_BE BPF_TO_BE - -#define BPF_JNE 0x50 /* jump != */ -#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */ -#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */ -#define BPF_CALL 0x80 /* function call */ -#define BPF_EXIT 0x90 /* function return */ - -/* Register numbers */ -enum { - BPF_REG_0 = 0, - BPF_REG_1, - BPF_REG_2, - BPF_REG_3, - BPF_REG_4, - BPF_REG_5, - BPF_REG_6, - BPF_REG_7, - BPF_REG_8, - BPF_REG_9, - BPF_REG_10, - __MAX_BPF_REG, -}; - -/* BPF has 10 general purpose 64-bit registers and stack frame. */ -#define MAX_BPF_REG __MAX_BPF_REG - -/* ArgX, context and stack frame pointer register positions. Note, - * Arg1, Arg2, Arg3, etc are used as argument mappings of function - * calls in BPF_CALL instruction. - */ -#define BPF_REG_ARG1 BPF_REG_1 -#define BPF_REG_ARG2 BPF_REG_2 -#define BPF_REG_ARG3 BPF_REG_3 -#define BPF_REG_ARG4 BPF_REG_4 -#define BPF_REG_ARG5 BPF_REG_5 -#define BPF_REG_CTX BPF_REG_6 -#define BPF_REG_FP BPF_REG_10 - -/* Additional register mappings for converted user programs. */ -#define BPF_REG_A BPF_REG_0 -#define BPF_REG_X BPF_REG_7 -#define BPF_REG_TMP BPF_REG_8 - -/* BPF program can access up to 512 bytes of stack space. */ -#define MAX_BPF_STACK 512 - -/* Helper macros for filter block array initializers. */ - -/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ - -#define BPF_ALU64_REG(OP, DST, SRC) \ - ((struct bpf_insn) { \ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -#define BPF_ALU32_REG(OP, DST, SRC) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_OP(OP) | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */ - -#define BPF_ALU64_IMM(OP, DST, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -#define BPF_ALU32_IMM(OP, DST, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_OP(OP) | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */ - -#define BPF_ENDIAN(TYPE, DST, LEN) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = LEN }) - -/* Short form of mov, dst_reg = src_reg */ - -#define BPF_MOV64_REG(DST, SRC) \ - ((struct bpf_insn) { \ - .code = BPF_ALU64 | BPF_MOV | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -#define BPF_MOV32_REG(DST, SRC) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_MOV | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -/* Short form of mov, dst_reg = imm32 */ - -#define BPF_MOV64_IMM(DST, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU64 | BPF_MOV | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -#define BPF_MOV32_IMM(DST, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_MOV | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ -#define BPF_LD_IMM64(DST, IMM) \ - BPF_LD_IMM64_RAW(DST, 0, IMM) - -#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_LD | BPF_DW | BPF_IMM, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = (__u32) (IMM) }), \ - ((struct bpf_insn) { \ - .code = 0, /* zero is reserved opcode */ \ - .dst_reg = 0, \ - .src_reg = 0, \ - .off = 0, \ - .imm = ((__u64) (IMM)) >> 32 }) - -/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ - -#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = IMM }) - -#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = IMM }) - -/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */ - -#define BPF_LD_ABS(SIZE, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \ - .dst_reg = 0, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */ - -#define BPF_LD_IND(SIZE, SRC, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \ - .dst_reg = 0, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = IMM }) - -/* Memory load, dst_reg = *(uint *) (src_reg + off16) */ - -#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \ - ((struct bpf_insn) { \ - .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = OFF, \ - .imm = 0 }) - -/* Memory store, *(uint *) (dst_reg + off16) = src_reg */ - -#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \ - ((struct bpf_insn) { \ - .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = OFF, \ - .imm = 0 }) - -/* Memory store, *(uint *) (dst_reg + off16) = imm32 */ - -#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \ - ((struct bpf_insn) { \ - .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = OFF, \ - .imm = IMM }) - -/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */ - -#define BPF_JMP_REG(OP, DST, SRC, OFF) \ - ((struct bpf_insn) { \ - .code = BPF_JMP | BPF_OP(OP) | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = OFF, \ - .imm = 0 }) - -/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ - -#define BPF_JMP_IMM(OP, DST, IMM, OFF) \ - ((struct bpf_insn) { \ - .code = BPF_JMP | BPF_OP(OP) | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = OFF, \ - .imm = IMM }) - -/* Function call */ - -#define BPF_EMIT_CALL(FUNC) \ - ((struct bpf_insn) { \ - .code = BPF_JMP | BPF_CALL, \ - .dst_reg = 0, \ - .src_reg = 0, \ - .off = 0, \ - .imm = ((FUNC) - __bpf_call_base) }) - -/* Raw code statement block */ - -#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \ - ((struct bpf_insn) { \ - .code = CODE, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = OFF, \ - .imm = IMM }) - -/* Program exit */ - -#define BPF_EXIT_INSN() \ - ((struct bpf_insn) { \ - .code = BPF_JMP | BPF_EXIT, \ - .dst_reg = 0, \ - .src_reg = 0, \ - .off = 0, \ - .imm = 0 }) - -#define bytes_to_bpf_size(bytes) \ -({ \ - int bpf_size = -EINVAL; \ - \ - if (bytes == sizeof(u8)) \ - bpf_size = BPF_B; \ - else if (bytes == sizeof(u16)) \ - bpf_size = BPF_H; \ - else if (bytes == sizeof(u32)) \ - bpf_size = BPF_W; \ - else if (bytes == sizeof(u64)) \ - bpf_size = BPF_DW; \ - \ - bpf_size; \ -}) +#include <uapi/linux/bpf.h> /* Macro to invoke filter function. */ #define SK_RUN_FILTER(filter, ctx) \ (*filter->prog->bpf_func)(ctx, filter->prog->insnsi) -struct bpf_insn { - __u8 code; /* opcode */ - __u8 dst_reg:4; /* dest register */ - __u8 src_reg:4; /* source register */ - __s16 off; /* signed offset */ - __s32 imm; /* signed immediate constant */ -}; - #ifdef CONFIG_COMPAT /* A struct sock_filter is architecture independent. */ struct compat_sock_fprog { diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index 24e9033f8b3f..fb3f7b675229 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -67,6 +67,7 @@ header-y += bfs_fs.h header-y += binfmts.h header-y += blkpg.h header-y += blktrace_api.h +header-y += bpf.h header-y += bpqether.h header-y += bsg.h header-y += btrfs.h diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h new file mode 100644 index 000000000000..45a09b46c578 --- /dev/null +++ b/include/uapi/linux/bpf.h @@ -0,0 +1,321 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + */ +#ifndef _UAPI__LINUX_BPF_H__ +#define _UAPI__LINUX_BPF_H__ + +#include <linux/types.h> + +/* Extended instruction set based on top of classic BPF */ + +/* instruction classes */ +#define BPF_ALU64 0x07 /* alu mode in double word width */ + +/* ld/ldx fields */ +#define BPF_DW 0x18 /* double word */ +#define BPF_XADD 0xc0 /* exclusive add */ + +/* alu/jmp fields */ +#define BPF_MOV 0xb0 /* mov reg to reg */ +#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */ + +/* change endianness of a register */ +#define BPF_END 0xd0 /* flags for endianness conversion: */ +#define BPF_TO_LE 0x00 /* convert to little-endian */ +#define BPF_TO_BE 0x08 /* convert to big-endian */ +#define BPF_FROM_LE BPF_TO_LE +#define BPF_FROM_BE BPF_TO_BE + +#define BPF_JNE 0x50 /* jump != */ +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */ +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */ +#define BPF_CALL 0x80 /* function call */ +#define BPF_EXIT 0x90 /* function return */ + +/* Register numbers */ +enum { + BPF_REG_0 = 0, + BPF_REG_1, + BPF_REG_2, + BPF_REG_3, + BPF_REG_4, + BPF_REG_5, + BPF_REG_6, + BPF_REG_7, + BPF_REG_8, + BPF_REG_9, + BPF_REG_10, + __MAX_BPF_REG, +}; + +/* BPF has 10 general purpose 64-bit registers and stack frame. */ +#define MAX_BPF_REG __MAX_BPF_REG + +/* ArgX, context and stack frame pointer register positions. Note, + * Arg1, Arg2, Arg3, etc are used as argument mappings of function + * calls in BPF_CALL instruction. + */ +#define BPF_REG_ARG1 BPF_REG_1 +#define BPF_REG_ARG2 BPF_REG_2 +#define BPF_REG_ARG3 BPF_REG_3 +#define BPF_REG_ARG4 BPF_REG_4 +#define BPF_REG_ARG5 BPF_REG_5 +#define BPF_REG_CTX BPF_REG_6 +#define BPF_REG_FP BPF_REG_10 + +/* Additional register mappings for converted user programs. */ +#define BPF_REG_A BPF_REG_0 +#define BPF_REG_X BPF_REG_7 +#define BPF_REG_TMP BPF_REG_8 + +/* BPF program can access up to 512 bytes of stack space. */ +#define MAX_BPF_STACK 512 + +/* Helper macros for filter block array initializers. */ + +/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ + +#define BPF_ALU64_REG(OP, DST, SRC) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + +#define BPF_ALU32_REG(OP, DST, SRC) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + +/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */ + +#define BPF_ALU64_IMM(OP, DST, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +#define BPF_ALU32_IMM(OP, DST, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */ + +#define BPF_ENDIAN(TYPE, DST, LEN) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = LEN }) + +/* Short form of mov, dst_reg = src_reg */ + +#define BPF_MOV64_REG(DST, SRC) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_MOV | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + +#define BPF_MOV32_REG(DST, SRC) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_MOV | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + +/* Short form of mov, dst_reg = imm32 */ + +#define BPF_MOV64_IMM(DST, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_MOV | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +#define BPF_MOV32_IMM(DST, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_MOV | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ +#define BPF_LD_IMM64(DST, IMM) \ + BPF_LD_IMM64_RAW(DST, 0, IMM) + +#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_LD | BPF_DW | BPF_IMM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = (__u32) (IMM) }), \ + ((struct bpf_insn) { \ + .code = 0, /* zero is reserved opcode */ \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = ((__u64) (IMM)) >> 32 }) + +/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ + +#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = IMM }) + +#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = IMM }) + +/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */ + +#define BPF_LD_ABS(SIZE, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = IMM }) + +/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */ + +#define BPF_LD_IND(SIZE, SRC, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \ + .dst_reg = 0, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = IMM }) + +/* Memory load, dst_reg = *(uint *) (src_reg + off16) */ + +#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + +/* Memory store, *(uint *) (dst_reg + off16) = src_reg */ + +#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + +/* Memory store, *(uint *) (dst_reg + off16) = imm32 */ + +#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \ + ((struct bpf_insn) { \ + .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + +/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */ + +#define BPF_JMP_REG(OP, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + +/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ + +#define BPF_JMP_IMM(OP, DST, IMM, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + +/* Function call */ + +#define BPF_EMIT_CALL(FUNC) \ + ((struct bpf_insn) { \ + .code = BPF_JMP | BPF_CALL, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = ((FUNC) - __bpf_call_base) }) + +/* Raw code statement block */ + +#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \ + ((struct bpf_insn) { \ + .code = CODE, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = IMM }) + +/* Program exit */ + +#define BPF_EXIT_INSN() \ + ((struct bpf_insn) { \ + .code = BPF_JMP | BPF_EXIT, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = 0 }) + +#define bytes_to_bpf_size(bytes) \ +({ \ + int bpf_size = -EINVAL; \ + \ + if (bytes == sizeof(u8)) \ + bpf_size = BPF_B; \ + else if (bytes == sizeof(u16)) \ + bpf_size = BPF_H; \ + else if (bytes == sizeof(u32)) \ + bpf_size = BPF_W; \ + else if (bytes == sizeof(u64)) \ + bpf_size = BPF_DW; \ + \ + bpf_size; \ +}) + +struct bpf_insn { + __u8 code; /* opcode */ + __u8 dst_reg:4; /* dest register */ + __u8 src_reg:4; /* source register */ + __s16 off; /* signed offset */ + __s32 imm; /* signed immediate constant */ +}; + +#endif /* _UAPI__LINUX_BPF_H__ */ -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 2014-08-26 1:00 ` [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space Alexei Starovoitov @ 2014-08-26 1:00 ` Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 6/6] bpf: add hashtable type of " Alexei Starovoitov 2 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA 'maps' is a generic storage of different types for sharing data between kernel and userspace. The maps are accessed from user space via BPF syscall, which has commands: - create a map with given type and attributes fd = bpf_map_create(map_type, struct nlattr *attr, int len) returns fd or negative error - lookup key in a given map referenced by fd err = bpf_map_lookup_elem(int fd, void *key, void *value) returns zero and stores found elem into value or negative error - create or update key/value pair in a given map err = bpf_map_update_elem(int fd, void *key, void *value) returns zero or negative error - find and delete element by key in a given map err = bpf_map_delete_elem(int fd, void *key) - iterate map elements (based on input key return next_key) err = bpf_map_get_next_key(int fd, void *key, void *next_key) - close(fd) deletes the map Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> --- include/linux/bpf.h | 8 ++ include/uapi/linux/bpf.h | 25 ++++++ kernel/bpf/syscall.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 231 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 607ca53fe2af..fd1ac4b5ba8b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -9,6 +9,7 @@ #include <uapi/linux/bpf.h> #include <linux/workqueue.h> +#include <linux/file.h> struct bpf_map; struct nlattr; @@ -18,6 +19,12 @@ struct bpf_map_ops { /* funcs callable from userspace (via syscall) */ struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]); void (*map_free)(struct bpf_map *); + int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key); + + /* funcs callable from userspace and from eBPF programs */ + void *(*map_lookup_elem)(struct bpf_map *map, void *key); + int (*map_update_elem)(struct bpf_map *map, void *key, void *value); + int (*map_delete_elem)(struct bpf_map *map, void *key); }; struct bpf_map { @@ -38,5 +45,6 @@ struct bpf_map_type_list { void bpf_register_map_type(struct bpf_map_type_list *tl); void bpf_map_put(struct bpf_map *map); +struct bpf_map *bpf_map_get(struct fd f); #endif /* _LINUX_BPF_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index de21c8ecf0bb..f68edb2681f8 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -326,6 +326,31 @@ enum bpf_cmd { * map is deleted when fd is closed */ BPF_MAP_CREATE, + + /* lookup key in a given map + * err = bpf_map_lookup_elem(int fd, void *key, void *value) + * returns zero and stores found elem into value + * or negative error + */ + BPF_MAP_LOOKUP_ELEM, + + /* create or update key/value pair in a given map + * err = bpf_map_update_elem(int fd, void *key, void *value) + * returns zero or negative error + */ + BPF_MAP_UPDATE_ELEM, + + /* find and delete elem by key in a given map + * err = bpf_map_delete_elem(int fd, void *key) + * returns zero or negative error + */ + BPF_MAP_DELETE_ELEM, + + /* lookup key in a given map and return next key + * err = bpf_map_get_elem(int fd, void *key, void *next_key) + * returns zero and stores next key or negative error + */ + BPF_MAP_GET_NEXT_KEY, }; enum bpf_map_attributes { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b929ba7fc48e..c16bf2052c64 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -13,6 +13,7 @@ #include <linux/syscalls.h> #include <net/netlink.h> #include <linux/anon_inodes.h> +#include <linux/file.h> static LIST_HEAD(bpf_map_types); @@ -131,6 +132,189 @@ free_attr: return err; } +/* if error is returned, fd is released. + * On success caller should complete fd access with matching fdput() + */ +struct bpf_map *bpf_map_get(struct fd f) +{ + struct bpf_map *map; + + if (!f.file) + return ERR_PTR(-EBADF); + + if (f.file->f_op != &bpf_map_fops) { + fdput(f); + return ERR_PTR(-EINVAL); + } + + map = f.file->private_data; + + return map; +} + +static int map_lookup_elem(int ufd, void __user *ukey, void __user *uvalue) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + void *key, *value; + int err; + + map = bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + + err = -ENOMEM; + key = kmalloc(map->key_size, GFP_USER); + if (!key) + goto err_put; + + err = -EFAULT; + if (copy_from_user(key, ukey, map->key_size) != 0) + goto free_key; + + err = -ESRCH; + rcu_read_lock(); + value = map->ops->map_lookup_elem(map, key); + if (!value) + goto err_unlock; + + err = -EFAULT; + if (copy_to_user(uvalue, value, map->value_size) != 0) + goto err_unlock; + + err = 0; + +err_unlock: + rcu_read_unlock(); +free_key: + kfree(key); +err_put: + fdput(f); + return err; +} + +static int map_update_elem(int ufd, void __user *ukey, void __user *uvalue) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + void *key, *value; + int err; + + map = bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + + err = -ENOMEM; + key = kmalloc(map->key_size, GFP_USER); + if (!key) + goto err_put; + + err = -EFAULT; + if (copy_from_user(key, ukey, map->key_size) != 0) + goto free_key; + + err = -ENOMEM; + value = kmalloc(map->value_size, GFP_USER); + if (!value) + goto free_key; + + err = -EFAULT; + if (copy_from_user(value, uvalue, map->value_size) != 0) + goto free_value; + + /* eBPF program that use maps are running under rcu_read_lock(), + * therefore all map accessors rely on this fact, so do the same here + */ + rcu_read_lock(); + err = map->ops->map_update_elem(map, key, value); + rcu_read_unlock(); + +free_value: + kfree(value); +free_key: + kfree(key); +err_put: + fdput(f); + return err; +} + +static int map_delete_elem(int ufd, void __user *ukey) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + void *key; + int err; + + map = bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + + err = -ENOMEM; + key = kmalloc(map->key_size, GFP_USER); + if (!key) + goto err_put; + + err = -EFAULT; + if (copy_from_user(key, ukey, map->key_size) != 0) + goto free_key; + + rcu_read_lock(); + err = map->ops->map_delete_elem(map, key); + rcu_read_unlock(); + +free_key: + kfree(key); +err_put: + fdput(f); + return err; +} + +static int map_get_next_key(int ufd, void __user *ukey, void __user *unext_key) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + void *key, *next_key; + int err; + + map = bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + + err = -ENOMEM; + key = kmalloc(map->key_size, GFP_USER); + if (!key) + goto err_put; + + err = -EFAULT; + if (copy_from_user(key, ukey, map->key_size) != 0) + goto free_key; + + err = -ENOMEM; + next_key = kmalloc(map->key_size, GFP_USER); + if (!next_key) + goto free_key; + + rcu_read_lock(); + err = map->ops->map_get_next_key(map, key, next_key); + rcu_read_unlock(); + if (err) + goto free_next_key; + + err = -EFAULT; + if (copy_to_user(unext_key, next_key, map->key_size) != 0) + goto free_next_key; + + err = 0; + +free_next_key: + kfree(next_key); +free_key: + kfree(key); +err_put: + fdput(f); + return err; +} + SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -150,6 +334,20 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3, case BPF_MAP_CREATE: return map_create((enum bpf_map_type) arg2, (struct nlattr __user *) arg3, (int) arg4); + case BPF_MAP_LOOKUP_ELEM: + return map_lookup_elem((int) arg2, (void __user *) arg3, + (void __user *) arg4); + case BPF_MAP_UPDATE_ELEM: + return map_update_elem((int) arg2, (void __user *) arg3, + (void __user *) arg4); + case BPF_MAP_DELETE_ELEM: + if (arg4 != 0) + return -EINVAL; + return map_delete_elem((int) arg2, (void __user *) arg3); + + case BPF_MAP_GET_NEXT_KEY: + return map_get_next_key((int) arg2, (void __user *) arg3, + (void __user *) arg4); default: return -EINVAL; } -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v6 net-next 6/6] bpf: add hashtable type of BPF maps [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 2014-08-26 1:00 ` [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space Alexei Starovoitov 2014-08-26 1:00 ` [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov @ 2014-08-26 1:00 ` Alexei Starovoitov 2 siblings, 0 replies; 24+ messages in thread From: Alexei Starovoitov @ 2014-08-26 1:00 UTC (permalink / raw) To: David S. Miller Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra, Brendan Gregg, Namhyung Kim, H. Peter Anvin, Andrew Morton, Kees Cook, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA add new map type BPF_MAP_TYPE_HASH and its implementation - key/value are opaque range of bytes - user space provides 3 configuration attributes via BPF syscall: key_size, value_size, max_entries - if value_size == 0, the map is used as a set - map_update_elem() must fail to insert new element when max_entries limit is reached - map takes care of allocating/freeing key/value pairs - update/lookup/delete methods may be called from eBPF program attached to kprobes, so use spin_lock_irqsave() mechanism for concurrent updates - optimized for speed of lookup() which can be called multiple times from eBPF program which itself is triggered by high volume of events - in the future JIT compiler may recognize lookup() call and optimize it further, since key_size is constant for life of eBPF program Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> --- Note, lib/rhashtable.c couldn't be reused here, since it's not managing the objects, shrink/expand is not supported out of irq context, overhead of generic key/head_offsets and hashfn is too high. In the future if/when rhashtable gains shrink/expand as a separate thread, we may add another BPF_MAP_TYPE_ for it, but security implications of dynamically resizeable maps would need to be analyzed. include/uapi/linux/bpf.h | 1 + kernel/bpf/Makefile | 2 +- kernel/bpf/hashtab.c | 373 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 375 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/hashtab.c diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index f68edb2681f8..8069ab7b64cf 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -365,6 +365,7 @@ enum bpf_map_attributes { enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, + BPF_MAP_TYPE_HASH, }; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index e9f7334ed07a..558e12712ebc 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1 +1 @@ -obj-y := core.o syscall.o +obj-y := core.o syscall.o hashtab.o diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c new file mode 100644 index 000000000000..3178a6746b92 --- /dev/null +++ b/kernel/bpf/hashtab.c @@ -0,0 +1,373 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include <linux/bpf.h> +#include <net/netlink.h> +#include <linux/jhash.h> + +struct bpf_htab { + struct bpf_map map; + struct hlist_head *buckets; + struct kmem_cache *elem_cache; + spinlock_t lock; + u32 count; /* number of elements in this hashtable */ + u32 n_buckets; /* number of hash buckets */ + u32 elem_size; /* size of each element in bytes */ +}; + +/* each htab element is struct htab_elem + key + value */ +struct htab_elem { + struct hlist_node hash_node; + struct rcu_head rcu; + struct bpf_htab *htab; + u32 hash; + u32 pad; + char key[0]; +}; + +#define BPF_MAP_MAX_KEY_SIZE 256 +static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1]) +{ + struct bpf_htab *htab; + int err, i; + + htab = kzalloc(sizeof(*htab), GFP_USER); + if (!htab) + return ERR_PTR(-ENOMEM); + + /* look for mandatory map attributes */ + err = -EINVAL; + if (!attr[BPF_MAP_KEY_SIZE]) + goto free_htab; + htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]); + + if (!attr[BPF_MAP_VALUE_SIZE]) + goto free_htab; + htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]); + + if (!attr[BPF_MAP_MAX_ENTRIES]) + goto free_htab; + htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]); + + /* check sanity of attributes. + * value_size == 0 is allowed, in this case map is used as a set + */ + if (htab->map.max_entries == 0 || htab->map.key_size == 0) + goto free_htab; + + /* hash table size must be power of 2 */ + htab->n_buckets = roundup_pow_of_two(htab->map.max_entries); + + err = -E2BIG; + if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE) + goto free_htab; + + err = -ENOMEM; + htab->buckets = kmalloc_array(htab->n_buckets, + sizeof(struct hlist_head), GFP_USER); + + if (!htab->buckets) + goto free_htab; + + for (i = 0; i < htab->n_buckets; i++) + INIT_HLIST_HEAD(&htab->buckets[i]); + + spin_lock_init(&htab->lock); + htab->count = 0; + + htab->elem_size = sizeof(struct htab_elem) + + round_up(htab->map.key_size, 8) + + htab->map.value_size; + + htab->elem_cache = kmem_cache_create("bpf_htab", htab->elem_size, 0, 0, + NULL); + if (!htab->elem_cache) + goto free_buckets; + + return &htab->map; + +free_buckets: + kfree(htab->buckets); +free_htab: + kfree(htab); + return ERR_PTR(err); +} + +static inline u32 htab_map_hash(const void *key, u32 key_len) +{ + return jhash(key, key_len, 0); +} + +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) +{ + return &htab->buckets[hash & (htab->n_buckets - 1)]; +} + +static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash, + void *key, u32 key_size) +{ + struct htab_elem *l; + + hlist_for_each_entry_rcu(l, head, hash_node) { + if (l->hash == hash && !memcmp(&l->key, key, key_size)) + return l; + } + return NULL; +} + +/* Must be called with rcu_read_lock. */ +static void *htab_map_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + struct hlist_head *head; + struct htab_elem *l; + u32 hash, key_size; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + key_size = map->key_size; + + hash = htab_map_hash(key, key_size); + + head = select_bucket(htab, hash); + + l = lookup_elem_raw(head, hash, key, key_size); + + if (l) + return l->key + round_up(map->key_size, 8); + + return NULL; +} + +/* Must be called with rcu_read_lock. */ +static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + struct hlist_head *head; + struct htab_elem *l, *next_l; + u32 hash, key_size; + int i; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + key_size = map->key_size; + + hash = htab_map_hash(key, key_size); + + head = select_bucket(htab, hash); + + /* lookup the key */ + l = lookup_elem_raw(head, hash, key, key_size); + + if (!l) { + i = 0; + goto find_first_elem; + } + + /* key was found, get next key in the same bucket */ + next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)), + struct htab_elem, hash_node); + + if (next_l) { + /* if next elem in this hash list is non-zero, just return it */ + memcpy(next_key, next_l->key, key_size); + return 0; + } + + /* no more elements in this hash list, go to the next bucket */ + i = hash & (htab->n_buckets - 1); + i++; + +find_first_elem: + /* iterate over buckets */ + for (; i < htab->n_buckets; i++) { + head = select_bucket(htab, i); + + /* pick first element in the bucket */ + next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)), + struct htab_elem, hash_node); + if (next_l) { + /* if it's not empty, just return it */ + memcpy(next_key, next_l->key, key_size); + return 0; + } + } + + /* itereated over all buckets and all elements */ + return -ENOENT; +} + +static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab) +{ + void *l; + + l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC); + if (!l) + return ERR_PTR(-ENOMEM); + return l; +} + +static void free_htab_elem_rcu(struct rcu_head *rcu) +{ + struct htab_elem *l = container_of(rcu, struct htab_elem, rcu); + + kmem_cache_free(l->htab->elem_cache, l); +} + +static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l) +{ + l->htab = htab; + call_rcu(&l->rcu, free_htab_elem_rcu); +} + +/* Must be called with rcu_read_lock. */ +static int htab_map_update_elem(struct bpf_map *map, void *key, void *value) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + struct htab_elem *l_new, *l_old; + struct hlist_head *head; + unsigned long flags; + u32 key_size; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + l_new = htab_alloc_elem(htab); + if (IS_ERR(l_new)) + return -ENOMEM; + + key_size = map->key_size; + + memcpy(l_new->key, key, key_size); + memcpy(l_new->key + round_up(key_size, 8), value, map->value_size); + + l_new->hash = htab_map_hash(l_new->key, key_size); + + /* bpf_map_update_elem() can be called in_irq() as well, so + * spin_lock() or spin_lock_bh() cannot be used + */ + spin_lock_irqsave(&htab->lock, flags); + + head = select_bucket(htab, l_new->hash); + + l_old = lookup_elem_raw(head, l_new->hash, key, key_size); + + if (!l_old && unlikely(htab->count >= map->max_entries)) { + /* if elem with this 'key' doesn't exist and we've reached + * max_entries limit, fail insertion of new elem + */ + spin_unlock_irqrestore(&htab->lock, flags); + kmem_cache_free(htab->elem_cache, l_new); + return -EFBIG; + } + + /* add new element to the head of the list, so that concurrent + * search will find it before old elem + */ + hlist_add_head_rcu(&l_new->hash_node, head); + if (l_old) { + hlist_del_rcu(&l_old->hash_node); + release_htab_elem(htab, l_old); + } else { + htab->count++; + } + spin_unlock_irqrestore(&htab->lock, flags); + + return 0; +} + +/* Must be called with rcu_read_lock. */ +static int htab_map_delete_elem(struct bpf_map *map, void *key) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + struct hlist_head *head; + struct htab_elem *l; + unsigned long flags; + u32 hash, key_size; + int ret = -ESRCH; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + key_size = map->key_size; + + hash = htab_map_hash(key, key_size); + + spin_lock_irqsave(&htab->lock, flags); + + head = select_bucket(htab, hash); + + l = lookup_elem_raw(head, hash, key, key_size); + + if (l) { + hlist_del_rcu(&l->hash_node); + htab->count--; + release_htab_elem(htab, l); + ret = 0; + } + + spin_unlock_irqrestore(&htab->lock, flags); + return ret; +} + +static void delete_all_elements(struct bpf_htab *htab) +{ + int i; + + for (i = 0; i < htab->n_buckets; i++) { + struct hlist_head *head = select_bucket(htab, i); + struct hlist_node *n; + struct htab_elem *l; + + hlist_for_each_entry_safe(l, n, head, hash_node) { + hlist_del_rcu(&l->hash_node); + htab->count--; + kmem_cache_free(htab->elem_cache, l); + } + } +} + +/* called when map->refcnt goes to zero */ +static void htab_map_free(struct bpf_map *map) +{ + struct bpf_htab *htab = container_of(map, struct bpf_htab, map); + + /* wait for all outstanding updates to complete */ + synchronize_rcu(); + + /* kmem_cache_free all htab elements */ + delete_all_elements(htab); + + /* and destroy cache, which might sleep */ + kmem_cache_destroy(htab->elem_cache); + + kfree(htab->buckets); + kfree(htab); +} + +static struct bpf_map_ops htab_ops = { + .map_alloc = htab_map_alloc, + .map_free = htab_map_free, + .map_get_next_key = htab_map_get_next_key, + .map_lookup_elem = htab_map_lookup_elem, + .map_update_elem = htab_map_update_elem, + .map_delete_elem = htab_map_delete_elem, +}; + +static struct bpf_map_type_list tl = { + .ops = &htab_ops, + .type = BPF_MAP_TYPE_HASH, +}; + +static int __init register_htab_map(void) +{ + bpf_register_map_type(&tl); + return 0; +} +late_initcall(register_htab_map); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
end of thread, other threads:[~2014-08-26 16:40 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-26  1:00 [PATCH v6 net-next 0/6] introduce BPF syscall Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 1/6] net: filter: add "load 64-bit immediate" eBPF instruction Alexei Starovoitov
2014-08-26  1:06   ` David Miller
2014-08-26  1:35     ` Alexei Starovoitov
     [not found]       ` <CAMEtUuxdQpkX8t1_szde=Q1ALcp5t7rRyK+zEDafj27_J2LzVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  1:38         ` Andy Lutomirski
2014-08-26  1:53           ` Alexei Starovoitov
     [not found]             ` <CAMEtUuwz9MEei+tjWx4Fv8cK_zc9TKVbWxQEAE+yWvxRMa793g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  1:54               ` Andy Lutomirski
     [not found]                 ` <CALCETrWLvjt_D2B2sYoQtXeU1_9-005BfPcuYDKp75GbPk68dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-26  2:02                   ` Alexei Starovoitov
2014-08-26  4:12     ` Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 3/6] bpf: introduce syscall(BPF, ...) and BPF maps Alexei Starovoitov
2014-08-26  1:00 ` [PATCH v6 net-next 4/6] bpf: enable bpf syscall on x64 and i386 Alexei Starovoitov
     [not found]   ` <1409014858-1410-5-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2014-08-26  1:07     ` David Miller
     [not found]       ` <20140825.180718.137768107010295086.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2014-08-26  1:43         ` Alexei Starovoitov
2014-08-26  7:45           ` Ingo Molnar
2014-08-26 16:29             ` Alexei Starovoitov
2014-08-26  3:52     ` Stephen Hemminger
2014-08-26  4:24       ` Alexei Starovoitov
2014-08-26  7:46         ` Ingo Molnar
     [not found]           ` <20140826074655.GB19799-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-26  8:00             ` Daniel Borkmann
     [not found]               ` <53FC3E9A.1020108-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-26  8:02                 ` Ingo Molnar
     [not found]                   ` <20140826080231.GA20565-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-08-26 16:40                     ` Alexei Starovoitov
     [not found] ` <1409014858-1410-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2014-08-26  1:00   ` [PATCH v6 net-next 2/6] net: filter: split filter.h and expose eBPF to user space Alexei Starovoitov
2014-08-26  1:00   ` [PATCH v6 net-next 5/6] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov
2014-08-26  1:00   ` [PATCH v6 net-next 6/6] bpf: add hashtable type of " Alexei Starovoitov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).