[PATCH RFC 0/4] net: add bpfilter

* [PATCH RFC 0/4] net: add bpfilter
@ 2018-02-16 13:40 Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 1/4] modules: allow insmod load regular elf binaries Daniel Borkmann
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 13:40 UTC
  To: netdev; +Cc: netfilter-devel, davem, alexei.starovoitov, Daniel Borkmann

This is a very rough and early proof of concept that implements bpfilter.
The basic idea of bpfilter is that it processes iptables requests and
translates them in user space into BPF programs, which can then be
attached at various locations. For simplicity, this RFC demos attaching
them at the XDP layer, but any other location would work as well (e.g.
the tc sch_clsact ingress/egress hooks or any other/new hook with
equivalent semantics).

As a benefit of this design, we get BPF JIT compilation on x86_64, arm64,
ppc64, sparc64, mips64, s390x and arm32, and also rule offloading into HW
for free on Netronome NFP SmartNICs, which are already capable of
offloading BPF, since we can reuse all existing BPF infrastructure as the
back end. The user space iptables binary issuing rule additions or dumps
was left as-is, so in the long term any binary built against the iptables
uapi kernel interface could be supported transparently in this manner.

As rule translation can potentially become very complex, it is performed
entirely in user space. To ease deployment, the request_module() code is
extended to allow user mode helpers to be invoked. The idea is that user
mode helpers are built as part of the kernel build and installed like
traditional kernel modules, with a .ko file extension, into the
distro-specified location, so that from a distribution point of view they
are no different from regular kernel modules. Thus, allow the
request_module() logic to load such user mode helper (umh) binaries via:

  request_module("foo") ->
    call_umh("modprobe foo") ->
      sys_finit_module(FD of /lib/modules/.../foo.ko) ->
        call_umh(struct file)
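
As an illustration only, the dispatch at the end of that chain can be
modelled in user space as a check on the ELF header type: ET_REL images
take the regular module path, ET_EXEC images get executed as umh. This is
a minimal sketch, not the kernel code; classify_ko() and enum load_action
are hypothetical names, and only the 64-bit ELF header layout is handled:

```c
#include <elf.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum load_action {
	LOAD_AS_MODULE,	/* ET_REL: regular relocatable .ko */
	RUN_AS_UMH,	/* ET_EXEC: execute as user mode helper */
};

/*
 * Classify a .ko image by its ELF header. Returns 0 and sets *action on
 * success, or -ENOEXEC if the image is not a usable ELF object.
 */
int classify_ko(const void *image, size_t len, enum load_action *action)
{
	Elf64_Ehdr hdr;

	if (len < sizeof(hdr))
		return -ENOEXEC;
	memcpy(&hdr, image, sizeof(hdr));	/* avoid unaligned access */

	if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
		return -ENOEXEC;

	switch (hdr.e_type) {
	case ET_EXEC:
		*action = RUN_AS_UMH;
		return 0;
	case ET_REL:
		*action = LOAD_AS_MODULE;
		return 0;
	default:
		return -ENOEXEC;
	}
}
```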

This approach enables the kernel to delegate functionality traditionally
implemented by kernel modules to user space processes (either root or
!root) and reduces the security attack surface of such new code: in case
of bugs, only the umh would crash, not the kernel. Another advantage is
that bpfilter.ko can be debugged and tested from user space as well
(e.g. opening up the possibility to run clang sanitizers, fuzzers or
test suites for checking the translation). The architecture also makes
the kernel/user boundary very precise: requests are handled and
translated to BPF in the control plane part in user space, with its own
user memory etc, while only minimal data plane bits live in the kernel.
It would also allow removing old xtables modules from the kernel at some
point while keeping the functionality in place.

The proof of concept shows that rules matching simple /32 src/dst IPs are
translated in this manner. More complex rules would be added later, as
would different BPF code generation backends that can be selected for the
various attachment points, a proper encoder/decoder for the uapi requests,
etc. This starts out very simple and basic for the sake of an early RFC
to demo the idea.

The examples below show that dumping, loading and offloading of one or
multiple simple rules works. They also show the bpftool XDP dump of the
generated BPF instruction sequence, as well as a simple functional ping
test verifying that the policy is enforced.

Set rebased on top of 255442c93843 ("Merge tag 'docs-4.16' of [...]").

Feedback very welcome!

Various bpfilter usage examples from the PoC code:

1) Dumping current rules:

  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

2) ping test:

  # ping -c 1 127.0.0.1 -I 127.0.0.2
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.2 : 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.040 ms

    --- 127.0.0.1 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms

3) Adding & dumping a simple rule:

  # iptables -t filter -A INPUT -i lo -s 127.0.0.2/32 -d 127.0.0.1/32 -j DROP
  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  DROP       all  --  127.0.0.2            localhost

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

4) Dump BPF generated code for that rule (on lo it's XDP generic, otherwise
   native XDP for XDP supported drivers):

  # bpftool p
    18: xdp  tag 6b07f663830d5b0c
        loaded_at Feb 14/01:15  uid 0
        xlated 208B  not jited  memlock 4096B
  # bpftool p d x i 18
   0: (bf) r9 = r1
   1: (79) r2 = *(u64 *)(r9 +0)
   2: (79) r3 = *(u64 *)(r9 +8)
   3: (bf) r1 = r2
   4: (07) r1 += 14
   5: (bd) if r1 <= r3 goto pc+2
   6: (b4) (u32) r0 = (u32) 2
   7: (95) exit
   8: (bf) r1 = r2
   9: (b4) (u32) r5 = (u32) 0
  10: (69) r4 = *(u16 *)(r1 +12)
  11: (55) if r4 != 0x8 goto pc+9
  12: (07) r1 += 34
  13: (2d) if r1 > r3 goto pc+7
  14: (07) r1 += -20
  15: (61) r4 = *(u32 *)(r1 +12)
  16: (55) if r4 != 0x200007f goto pc+1
  17: (04) (u32) r5 += (u32) 1
  18: (61) r4 = *(u32 *)(r1 +16)
  19: (55) if r4 != 0x100007f goto pc+1
  20: (04) (u32) r5 += (u32) 1
  21: (55) if r5 != 0x2 goto pc+2
  22: (b4) (u32) r0 = (u32) 1
  23: (95) exit
  24: (b4) (u32) r0 = (u32) 2
  25: (95) exit
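
For readers less used to BPF dumps, the instruction sequence above can be
read roughly as the following C. This is only a user space model under
stated assumptions, not what bpfilter emits: filter_rule1() and the
SKETCH_* constants are hypothetical names, the frame is assumed to be
Ethernet + IPv4 with a 20-byte IP header (as the fixed offsets in the dump
imply), and addresses are compared bytewise so the sketch does not depend
on host endianness. Return values 1 and 2 correspond to XDP_DROP and
XDP_PASS:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Verdicts as numbered in the kernel's enum xdp_action. */
#define XDP_DROP 1
#define XDP_PASS 2

#define SKETCH_ETH_HLEN	14	/* Ethernet header length (insn 4) */
#define SKETCH_IP_HLEN	20	/* IPv4 header, no options (14 + 20 = 34) */

/* Model of: -s 127.0.0.2/32 -d 127.0.0.1/32 -j DROP */
int filter_rule1(const uint8_t *frame, size_t len)
{
	static const uint8_t src[4] = { 127, 0, 0, 2 };
	static const uint8_t dst[4] = { 127, 0, 0, 1 };
	const uint8_t *ip;
	int matched = 0;	/* r5 in the dump */

	/* insns 3-7: frame shorter than an Ethernet header */
	if (len < SKETCH_ETH_HLEN)
		return XDP_PASS;
	/* insns 10-11: EtherType is not IPv4 (bytes 08 00 on the wire) */
	if (frame[12] != 0x08 || frame[13] != 0x00)
		return XDP_PASS;
	/* insns 12-13: IPv4 header would run past the end of the frame */
	if (len < SKETCH_ETH_HLEN + SKETCH_IP_HLEN)
		return XDP_PASS;

	ip = frame + SKETCH_ETH_HLEN;		/* insn 14 */
	if (memcmp(ip + 12, src, 4) == 0)	/* insns 15-17: saddr */
		matched++;
	if (memcmp(ip + 16, dst, 4) == 0)	/* insns 18-20: daddr */
		matched++;

	/* insns 21-25: drop only if both criteria matched */
	return matched == 2 ? XDP_DROP : XDP_PASS;
}
```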

5) ping test:

  # ping -c 1 127.0.0.1 -I 127.0.0.2
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.2 : 56(84) bytes of data.

    --- 127.0.0.1 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

  # ping -c 1 127.0.0.1 -I 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 : 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.018 ms

    --- 127.0.0.1 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.018/0.018/0.018/0.000 ms

  # ping -c 1 127.0.0.2 -I 127.0.0.2
    PING 127.0.0.2 (127.0.0.2) from 127.0.0.2 : 56(84) bytes of data.
    64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.018 ms

    --- 127.0.0.2 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.018/0.018/0.018/0.000 ms

6) Adding & dumping a 2nd and 3rd rule:

  # iptables -t filter -A INPUT -i lo -s 127.0.0.4/32 -d 127.0.0.3/32 -j DROP
  # iptables -t filter -A INPUT -i lo -s 127.0.0.5/32 -j DROP
  # iptables -t filter -L
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  DROP       all  --  127.0.0.2            localhost
  DROP       all  --  127.0.0.4            127.0.0.3
  DROP       all  --  127.0.0.5            anywhere

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

7) Dump BPF generated code again:

  # bpftool p
    20: xdp  tag 19519bdd253cbfe5
        loaded_at Feb 14/01:17  uid 0
        xlated 440B  not jited  memlock 4096B
  # bpftool p d x i 20
   0: (bf) r9 = r1
   1: (79) r2 = *(u64 *)(r9 +0)
   2: (79) r3 = *(u64 *)(r9 +8)
   3: (bf) r1 = r2
   4: (07) r1 += 14
   5: (bd) if r1 <= r3 goto pc+2
   6: (b4) (u32) r0 = (u32) 2
   7: (95) exit
   8: (bf) r1 = r2
   9: (b4) (u32) r5 = (u32) 0
  10: (69) r4 = *(u16 *)(r1 +12)
  11: (55) if r4 != 0x8 goto pc+9
  12: (07) r1 += 34
  13: (2d) if r1 > r3 goto pc+7
  14: (07) r1 += -20
  15: (61) r4 = *(u32 *)(r1 +12)
  16: (55) if r4 != 0x200007f goto pc+1
  17: (04) (u32) r5 += (u32) 1
  18: (61) r4 = *(u32 *)(r1 +16)
  19: (55) if r4 != 0x100007f goto pc+1
  20: (04) (u32) r5 += (u32) 1
  21: (55) if r5 != 0x2 goto pc+2
  22: (b4) (u32) r0 = (u32) 1
  23: (95) exit
  24: (bf) r1 = r2
  25: (b4) (u32) r5 = (u32) 0
  26: (69) r4 = *(u16 *)(r1 +12)
  27: (55) if r4 != 0x8 goto pc+9
  28: (07) r1 += 34
  29: (2d) if r1 > r3 goto pc+7
  30: (07) r1 += -20
  31: (61) r4 = *(u32 *)(r1 +12)
  32: (55) if r4 != 0x400007f goto pc+1
  33: (04) (u32) r5 += (u32) 1
  34: (61) r4 = *(u32 *)(r1 +16)
  35: (55) if r4 != 0x300007f goto pc+1
  36: (04) (u32) r5 += (u32) 1
  37: (55) if r5 != 0x2 goto pc+2
  38: (b4) (u32) r0 = (u32) 1
  39: (95) exit
  40: (bf) r1 = r2
  41: (b4) (u32) r5 = (u32) 0
  42: (69) r4 = *(u16 *)(r1 +12)
  43: (55) if r4 != 0x8 goto pc+6
  44: (07) r1 += 34
  45: (2d) if r1 > r3 goto pc+4
  46: (07) r1 += -20
  47: (61) r4 = *(u32 *)(r1 +12)
  48: (55) if r4 != 0x500007f goto pc+1
  49: (04) (u32) r5 += (u32) 1
  50: (55) if r5 != 0x1 goto pc+2
  51: (b4) (u32) r0 = (u32) 1
  52: (95) exit
  53: (b4) (u32) r0 = (u32) 2
  54: (95) exit
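
The three-rule program repeats the same shape per rule: reset the match
counter, test each criterion the rule specifies, and drop if the count
equals the number of criteria; if no rule matches, fall through to the
default pass. A rough user space model of that structure follows. It uses
hypothetical names (struct sketch_rule, run_chain()), works on already
extracted addresses rather than raw frames, and assumes every rule has a
DROP target as in the example above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Verdicts as numbered in the kernel's enum xdp_action. */
#define XDP_DROP 1
#define XDP_PASS 2

/* Hypothetical rule representation for this sketch only. */
struct sketch_rule {
	bool has_src, has_dst;	/* which criteria the rule specifies */
	uint8_t src[4], dst[4];	/* /32 addresses in network byte order */
};

int run_chain(const struct sketch_rule *rules, size_t n,
	      const uint8_t src[4], const uint8_t dst[4])
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct sketch_rule *r = &rules[i];
		unsigned int want = r->has_src + r->has_dst;
		unsigned int got = 0;	/* r5, reset per rule */

		if (r->has_src && memcmp(r->src, src, 4) == 0)
			got++;
		if (r->has_dst && memcmp(r->dst, dst, 4) == 0)
			got++;
		if (got == want)	/* all criteria matched */
			return XDP_DROP;
	}
	return XDP_PASS;		/* chain policy ACCEPT */
}
```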

8) ping test again:

  # ping -c 1 127.0.0.4 -I 127.0.0.4
    PING 127.0.0.4 (127.0.0.4) from 127.0.0.4 : 56(84) bytes of data.
    64 bytes from 127.0.0.4: icmp_seq=1 ttl=64 time=0.032 ms

    --- 127.0.0.4 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.032/0.032/0.032/0.000 ms

  # ping -c 1 127.0.0.4 -I 127.0.0.3
    PING 127.0.0.4 (127.0.0.4) from 127.0.0.3 : 56(84) bytes of data.

    --- 127.0.0.4 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

  # ping -c 1 127.0.0.1 -I 127.0.0.2
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.2 : 56(84) bytes of data.

    --- 127.0.0.1 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

  # ping -c 1 127.0.0.1 -I 127.0.0.5
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.5 : 56(84) bytes of data.

    --- 127.0.0.1 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

9) Example test with offload into an nfp device:

  # ethtool -i enp2s0
    driver: nfp
    version: 4.15.0+ SMP mod_unload
    firmware-version: 0.0.5.5 0.17 bpf_xxxxxxx ebpf
    expansion-rom-version:
    bus-info: 0000:02:00.0
    supports-statistics: yes
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: yes
    supports-priv-flags: no

  # iptables -t filter -A INPUT -i enp2s0 -s 192.168.2.2/32 -j DROP

  # bpftool p
  1: xdp  tag 88896d0ae0f463a6 dev enp2s0  ( <-- offloaded into HW )
        loaded_at Feb 15/14:30  uid 0
        xlated 184B  jited 640B  memlock 4096B
  # bpftool p d x i 1
   0: (bf) r9 = r1
   1: (79) r2 = *(u64 *)(r9 +0)
   2: (79) r3 = *(u64 *)(r9 +8)
   3: (bf) r1 = r2
   4: (07) r1 += 14
   5: (bd) if r1 <= r3 goto pc+2
   6: (b4) (u32) r0 = (u32) 2
   7: (95) exit
   8: (bf) r1 = r2
   9: (b4) (u32) r5 = (u32) 0
  10: (69) r4 = *(u16 *)(r1 +12)
  11: (55) if r4 != 0x8 goto pc+6
  12: (07) r1 += 34
  13: (2d) if r1 > r3 goto pc+4
  14: (07) r1 += -20
  15: (61) r4 = *(u32 *)(r1 +12)
  16: (55) if r4 != 0x202a8c0 goto pc+1
  17: (04) (u32) r5 += (u32) 1
  18: (55) if r5 != 0x1 goto pc+2
  19: (b4) (u32) r0 = (u32) 1
  20: (95) exit
  21: (b4) (u32) r0 = (u32) 2
  22: (95) exit
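
The immediate 0x202a8c0 in instruction 16 is 192.168.2.2 as it appears
when the four address bytes, stored in network order in the packet, are
loaded as a u32 on a little-endian host (assumed here, as 0x200007f for
127.0.0.2 in the earlier dumps suggests). A small sketch of that
arithmetic; ipv4_as_le_u32() is a hypothetical helper, not part of this
series:

```c
#include <stdint.h>

/*
 * Value a little-endian CPU sees when loading the address a.b.c.d
 * (bytes a, b, c, d in packet order) as a 32-bit word.
 */
uint32_t ipv4_as_le_u32(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (uint32_t)a | ((uint32_t)b << 8) |
	       ((uint32_t)c << 16) | ((uint32_t)d << 24);
}
```

For example, ipv4_as_le_u32(192, 168, 2, 2) yields 0x0202a8c0.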

Thanks!

Alexei Starovoitov (2):
  modules: allow insmod load regular elf binaries
  bpf: introduce bpfilter commands

Daniel Borkmann (1):
  bpf: rough bpfilter codegen example hack

David S. Miller (1):
  net: initial bpfilter skeleton

 fs/exec.c                     |  40 ++++-
 include/linux/binfmts.h       |   1 +
 include/linux/bpfilter.h      |  13 ++
 include/linux/umh.h           |   4 +
 include/uapi/linux/bpf.h      |  31 ++++
 include/uapi/linux/bpfilter.h | 200 ++++++++++++++++++++++
 kernel/bpf/syscall.c          |  52 ++++++
 kernel/module.c               |  33 +++-
 kernel/umh.c                  |  24 ++-
 net/Kconfig                   |   2 +
 net/Makefile                  |   1 +
 net/bpfilter/Kconfig          |   7 +
 net/bpfilter/Makefile         |   9 +
 net/bpfilter/bpfilter.c       | 106 ++++++++++++
 net/bpfilter/bpfilter_mod.h   | 373 ++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/ctor.c           |  91 +++++++++++
 net/bpfilter/gen.c            | 290 ++++++++++++++++++++++++++++++++
 net/bpfilter/init.c           |  36 ++++
 net/bpfilter/sockopt.c        | 236 ++++++++++++++++++++++++++
 net/bpfilter/tables.c         |  73 +++++++++
 net/bpfilter/targets.c        |  51 ++++++
 net/bpfilter/tgts.c           |  26 +++
 net/ipv4/Makefile             |   2 +
 net/ipv4/bpfilter/Makefile    |   2 +
 net/ipv4/bpfilter/sockopt.c   |  64 ++++++++
 net/ipv4/ip_sockglue.c        |  17 ++
 26 files changed, 1767 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/gen.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

-- 
2.9.5


* [PATCH RFC 1/4] modules: allow insmod load regular elf binaries
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
@ 2018-02-16 13:40 ` Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 2/4] bpf: introduce bpfilter commands Daniel Borkmann
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 13:40 UTC
  To: netdev; +Cc: netfilter-devel, davem, alexei.starovoitov, Alexei Starovoitov

From: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 fs/exec.c               | 40 +++++++++++++++++++++++++++++++---------
 include/linux/binfmts.h |  1 +
 include/linux/umh.h     |  4 ++++
 kernel/module.c         | 33 ++++++++++++++++++++++++++++-----
 kernel/umh.c            | 24 +++++++++++++++++++++---
 5 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21..0483c43 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
-			      struct user_arg_ptr argv,
-			      struct user_arg_ptr envp,
-			      int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+			    struct user_arg_ptr argv,
+			    struct user_arg_ptr envp,
+			    int flags, struct file *file)
 {
 	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
-	struct file *file;
 	struct files_struct *displaced;
 	int retval;
 
@@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_execat(fd, filename, flags);
+	if (!file)
+		file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
 	if (IS_ERR(file))
 		goto out_unmark;
@@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	if (fd == AT_FDCWD || filename->name[0] == '/') {
+	if (!filename) {
+		bprm->filename = "/dev/null";
+	} else if (fd == AT_FDCWD || filename->name[0] == '/') {
 		bprm->filename = filename->name;
 	} else {
 		if (filename->name[0] == '\0')
@@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	task_numa_free(current);
 	free_bprm(bprm);
 	kfree(pathbuf);
-	putname(filename);
+	if (filename)
+		putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
 	return retval;
@@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
 	if (displaced)
 		reset_files_struct(displaced);
 out_ret:
-	putname(filename);
+	if (filename)
+		putname(filename);
 	return retval;
 }
 
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
+{
+	struct file *file = NULL;
+
+	return __do_execve_file(fd, filename, argv, envp, flags, file);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+	struct user_arg_ptr argv = { .ptr.native = __argv };
+	struct user_arg_ptr envp = { .ptr.native = __envp };
+
+	return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
 int do_execve(struct filename *filename,
 	const char __user *const __user *__argv,
 	const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b0abe21..c783a7b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -147,5 +147,6 @@ extern int do_execveat(int, struct filename *,
 		       const char __user * const __user *,
 		       const char __user * const __user *,
 		       int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);
 
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff6..68ddd4f 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,6 +22,7 @@ struct subprocess_info {
 	const char *path;
 	char **argv;
 	char **envp;
+	struct file *file;
 	int wait;
 	int retval;
 	int (*init)(struct subprocess_info *info, struct cred *new);
@@ -38,6 +39,9 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
 			  int (*init)(struct subprocess_info *info, struct cred *new),
 			  void (*cleanup)(struct subprocess_info *), void *data);
 
+extern struct subprocess_info *
+call_usermodehelper_setup_file(struct file *file);
+
 extern int
 call_usermodehelper_exec(struct subprocess_info *info, int wait);
 
diff --git a/kernel/module.c b/kernel/module.c
index 1d65b2c..b0febe3 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -325,6 +325,7 @@ struct load_info {
 	struct {
 		unsigned int sym, str, mod, vers, info, pcpu;
 	} index;
+	struct file *file;
 };
 
 /*
@@ -2801,6 +2802,15 @@ static int module_sig_check(struct load_info *info, int flags)
 }
 #endif /* !CONFIG_MODULE_SIG */
 
+static int run_umh(struct file *file)
+{
+	struct subprocess_info *sub_info = call_usermodehelper_setup_file(file);
+
+	if (!sub_info)
+		return -ENOMEM;
+	return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+}
+
 /* Sanity checks against invalid binaries, wrong arch, weird elf version. */
 static int elf_header_check(struct load_info *info)
 {
@@ -2808,7 +2818,6 @@ static int elf_header_check(struct load_info *info)
 		return -ENOEXEC;
 
 	if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0
-	    || info->hdr->e_type != ET_REL
 	    || !elf_check_arch(info->hdr)
 	    || info->hdr->e_shentsize != sizeof(Elf_Shdr))
 		return -ENOEXEC;
@@ -2818,6 +2827,11 @@ static int elf_header_check(struct load_info *info)
 		info->len - info->hdr->e_shoff))
 		return -ENOEXEC;
 
+	if (info->hdr->e_type == ET_EXEC)
+		return run_umh(info->file);
+
+	if (info->hdr->e_type != ET_REL)
+		return -ENOEXEC;
 	return 0;
 }
 
@@ -3861,6 +3875,7 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 {
 	struct load_info info = { };
+	struct fd f;
 	loff_t size;
 	void *hdr;
 	int err;
@@ -3875,14 +3890,22 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_IGNORE_VERMAGIC))
 		return -EINVAL;
 
-	err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
-				       READING_MODULE);
+	err = -EBADF;
+	f = fdget(fd);
+	if (!f.file)
+		goto out;
+
+	err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);
 	if (err)
-		return err;
+		goto out;
 	info.hdr = hdr;
 	info.len = size;
+	info.file = f.file;
 
-	return load_module(&info, uargs, flags);
+	err = load_module(&info, uargs, flags);
+out:
+	fdput(f);
+	return err;
 }
 
 static inline int within(unsigned long addr, void *start, unsigned long size)
diff --git a/kernel/umh.c b/kernel/umh.c
index 18e5fa4..073a686 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -97,9 +97,12 @@ static int call_usermodehelper_exec_async(void *data)
 
 	commit_creds(new);
 
-	retval = do_execve(getname_kernel(sub_info->path),
-			   (const char __user *const __user *)sub_info->argv,
-			   (const char __user *const __user *)sub_info->envp);
+	if (sub_info->file)
+		retval = do_execve_file(sub_info->file, sub_info->argv, sub_info->envp);
+	else
+		retval = do_execve(getname_kernel(sub_info->path),
+				   (const char __user *const __user *)sub_info->argv,
+				   (const char __user *const __user *)sub_info->envp);
 out:
 	sub_info->retval = retval;
 	/*
@@ -393,6 +396,21 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
 }
 EXPORT_SYMBOL(call_usermodehelper_setup);
 
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
+{
+	struct subprocess_info *sub_info;
+	sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
+	if (!sub_info)
+		goto out;
+
+	INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+
+	sub_info->path = "/dev/null";
+	sub_info->file = file;
+  out:
+	return sub_info;
+}
+
 /**
  * call_usermodehelper_exec - start a usermode application
  * @sub_info: information about the subprocessa
-- 
2.9.5



* [PATCH RFC 2/4] bpf: introduce bpfilter commands
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 1/4] modules: allow insmod load regular elf binaries Daniel Borkmann
@ 2018-02-16 13:40 ` Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 3/4] net: initial bpfilter skeleton Daniel Borkmann
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 13:40 UTC
  To: netdev; +Cc: netfilter-devel, davem, alexei.starovoitov, Alexei Starovoitov

From: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h | 16 ++++++++++++++++
 kernel/bpf/syscall.c     | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc3..ea977e9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,8 @@ enum bpf_cmd {
 	BPF_MAP_GET_FD_BY_ID,
 	BPF_OBJ_GET_INFO_BY_FD,
 	BPF_PROG_QUERY,
+	BPFILTER_GET_CMD,
+	BPFILTER_REPLY,
 };
 
 enum bpf_map_type {
@@ -231,6 +233,17 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
+struct bpfilter_get_cmd {
+	__u32 pid;
+	__u32 cmd;
+	__u64 addr;
+	__u32 len;
+};
+
+struct bpfilter_reply {
+	__u32 status;
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
@@ -320,6 +333,9 @@ union bpf_attr {
 		__aligned_u64	prog_ids;
 		__u32		prog_cnt;
 	} query;
+
+	struct bpfilter_get_cmd bpfilter_get_cmd;
+	struct bpfilter_reply bpfilter_reply;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa32..e933bf9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1840,6 +1840,41 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 	return err;
 }
 
+DECLARE_WAIT_QUEUE_HEAD(bpfilter_get_cmd_wq);
+DECLARE_WAIT_QUEUE_HEAD(bpfilter_reply_wq);
+bool bpfilter_get_cmd_ready = false;
+bool bpfilter_reply_ready = false;
+struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
+struct bpfilter_reply bpfilter_reply_mbox;
+
+#define BPFILTER_GET_CMD_LAST_FIELD bpfilter_get_cmd.len
+
+static int bpfilter_get_cmd(const union bpf_attr *attr,
+			    union bpf_attr __user *uattr)
+{
+	if (CHECK_ATTR(BPFILTER_GET_CMD))
+		return -EINVAL;
+	wait_event_killable(bpfilter_get_cmd_wq, bpfilter_get_cmd_ready);
+	bpfilter_get_cmd_ready = false;
+	if (copy_to_user(&uattr->bpfilter_get_cmd, &bpfilter_get_cmd_mbox,
+			 sizeof(bpfilter_get_cmd_mbox)))
+		return -EFAULT;
+	return 0;
+}
+
+#define BPFILTER_REPLY_LAST_FIELD bpfilter_reply.status
+
+static int bpfilter_reply(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr)
+{
+	if (CHECK_ATTR(BPFILTER_REPLY))
+		return -EINVAL;
+	bpfilter_reply_mbox.status = attr->bpfilter_reply.status;
+	bpfilter_reply_ready = true;
+	wake_up(&bpfilter_reply_wq);
+	return 0;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -1917,6 +1952,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET_INFO_BY_FD:
 		err = bpf_obj_get_info_by_fd(&attr, uattr);
 		break;
+	case BPFILTER_GET_CMD:
+		err = bpfilter_get_cmd(&attr, uattr);
+		break;
+	case BPFILTER_REPLY:
+		err = bpfilter_reply(&attr, uattr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
2.9.5


* [PATCH RFC 3/4] net: initial bpfilter skeleton
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 1/4] modules: allow insmod load regular elf binaries Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 2/4] bpf: introduce bpfilter commands Daniel Borkmann
@ 2018-02-16 13:40 ` Daniel Borkmann
  2018-02-16 13:40 ` [PATCH RFC 4/4] bpf: rough bpfilter codegen example hack Daniel Borkmann
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 13:40 UTC
  To: netdev; +Cc: netfilter-devel, davem, alexei.starovoitov, Alexei Starovoitov

From: "David S. Miller" <davem@davemloft.net>

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpfilter.h      |  13 +++
 include/uapi/linux/bpfilter.h | 200 ++++++++++++++++++++++++++++++++++++++++++
 net/Kconfig                   |   2 +
 net/Makefile                  |   1 +
 net/bpfilter/Kconfig          |   7 ++
 net/bpfilter/Makefile         |   9 ++
 net/bpfilter/bpfilter.c       |  89 +++++++++++++++++++
 net/bpfilter/bpfilter_mod.h   |  96 ++++++++++++++++++++
 net/bpfilter/ctor.c           |  80 +++++++++++++++++
 net/bpfilter/init.c           |  33 +++++++
 net/bpfilter/sockopt.c        | 153 ++++++++++++++++++++++++++++++++
 net/bpfilter/tables.c         |  70 +++++++++++++++
 net/bpfilter/targets.c        |  51 +++++++++++
 net/bpfilter/tgts.c           |  25 ++++++
 net/ipv4/Makefile             |   2 +
 net/ipv4/bpfilter/Makefile    |   2 +
 net/ipv4/bpfilter/sockopt.c   |  49 +++++++++++
 net/ipv4/ip_sockglue.c        |  17 ++++
 18 files changed, 899 insertions(+)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index 0000000..26adad1
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include <uapi/linux/bpfilter.h>
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
+			    unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
+			    int *optlen);
+#endif
+
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index 0000000..38d54e9
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include <linux/if.h>
+
+enum {
+	BPFILTER_IPT_SO_SET_REPLACE = 64,
+	BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+	BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+	BPFILTER_IPT_SO_GET_INFO = 64,
+	BPFILTER_IPT_SO_GET_ENTRIES = 65,
+	BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+	BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+	BPFILTER_IPT_GET_MAX,
+};
+
+enum {
+	BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+	BPFILTER_NF_DROP = 0,
+	BPFILTER_NF_ACCEPT = 1,
+	BPFILTER_NF_STOLEN = 2,
+	BPFILTER_NF_QUEUE = 3,
+	BPFILTER_NF_REPEAT = 4,
+	BPFILTER_NF_STOP = 5,
+	BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+	BPFILTER_INET_HOOK_PRE_ROUTING	= 0,
+	BPFILTER_INET_HOOK_LOCAL_IN	= 1,
+	BPFILTER_INET_HOOK_FORWARD	= 2,
+	BPFILTER_INET_HOOK_LOCAL_OUT	= 3,
+	BPFILTER_INET_HOOK_POST_ROUTING	= 4,
+	BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+	BPFILTER_PROTO_UNSPEC	= 0,
+	BPFILTER_PROTO_INET	= 1,
+	BPFILTER_PROTO_IPV4	= 2,
+	BPFILTER_PROTO_ARP	= 3,
+	BPFILTER_PROTO_NETDEV	= 5,
+	BPFILTER_PROTO_BRIDGE	= 7,
+	BPFILTER_PROTO_IPV6	= 10,
+	BPFILTER_PROTO_DECNET	= 12,
+	BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX		((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN         (-INT_MAX - 1)
+#endif
+
+enum {
+	BPFILTER_IP_PRI_FIRST			= INT_MIN,
+	BPFILTER_IP_PRI_CONNTRACK_DEFRAG	= -400,
+	BPFILTER_IP_PRI_RAW			= -300,
+	BPFILTER_IP_PRI_SELINUX_FIRST		= -225,
+	BPFILTER_IP_PRI_CONNTRACK		= -200,
+	BPFILTER_IP_PRI_MANGLE			= -150,
+	BPFILTER_IP_PRI_NAT_DST			= -100,
+	BPFILTER_IP_PRI_FILTER			= 0,
+	BPFILTER_IP_PRI_SECURITY		= 50,
+	BPFILTER_IP_PRI_NAT_SRC			= 100,
+	BPFILTER_IP_PRI_SELINUX_LAST		= 225,
+	BPFILTER_IP_PRI_CONNTRACK_HELPER	= 300,
+	BPFILTER_IP_PRI_CONNTRACK_CONFIRM	= INT_MAX,
+	BPFILTER_IP_PRI_LAST			= INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN	30
+#define BPFILTER_EXTENSION_MAXNAMELEN	29
+#define BPFILTER_TABLE_MAXNAMELEN	32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+	union {
+		struct {
+			__u16		match_size;
+			char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+			__u8		revision;
+		} user;
+		struct {
+			__u16			match_size;
+			struct bpfilter_match	*match;
+		} kernel;
+		__u16		match_size;
+	} u;
+	unsigned char	data[0];
+};
+
+struct bpfilter_target;
+struct bpfilter_entry_target {
+	union {
+		struct {
+			__u16		target_size;
+			char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+			__u8		revision;
+		} user;
+		struct {
+			__u16			target_size;
+			struct bpfilter_target	*target;
+		} kernel;
+		__u16		target_size;
+	} u;
+	unsigned char	data[0];
+};
+
+struct bpfilter_standard_target {
+	struct bpfilter_entry_target	target;
+	int				verdict;
+};
+
+struct bpfilter_error_target {
+	struct bpfilter_entry_target	target;
+	char				error_name[BPFILTER_FUNCTION_MAXNAMELEN];
+};
+
+#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
+#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
+
+#define BPFILTER_ALIGN(__X)	\
+	__ALIGN_KERNEL(__X, __alignof__(__u64))
+
+#define BPFILTER_TARGET_INIT(__name, __size)			\
+{								\
+	.target.u.user = {					\
+		.target_size	= BPFILTER_ALIGN(__size),	\
+		.name		= (__name),			\
+	},							\
+}
+#define BPFILTER_STANDARD_TARGET	""
+#define BPFILTER_ERROR_TARGET		"ERROR"
+
+struct bpfilter_xt_counters {
+	__u64	packet_cnt;
+	__u64	byte_cnt;
+};
+
+struct bpfilter_ipt_ip {
+	__u32	src;
+	__u32	dst;
+	__u32	src_mask;
+	__u32	dst_mask;
+	char	in_iface[IFNAMSIZ];
+	char	out_iface[IFNAMSIZ];
+	__u8	in_iface_mask[IFNAMSIZ];
+	__u8	out_iface_mask[IFNAMSIZ];
+	__u16	protocol;
+	__u8	flags;
+	__u8	inv_flags;
+};
+
+struct bpfilter_ipt_entry {
+	struct bpfilter_ipt_ip		ip;
+	__u32				bfcache;
+	__u16				target_offset;
+	__u16				next_offset;
+	__u32				camefrom;
+	struct bpfilter_xt_counters	cntrs;
+	__u8				elems[0];
+};
+
+struct bpfilter_ipt_get_info {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				valid_hooks;
+	__u32				hook_entry[BPFILTER_INET_HOOK_MAX];
+	__u32				underflow[BPFILTER_INET_HOOK_MAX];
+	__u32				num_entries;
+	__u32				size;
+};
+
+struct bpfilter_ipt_get_entries {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				size;
+	struct bpfilter_ipt_entry	entries[0];
+};
+
+struct bpfilter_ipt_replace {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				valid_hooks;
+	__u32				num_entries;
+	__u32				size;
+	__u32				hook_entry[BPFILTER_INET_HOOK_MAX];
+	__u32				underflow[BPFILTER_INET_HOOK_MAX];
+	__u32				num_counters;
+	struct bpfilter_xt_counters	*cntrs;
+	struct bpfilter_ipt_entry	entries[0];
+};
+
+#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/Kconfig b/net/Kconfig
index 37ec8e6..ec96506 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
 
 endif
 
+source "net/bpfilter/Kconfig"
+
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
 source "net/rds/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 14fede5..c388b3d 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_TLS)		+= tls/
 obj-$(CONFIG_XFRM)		+= xfrm/
 obj-$(CONFIG_UNIX)		+= unix/
 obj-$(CONFIG_NET)		+= ipv6/
+obj-$(CONFIG_BPFILTER)		+= bpfilter/
 obj-$(CONFIG_PACKET)		+= packet/
 obj-$(CONFIG_NET_KEY)		+= key/
 obj-$(CONFIG_BRIDGE)		+= bridge/
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
new file mode 100644
index 0000000..d29b5cb
--- /dev/null
+++ b/net/bpfilter/Kconfig
@@ -0,0 +1,7 @@
+menuconfig BPFILTER
+	bool "BPF Filter Configuration"
+	depends on NET && BPF
+
+if BPFILTER
+
+endif
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
new file mode 100644
index 0000000..5e05505
--- /dev/null
+++ b/net/bpfilter/Makefile
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the Linux BPFILTER layer.
+#
+
+hostprogs-y := bpfilter.ko
+always := $(hostprogs-y)
+bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+HOSTCFLAGS += -I. -Itools/include/
diff --git a/net/bpfilter/bpfilter.c b/net/bpfilter/bpfilter.c
new file mode 100644
index 0000000..445ae65
--- /dev/null
+++ b/net/bpfilter/bpfilter.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <errno.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "include/uapi/linux/bpf.h"
+#include <asm/unistd.h>
+#include "bpfilter_mod.h"
+
+extern long int syscall (long int __sysno, ...);
+
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+			  unsigned int size)
+{
+	return syscall(321, cmd, attr, size);
+}
+
+int pid;
+int debug_fd;
+
+int copy_from_user(void *dst, void *addr, int len)
+{
+	struct iovec local;
+	struct iovec remote;
+
+	local.iov_base = dst;
+	local.iov_len = len;
+	remote.iov_base = addr;
+	remote.iov_len = len;
+	return process_vm_readv(pid, &local, 1, &remote, 1, 0) != len;
+}
+
+int copy_to_user(void *addr, const void *src, int len)
+{
+	struct iovec local;
+	struct iovec remote;
+
+	local.iov_base = (void *) src;
+	local.iov_len = len;
+	remote.iov_base = addr;
+	remote.iov_len = len;
+	return process_vm_writev(pid, &local, 1, &remote, 1, 0) != len;
+}
+
+static int handle_cmd(struct bpfilter_get_cmd *cmd)
+{
+	pid = cmd->pid;
+	switch (cmd->cmd) {
+	case BPFILTER_IPT_SO_GET_INFO:
+		return bpfilter_get_info((void *) (long) cmd->addr, cmd->len);
+	case BPFILTER_IPT_SO_GET_ENTRIES:
+		return bpfilter_get_entries((void *) (long) cmd->addr, cmd->len);
+	default:
+		break;
+	}
+	return -ENOPROTOOPT;
+}
+
+static void loop(void)
+{
+	bpfilter_tables_init();
+	bpfilter_ipv4_init();
+
+	while (1) {
+		union bpf_attr get_cmd = {};
+		union bpf_attr reply = {};
+		struct bpfilter_get_cmd *cmd;
+
+		sys_bpf(BPFILTER_GET_CMD, &get_cmd, sizeof(get_cmd));
+		cmd = &get_cmd.bpfilter_get_cmd;
+
+		dprintf(debug_fd, "pid %d cmd %d addr %llx len %d\n",
+			cmd->pid, cmd->cmd, cmd->addr, cmd->len);
+
+		reply.bpfilter_reply.status = handle_cmd(cmd);
+		sys_bpf(BPFILTER_REPLY, &reply, sizeof(reply));
+	}
+}
+
+int main(void)
+{
+	debug_fd = open("/tmp/aa", 00000002 | 00000100);
+	loop();
+	close(debug_fd);
+	return 0;
+}
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
new file mode 100644
index 0000000..f0de41b
--- /dev/null
+++ b/net/bpfilter/bpfilter_mod.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_INTERNAL_H
+#define _LINUX_BPFILTER_INTERNAL_H
+
+#include "include/uapi/linux/bpfilter.h"
+#include <linux/list.h>
+
+struct bpfilter_table {
+	struct hlist_node	hash;
+	u32			valid_hooks;
+	struct			bpfilter_table_info *info;
+	int			hold;
+	u8			family;
+	int			priority;
+	const char		name[BPFILTER_XT_TABLE_MAXNAMELEN];
+};
+
+struct bpfilter_table_info {
+	unsigned int		size;
+	u32			num_entries;
+	unsigned int		initial_entries;
+	unsigned int		hook_entry[BPFILTER_INET_HOOK_MAX];
+	unsigned int		underflow[BPFILTER_INET_HOOK_MAX];
+	unsigned int		stacksize;
+	void			***jumpstack;
+	unsigned char		entries[0] __aligned(8);
+};
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len);
+void bpfilter_table_put(struct bpfilter_table *tbl);
+int bpfilter_table_add(struct bpfilter_table *tbl);
+
+struct bpfilter_ipt_standard {
+	struct bpfilter_ipt_entry	entry;
+	struct bpfilter_standard_target	target;
+};
+
+struct bpfilter_ipt_error {
+	struct bpfilter_ipt_entry	entry;
+	struct bpfilter_error_target	target;
+};
+
+#define BPFILTER_IPT_ENTRY_INIT(__sz) 				\
+{								\
+	.target_offset = sizeof(struct bpfilter_ipt_entry),	\
+	.next_offset = (__sz),					\
+}
+
+#define BPFILTER_IPT_STANDARD_INIT(__verdict) 					\
+{										\
+	.entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_standard)),	\
+	.target = BPFILTER_TARGET_INIT(BPFILTER_STANDARD_TARGET,		\
+				       sizeof(struct bpfilter_standard_target)),\
+	.target.verdict = -(__verdict) - 1,					\
+}
+
+#define BPFILTER_IPT_ERROR_INIT							\
+{										\
+	.entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_error)),	\
+	.target = BPFILTER_TARGET_INIT(BPFILTER_ERROR_TARGET,			\
+				       sizeof(struct bpfilter_error_target)),	\
+	.target.error_name = "ERROR",						\
+}
+
+struct bpfilter_target {
+	struct list_head	all_target_list;
+	const char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+	unsigned int		size;
+	int			hold;
+	u16			family;
+	u8			rev;
+};
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
+void bpfilter_target_put(struct bpfilter_target *tgt);
+int bpfilter_target_add(struct bpfilter_target *tgt);
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+int bpfilter_ipv4_register_targets(void);
+void bpfilter_tables_init(void);
+int bpfilter_get_info(void *addr, int len);
+int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_ipv4_init(void);
+
+int copy_from_user(void *dst, void *addr, int len);
+int copy_to_user(void *addr, const void *src, int len);
+#define put_user(x, ptr) \
+({ \
+	__typeof__(*(ptr)) __x = (x); \
+	copy_to_user(ptr, &__x, sizeof(*(ptr))); \
+})
+extern int pid;
+extern int debug_fd;
+#define ENOTSUPP        524
+
+#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
new file mode 100644
index 0000000..efb7fee
--- /dev/null
+++ b/net/bpfilter/ctor.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <linux/bitops.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+unsigned int __sw_hweight32(unsigned int w)
+{
+	w -= (w >> 1) & 0x55555555;
+	w =  (w & 0x33333333) + ((w >> 2) & 0x33333333);
+	w =  (w + (w >> 4)) & 0x0f0f0f0f;
+	return (w * 0x01010101) >> 24;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+{
+	unsigned int num_hooks = hweight32(tbl->valid_hooks);
+	struct bpfilter_ipt_standard *tgts;
+	struct bpfilter_table_info *info;
+	struct bpfilter_ipt_error *term;
+	unsigned int mask, offset, h, i;
+	unsigned int size, alloc_size;
+
+	size  = sizeof(struct bpfilter_ipt_standard) * num_hooks;
+	size += sizeof(struct bpfilter_ipt_error);
+
+	alloc_size = size + sizeof(struct bpfilter_table_info);
+
+	info = malloc(alloc_size);
+	if (!info)
+		return NULL;
+
+	info->num_entries = num_hooks + 1;
+	info->size = size;
+
+	tgts = (struct bpfilter_ipt_standard *) (info + 1);
+	term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+
+	mask = tbl->valid_hooks;
+	offset = 0;
+	h = 0;
+	i = 0;
+	dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
+	while (mask) {
+		struct bpfilter_ipt_standard *t;
+
+		if (!(mask & 1))
+			goto next;
+
+		info->hook_entry[h] = offset;
+		info->underflow[h] = offset;
+		t = &tgts[i++];
+		*t = (struct bpfilter_ipt_standard)
+			BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
+		t->target.target.u.kernel.target =
+			bpfilter_target_get_by_name(t->target.target.u.user.name);
+		dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
+		if (!t->target.target.u.kernel.target)
+			goto out_fail;
+
+		offset += sizeof(struct bpfilter_ipt_standard);
+	next:
+		mask >>= 1;
+		h++;
+	}
+	*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
+	term->target.target.u.kernel.target =
+		bpfilter_target_get_by_name(term->target.target.u.user.name);
+	dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
+	if (!term->target.target.u.kernel.target)
+		goto out_fail;
+
+	dprintf(debug_fd, "info %p\n", info);
+	return info;
+
+out_fail:
+	free(info);
+	return NULL;
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
new file mode 100644
index 0000000..699f3f6
--- /dev/null
+++ b/net/bpfilter/init.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include "bpfilter_mod.h"
+
+static struct bpfilter_table filter_table_ipv4 = {
+	.name		= "filter",
+	.valid_hooks	= ((1<<BPFILTER_INET_HOOK_LOCAL_IN) |
+			   (1<<BPFILTER_INET_HOOK_FORWARD) |
+			   (1<<BPFILTER_INET_HOOK_LOCAL_OUT)),
+	.family		= BPFILTER_PROTO_IPV4,
+	.priority	= BPFILTER_IP_PRI_FILTER,
+};
+
+int bpfilter_ipv4_init(void)
+{
+	struct bpfilter_table *t = &filter_table_ipv4;
+	struct bpfilter_table_info *info;
+	int err;
+
+	err = bpfilter_ipv4_register_targets();
+	if (err)
+		return err;
+
+	info = bpfilter_ipv4_table_ctor(t);
+	if (!info)
+		return -ENOMEM;
+
+	t->info = info;
+
+	return bpfilter_table_add(&filter_table_ipv4);
+}
+
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
new file mode 100644
index 0000000..43687da
--- /dev/null
+++ b/net/bpfilter/sockopt.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+static int fetch_name(void *addr, int len, char *name, int name_len)
+{
+	if (copy_from_user(name, addr, name_len))
+		return -EFAULT;
+
+	name[name_len - 1] = '\0';
+	return 0;
+}
+
+int bpfilter_get_info(void *addr, int len)
+{
+	char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	struct bpfilter_ipt_get_info resp;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len != sizeof(struct bpfilter_ipt_get_info))
+		return -EINVAL;
+
+	err = fetch_name(addr, len, name, sizeof(name));
+	if (err)
+		return err;
+
+	tbl = bpfilter_table_get_by_name(name, strlen(name));
+	if (!tbl)
+		return -ENOENT;
+
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+
+	memset(&resp, 0, sizeof(resp));
+	memcpy(resp.name, name, sizeof(resp.name));
+	resp.valid_hooks = tbl->valid_hooks;
+	memcpy(&resp.hook_entry, info->hook_entry, sizeof(resp.hook_entry));
+	memcpy(&resp.underflow, info->underflow, sizeof(resp.underflow));
+	resp.num_entries = info->num_entries;
+	resp.size = info->size;
+
+	err = 0;
+	if (copy_to_user(addr, &resp, len))
+		err = -EFAULT;
+out_put:
+	bpfilter_table_put(tbl);
+	return err;
+}
+
+static int copy_target(struct bpfilter_standard_target *ut,
+		       struct bpfilter_standard_target *kt)
+{
+	struct bpfilter_target *tgt;
+	int sz;
+
+
+	if (put_user(kt->target.u.target_size,
+		     &ut->target.u.target_size))
+		return -EFAULT;
+
+	tgt = kt->target.u.kernel.target;
+	if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name) + 1))
+		return -EFAULT;
+
+	if (put_user(tgt->rev, &ut->target.u.user.revision))
+		return -EFAULT;
+
+	sz = tgt->size;
+	if (copy_to_user(ut->target.data, kt->target.data, sz))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int do_get_entries(void *up,
+			  struct bpfilter_table *tbl,
+			  struct bpfilter_table_info *info)
+{
+	unsigned int total_size = info->size;
+	const struct bpfilter_ipt_entry *ent;
+	unsigned int off;
+	void *base;
+
+	base = info->entries;
+
+	for (off = 0; off < total_size; off += ent->next_offset) {
+		struct bpfilter_xt_counters *cntrs;
+		struct bpfilter_standard_target *tgt;
+
+		ent = base + off;
+		if (copy_to_user(up + off, ent, sizeof(*ent)))
+			return -EFAULT;
+
+		/* XXX Just clear counters for now. XXX */
+		cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
+		if (put_user(0, &cntrs->packet_cnt) ||
+		    put_user(0, &cntrs->byte_cnt))
+			return -EFAULT;
+
+		tgt = (void *) ent + ent->target_offset;
+		dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
+		if (copy_target(up + off + ent->target_offset, tgt))
+			return -EFAULT;
+	}
+	return 0;
+}
+
+int bpfilter_get_entries(void *cmd, int len)
+{
+	struct bpfilter_ipt_get_entries *uptr = cmd;
+	struct bpfilter_ipt_get_entries req;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len < sizeof(struct bpfilter_ipt_get_entries))
+		return -EINVAL;
+
+	if (copy_from_user(&req, cmd, sizeof(req)))
+		return -EFAULT;
+
+	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+	if (!tbl)
+		return -ENOENT;
+
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+
+	if (info->size != req.size) {
+		err = -EINVAL;
+		goto out_put;
+	}
+
+	err = do_get_entries(uptr->entries, tbl, info);
+	dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
+
+out_put:
+	bpfilter_table_put(tbl);
+
+	return err;
+}
+
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
new file mode 100644
index 0000000..9a96599
--- /dev/null
+++ b/net/bpfilter/tables.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/hashtable.h>
+#include "bpfilter_mod.h"
+
+static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
+{
+	unsigned int hash = 0;
+	int i;
+
+	for (i = 0; i < len; i++)
+		hash ^= *(name + i);
+	return hash;
+}
+
+DEFINE_HASHTABLE(bpfilter_tables, 4);
+//DEFINE_MUTEX(bpfilter_table_mutex);
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len)
+{
+	unsigned int hval = full_name_hash(NULL, name, name_len);
+	struct bpfilter_table *tbl;
+
+//	mutex_lock(&bpfilter_table_mutex);
+	hash_for_each_possible(bpfilter_tables, tbl, hash, hval) {
+		if (!strcmp(name, tbl->name)) {
+			tbl->hold++;
+			goto out;
+		}
+	}
+	tbl = NULL;
+out:
+//	mutex_unlock(&bpfilter_table_mutex);
+	return tbl;
+}
+
+void bpfilter_table_put(struct bpfilter_table *tbl)
+{
+//	mutex_lock(&bpfilter_table_mutex);
+	tbl->hold--;
+//	mutex_unlock(&bpfilter_table_mutex);
+}
+
+int bpfilter_table_add(struct bpfilter_table *tbl)
+{
+	unsigned int hval = full_name_hash(NULL, tbl->name, strlen(tbl->name));
+	struct bpfilter_table *srch;
+
+//	mutex_lock(&bpfilter_table_mutex);
+	hash_for_each_possible(bpfilter_tables, srch, hash, hval) {
+		if (!strcmp(srch->name, tbl->name))
+			goto exists;
+	}
+	hash_add(bpfilter_tables, &tbl->hash, hval);
+//	mutex_unlock(&bpfilter_table_mutex);
+
+	return 0;
+
+exists:
+//	mutex_unlock(&bpfilter_table_mutex);
+	return -EEXIST;
+}
+
+void bpfilter_tables_init(void)
+{
+	hash_init(bpfilter_tables);
+}
+
diff --git a/net/bpfilter/targets.c b/net/bpfilter/targets.c
new file mode 100644
index 0000000..4086ac8
--- /dev/null
+++ b/net/bpfilter/targets.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include "bpfilter_mod.h"
+
+//DEFINE_MUTEX(bpfilter_target_mutex);
+static LIST_HEAD(bpfilter_targets);
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name)
+{
+	struct bpfilter_target *tgt;
+
+//	mutex_lock(&bpfilter_target_mutex);
+	list_for_each_entry(tgt, &bpfilter_targets, all_target_list) {
+		if (!strcmp(tgt->name, name)) {
+			tgt->hold++;
+			goto out;
+		}
+	}
+	tgt = NULL;
+out:
+//	mutex_unlock(&bpfilter_target_mutex);
+	return tgt;
+}
+
+void bpfilter_target_put(struct bpfilter_target *tgt)
+{
+//	mutex_lock(&bpfilter_target_mutex);
+	tgt->hold--;
+//	mutex_unlock(&bpfilter_target_mutex);
+}
+
+int bpfilter_target_add(struct bpfilter_target *tgt)
+{
+	struct bpfilter_target *srch;
+
+//	mutex_lock(&bpfilter_target_mutex);
+	list_for_each_entry(srch, &bpfilter_targets, all_target_list) {
+		if (!strcmp(srch->name, tgt->name))
+			goto exists;
+	}
+	list_add_tail(&tgt->all_target_list, &bpfilter_targets);
+//	mutex_unlock(&bpfilter_target_mutex);
+	return 0;
+
+exists:
+//	mutex_unlock(&bpfilter_target_mutex);
+	return -EEXIST;
+}
+
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
new file mode 100644
index 0000000..eac5e8a
--- /dev/null
+++ b/net/bpfilter/tgts.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include "bpfilter_mod.h"
+
+struct bpfilter_target std_tgt = {
+	.name = BPFILTER_STANDARD_TARGET,
+	.family = BPFILTER_PROTO_IPV4,
+	.size = sizeof(int),
+};
+
+struct bpfilter_target err_tgt = {
+	.name = BPFILTER_ERROR_TARGET,
+	.family = BPFILTER_PROTO_IPV4,
+	.size = BPFILTER_FUNCTION_MAXNAMELEN,
+};
+
+int bpfilter_ipv4_register_targets(void)
+{
+	int err = bpfilter_target_add(&std_tgt);
+
+	if (err)
+		return err;
+	return bpfilter_target_add(&err_tgt);
+}
+
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 47a0a66..ed5f53b 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -15,6 +15,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
 	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
 
+obj-$(CONFIG_BPFILTER) += bpfilter/
+
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
 obj-$(CONFIG_PROC_FS) += proc.o
diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
new file mode 100644
index 0000000..ce262d7
--- /dev/null
+++ b/net/ipv4/bpfilter/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_BPFILTER) += sockopt.o
+
diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
new file mode 100644
index 0000000..26e544f
--- /dev/null
+++ b/net/ipv4/bpfilter/sockopt.c
@@ -0,0 +1,49 @@
+#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+#include <uapi/linux/bpf.h>
+#include <linux/wait.h>
+#include <linux/kmod.h>
+struct sock;
+
+extern struct wait_queue_head bpfilter_get_cmd_wq;
+extern struct wait_queue_head bpfilter_reply_wq;
+extern bool bpfilter_get_cmd_ready;
+extern bool bpfilter_reply_ready;
+extern struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
+extern struct bpfilter_reply bpfilter_reply_mbox;
+
+bool loaded = false;
+
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+			    unsigned int optlen)
+{
+	int err;
+
+	if (!loaded) {
+		err = request_module("bpfilter");
+		printk("request_module %d\n", err);
+//		if (err)
+//			return err;
+		loaded = true;
+	}
+	bpfilter_get_cmd_mbox.pid = current->pid;
+	bpfilter_get_cmd_mbox.cmd = optname;
+	bpfilter_get_cmd_mbox.addr = (long) optval;
+	bpfilter_get_cmd_mbox.len = optlen;
+	bpfilter_get_cmd_ready = true;
+	wake_up(&bpfilter_get_cmd_wq);
+	wait_event_killable(bpfilter_reply_wq, bpfilter_reply_ready);
+	bpfilter_reply_ready = false;
+	return bpfilter_reply_mbox.status;
+}
+
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
+			    int __user *optlen)
+{
+	int len;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+
+	return bpfilter_ip_set_sockopt(sk, optname, optval, len);
+}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 6cc70fa..439c1b9 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -47,6 +47,8 @@
 #include <linux/errqueue.h>
 #include <linux/uaccess.h>
 
+#include <linux/bpfilter.h>
+
 /*
  *	SOL_IP control messages.
  */
@@ -1250,6 +1252,11 @@ int ip_setsockopt(struct sock *sk, int level,
 		return -ENOPROTOOPT;
 
 	err = do_ip_setsockopt(sk, level, optname, optval, optlen);
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
+	    optname < BPFILTER_IPT_SET_MAX)
+		err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
@@ -1564,6 +1571,11 @@ int ip_getsockopt(struct sock *sk, int level,
 	int err;
 
 	err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+	    optname < BPFILTER_IPT_GET_MAX)
+		err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
@@ -1599,6 +1611,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
 	err = do_ip_getsockopt(sk, level, optname, optval, optlen,
 		MSG_CMSG_COMPAT);
 
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+	    optname < BPFILTER_IPT_GET_MAX)
+		err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
-- 
2.9.5
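
Note on the verdict encoding in BPFILTER_IPT_STANDARD_INIT (bpfilter_mod.h):
the macro stores -(verdict) - 1 in the standard target. This follows the
iptables convention that a negative value in a standard target is an absolute
verdict, while a non-negative value is a jump offset into the table. A minimal
sketch of that mapping is given here. All sketch_* names are hypothetical, and
it assumes the BPFILTER_NF_* verdicts mirror NF_DROP (0) and NF_ACCEPT (1).

```c
#include <assert.h>

/* Assumed to match the netfilter verdict numbering (NF_DROP, NF_ACCEPT). */
enum { SKETCH_NF_DROP = 0, SKETCH_NF_ACCEPT = 1 };

/* Same arithmetic as ".target.verdict = -(__verdict) - 1" in the macro. */
int sketch_encode_verdict(int nf_verdict)
{
	return -nf_verdict - 1;
}

/* Negative stored values are absolute verdicts, the rest are jump offsets. */
int sketch_is_absolute_verdict(int stored)
{
	return stored < 0;
}

/* Inverse of the encoding; only meaningful for absolute verdicts. */
int sketch_decode_verdict(int stored)
{
	assert(sketch_is_absolute_verdict(stored));
	return -stored - 1;
}
```

Under these assumptions the default ACCEPT policy entries built in ctor.c
would carry the stored value -2, and a DROP policy would carry -1.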

* [PATCH RFC 4/4] bpf: rough bpfilter codegen example hack
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
                   ` (2 preceding siblings ...)
  2018-02-16 13:40 ` [PATCH RFC 3/4] net: initial bpfilter skeleton Daniel Borkmann
@ 2018-02-16 13:40 ` Daniel Borkmann
  2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
  2018-02-17 12:11 ` Harald Welte
  5 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 13:40 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, davem, alexei.starovoitov, Daniel Borkmann

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
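Note: struct bpf_mbox_request carries a 'kind' field next to 'cmd'. One
plausible reason, assuming the BPFILTER_IPT_SO_* values mirror their iptables
counterparts, is that the set and get option ranges start at the same base
(IPT_SO_SET_REPLACE and IPT_SO_GET_INFO are both 64), so 'cmd' alone cannot
tell a set from a get. A rough sketch of the resulting dispatch follows. The
sketch_* names, the local struct copy, and the SKETCH_RAN_* return tags are
illustrative only; the real handlers return 0 or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for the uapi definitions, in the same field order. */
enum sketch_mbox_kind { SKETCH_KIND_SET, SKETCH_KIND_GET };

struct sketch_mbox_request {
	uint64_t addr;
	uint32_t len;
	uint32_t subsys;
	uint32_t kind;
	uint32_t cmd;
	uint32_t pid;
};

/* Assumed option numbers: both ranges start at the iptables base of 64. */
enum { SKETCH_SO_SET_REPLACE = 64, SKETCH_SO_GET_INFO = 64 };

/* Tags that only show which handler ran; not real status codes. */
enum { SKETCH_RAN_SET = 1, SKETCH_RAN_GET = 2 };

static int sketch_handle_set(const struct sketch_mbox_request *req)
{
	return req->cmd == SKETCH_SO_SET_REPLACE ? SKETCH_RAN_SET : -ENOPROTOOPT;
}

static int sketch_handle_get(const struct sketch_mbox_request *req)
{
	return req->cmd == SKETCH_SO_GET_INFO ? SKETCH_RAN_GET : -ENOPROTOOPT;
}

/* Same shape as the ternary in loop(): pick the handler by kind, not cmd. */
int sketch_dispatch(const struct sketch_mbox_request *req)
{
	assert(req);
	return req->kind == SKETCH_KIND_SET ? sketch_handle_set(req)
					    : sketch_handle_get(req);
}
```

With cmd == 64 in both cases, only 'kind' decides which handler runs.
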
 include/uapi/linux/bpf.h    |  31 +++--
 kernel/bpf/syscall.c        |  39 +++---
 net/bpfilter/Makefile       |   2 +-
 net/bpfilter/bpfilter.c     |  59 +++++----
 net/bpfilter/bpfilter_mod.h | 285 ++++++++++++++++++++++++++++++++++++++++++-
 net/bpfilter/ctor.c         |  57 +++++----
 net/bpfilter/gen.c          | 290 ++++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/init.c         |  11 +-
 net/bpfilter/sockopt.c      | 137 ++++++++++++++++-----
 net/bpfilter/tables.c       |   5 +-
 net/bpfilter/tgts.c         |   1 +
 net/ipv4/bpfilter/sockopt.c |  25 +++-
 13 files changed, 835 insertions(+), 109 deletions(-)
 create mode 100644 net/bpfilter/gen.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ea977e9..066d76b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,8 +94,8 @@ enum bpf_cmd {
 	BPF_MAP_GET_FD_BY_ID,
 	BPF_OBJ_GET_INFO_BY_FD,
 	BPF_PROG_QUERY,
-	BPFILTER_GET_CMD,
-	BPFILTER_REPLY,
+	BPF_MBOX_REQUEST,
+	BPF_MBOX_REPLY,
 };
 
 enum bpf_map_type {
@@ -233,14 +233,29 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
-struct bpfilter_get_cmd {
-	__u32 pid;
-	__u32 cmd;
+enum bpf_mbox_subsys {
+	BPF_MBOX_SUBSYS_BPFILTER,
+#define BPF_MBOX_SUBSYS_BPFILTER	BPF_MBOX_SUBSYS_BPFILTER
+};
+
+enum bpf_mbox_kind {
+	BPF_MBOX_KIND_SET,
+#define BPF_MBOX_KIND_SET		BPF_MBOX_KIND_SET
+	BPF_MBOX_KIND_GET,
+#define BPF_MBOX_KIND_GET		BPF_MBOX_KIND_GET
+};
+
+struct bpf_mbox_request {
 	__u64 addr;
 	__u32 len;
+	__u32 subsys;
+	__u32 kind;
+	__u32 cmd;
+	__u32 pid;
 };
 
-struct bpfilter_reply {
+struct bpf_mbox_reply {
+	__u32 subsys;
 	__u32 status;
 };
 
@@ -334,8 +349,8 @@ union bpf_attr {
 		__u32		prog_cnt;
 	} query;
 
-	struct bpfilter_get_cmd bpfilter_get_cmd;
-	struct bpfilter_reply bpfilter_reply;
+	struct bpf_mbox_request	mbox_request;
+	struct bpf_mbox_reply 	mbox_reply;
 } __attribute__((aligned(8)));
 
 /* BPF helper function descriptions:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e933bf9..2feb438 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1842,36 +1842,47 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 
 DECLARE_WAIT_QUEUE_HEAD(bpfilter_get_cmd_wq);
 DECLARE_WAIT_QUEUE_HEAD(bpfilter_reply_wq);
+
 bool bpfilter_get_cmd_ready = false;
 bool bpfilter_reply_ready = false;
-struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
-struct bpfilter_reply bpfilter_reply_mbox;
 
-#define BPFILTER_GET_CMD_LAST_FIELD bpfilter_get_cmd.len
+struct bpf_mbox_request bpfilter_get_cmd_mbox;
+struct bpf_mbox_reply   bpfilter_reply_mbox;
+
+#define BPF_MBOX_REQUEST_LAST_FIELD	mbox_request.pid
 
-static int bpfilter_get_cmd(const union bpf_attr *attr,
+static int bpf_mbox_request(const union bpf_attr *attr,
 			    union bpf_attr __user *uattr)
 {
-	if (CHECK_ATTR(BPFILTER_GET_CMD))
+	if (CHECK_ATTR(BPF_MBOX_REQUEST))
 		return -EINVAL;
+	if (attr->mbox_request.subsys != BPF_MBOX_SUBSYS_BPFILTER)
+		return -ENOTSUPP;
+
 	wait_event_killable(bpfilter_get_cmd_wq, bpfilter_get_cmd_ready);
 	bpfilter_get_cmd_ready = false;
-	if (copy_to_user(&uattr->bpfilter_get_cmd, &bpfilter_get_cmd_mbox,
+
+	if (copy_to_user(&uattr->mbox_request, &bpfilter_get_cmd_mbox,
 			 sizeof(bpfilter_get_cmd_mbox)))
 		return -EFAULT;
 	return 0;
 }
 
-#define BPFILTER_REPLY_LAST_FIELD bpfilter_reply.status
+#define BPF_MBOX_REPLY_LAST_FIELD	mbox_reply.status
 
-static int bpfilter_reply(const union bpf_attr *attr,
+static int bpf_mbox_reply(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
 {
-	if (CHECK_ATTR(BPFILTER_REPLY))
+	if (CHECK_ATTR(BPF_MBOX_REPLY))
 		return -EINVAL;
-	bpfilter_reply_mbox.status = attr->bpfilter_reply.status;
+	if (attr->mbox_reply.subsys != BPF_MBOX_SUBSYS_BPFILTER)
+		return -ENOTSUPP;
+
+	bpfilter_reply_mbox.subsys = attr->mbox_reply.subsys;
+	bpfilter_reply_mbox.status = attr->mbox_reply.status;
 	bpfilter_reply_ready = true;
 	wake_up(&bpfilter_reply_wq);
+
 	return 0;
 }
 
@@ -1952,11 +1963,11 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET_INFO_BY_FD:
 		err = bpf_obj_get_info_by_fd(&attr, uattr);
 		break;
-	case BPFILTER_GET_CMD:
-		err = bpfilter_get_cmd(&attr, uattr);
+	case BPF_MBOX_REQUEST:
+		err = bpf_mbox_request(&attr, uattr);
 		break;
-	case BPFILTER_REPLY:
-		err = bpfilter_reply(&attr, uattr);
+	case BPF_MBOX_REPLY:
+		err = bpf_mbox_reply(&attr, uattr);
 		break;
 	default:
 		err = -EINVAL;
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 5e05505..5a85ef7 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -5,5 +5,5 @@
 
 hostprogs-y := bpfilter.ko
 always := $(hostprogs-y)
-bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
 HOSTCFLAGS += -I. -Itools/include/
diff --git a/net/bpfilter/bpfilter.c b/net/bpfilter/bpfilter.c
index 445ae65..364c66a 100644
--- a/net/bpfilter/bpfilter.c
+++ b/net/bpfilter/bpfilter.c
@@ -1,19 +1,22 @@
 // SPDX-License-Identifier: GPL-2.0
 #define _GNU_SOURCE
-#include <sys/uio.h>
 #include <errno.h>
 #include <stdio.h>
-#include <sys/socket.h>
 #include <fcntl.h>
 #include <unistd.h>
-#include "include/uapi/linux/bpf.h"
+
+#include <sys/uio.h>
+#include <sys/socket.h>
+
 #include <asm/unistd.h>
+
+#include "include/uapi/linux/bpf.h"
+
 #include "bpfilter_mod.h"
 
 extern long int syscall (long int __sysno, ...);
 
-static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
-			  unsigned int size)
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
 {
 	return syscall(321, cmd, attr, size);
 }
@@ -38,21 +41,35 @@ int copy_to_user(void *addr, const void *src, int len)
 	struct iovec local;
 	struct iovec remote;
 
-	local.iov_base = (void *) src;
+	local.iov_base = (void *)src;
 	local.iov_len = len;
 	remote.iov_base = addr;
 	remote.iov_len = len;
 	return process_vm_writev(pid, &local, 1, &remote, 1, 0) != len;
 }
 
-static int handle_cmd(struct bpfilter_get_cmd *cmd)
+static int handle_get_cmd(struct bpf_mbox_request *cmd)
 {
 	pid = cmd->pid;
 	switch (cmd->cmd) {
 	case BPFILTER_IPT_SO_GET_INFO:
-		return bpfilter_get_info((void *) (long) cmd->addr, cmd->len);
+		return bpfilter_get_info((void *)(long)cmd->addr, cmd->len);
 	case BPFILTER_IPT_SO_GET_ENTRIES:
-		return bpfilter_get_entries((void *) (long) cmd->addr, cmd->len);
+		return bpfilter_get_entries((void *)(long)cmd->addr, cmd->len);
+	default:
+		break;
+	}
+	return -ENOPROTOOPT;
+}
+
+static int handle_set_cmd(struct bpf_mbox_request *cmd)
+{
+	pid = cmd->pid;
+	switch (cmd->cmd) {
+	case BPFILTER_IPT_SO_SET_REPLACE:
+		return bpfilter_set_replace((void *)(long)cmd->addr, cmd->len);
+	case BPFILTER_IPT_SO_SET_ADD_COUNTERS:
+		return bpfilter_set_add_counters((void *)(long)cmd->addr, cmd->len);
 	default:
 		break;
 	}
@@ -65,24 +82,24 @@ static void loop(void)
 	bpfilter_ipv4_init();
 
 	while (1) {
-		union bpf_attr get_cmd = {};
-		union bpf_attr reply = {};
-		struct bpfilter_get_cmd *cmd;
-
-		sys_bpf(BPFILTER_GET_CMD, &get_cmd, sizeof(get_cmd));
-		cmd = &get_cmd.bpfilter_get_cmd;
-
-		dprintf(debug_fd, "pid %d cmd %d addr %llx len %d\n",
-			cmd->pid, cmd->cmd, cmd->addr, cmd->len);
+		union bpf_attr req = {};
+		union bpf_attr rep = {};
+		struct bpf_mbox_request *cmd;
 
-		reply.bpfilter_reply.status = handle_cmd(cmd);
-		sys_bpf(BPFILTER_REPLY, &reply, sizeof(reply));
+		req.mbox_request.subsys = BPF_MBOX_SUBSYS_BPFILTER;
+		sys_bpf(BPF_MBOX_REQUEST, &req, sizeof(req));
+		cmd = &req.mbox_request;
+		rep.mbox_reply.subsys = BPF_MBOX_SUBSYS_BPFILTER;
+		rep.mbox_reply.status = cmd->kind == BPF_MBOX_KIND_SET ?
+					handle_set_cmd(cmd) :
+					handle_get_cmd(cmd);
+		sys_bpf(BPF_MBOX_REPLY, &rep, sizeof(rep));
 	}
 }
 
 int main(void)
 {
-	debug_fd = open("/tmp/aa", 00000002 | 00000100);
+	debug_fd = open("/dev/pts/1" /* /tmp/aa */, 00000002 | 00000100);
 	loop();
 	close(debug_fd);
 	return 0;
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index f0de41b..b420998 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -21,8 +21,8 @@ struct bpfilter_table_info {
 	unsigned int		initial_entries;
 	unsigned int		hook_entry[BPFILTER_INET_HOOK_MAX];
 	unsigned int		underflow[BPFILTER_INET_HOOK_MAX];
-	unsigned int		stacksize;
-	void			***jumpstack;
+//	unsigned int		stacksize;
+//	void			***jumpstack;
 	unsigned char		entries[0] __aligned(8);
 };
 
@@ -64,22 +64,55 @@ struct bpfilter_ipt_error {
 
 struct bpfilter_target {
 	struct list_head	all_target_list;
-	const char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+	char			name[BPFILTER_EXTENSION_MAXNAMELEN];
 	unsigned int		size;
 	int			hold;
 	u16			family;
 	u8			rev;
 };
 
+struct bpfilter_gen_ctx {
+	struct bpf_insn		*img;
+	u32			len_cur;
+	u32			len_max;
+	u32			default_verdict;
+	int			fd;
+	int			ifindex;
+	bool			offloaded;
+};
+
+union bpf_attr;
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+			struct bpfilter_ipt_ip *ent, int verdict);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
+
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
 int bpfilter_target_add(struct bpfilter_target *tgt);
 
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl, __u32 size_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+			     struct bpfilter_table_info *info,
+			     __u32 size_ents, __u32 num_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize2(struct bpfilter_table *tbl,
+			      struct bpfilter_table_info *info,
+			      __u32 size_ents, __u32 num_ents);
+
 int bpfilter_ipv4_register_targets(void);
 void bpfilter_tables_init(void);
 int bpfilter_get_info(void *addr, int len);
 int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_set_replace(void *cmd, int len);
+int bpfilter_set_add_counters(void *cmd, int len);
 int bpfilter_ipv4_init(void);
 
 int copy_from_user(void *dst, void *addr, int len);
@@ -93,4 +126,248 @@ extern int pid;
 extern int debug_fd;
 #define ENOTSUPP        524
 
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Endianness conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD)				\
+	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
+		.dst_reg = 0,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Unconditional jumps, goto pc + off16 */
+
+#define BPF_JMP_A(OFF)						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_JA,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_CALL,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
 #endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
index efb7fee..ba44c21 100644
--- a/net/bpfilter/ctor.c
+++ b/net/bpfilter/ctor.c
@@ -1,8 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
-#include <linux/bitops.h>
 #include <stdlib.h>
 #include <stdio.h>
+#include <string.h>
+
+#include <sys/socket.h>
+
+#include <linux/bitops.h>
+
 #include "bpfilter_mod.h"
 
 unsigned int __sw_hweight32(unsigned int w)
@@ -13,35 +17,47 @@ unsigned int __sw_hweight32(unsigned int w)
 	return (w * 0x01010101) >> 24;
 }
 
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+struct bpfilter_table_info *bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl,
+						      __u32 size_ents)
 {
 	unsigned int num_hooks = hweight32(tbl->valid_hooks);
-	struct bpfilter_ipt_standard *tgts;
 	struct bpfilter_table_info *info;
-	struct bpfilter_ipt_error *term;
-	unsigned int mask, offset, h, i;
 	unsigned int size, alloc_size;
 
 	size  = sizeof(struct bpfilter_ipt_standard) * num_hooks;
 	size += sizeof(struct bpfilter_ipt_error);
+	size += size_ents;
 
 	alloc_size = size + sizeof(struct bpfilter_table_info);
 
 	info = malloc(alloc_size);
-	if (!info)
-		return NULL;
+	if (info) {
+		memset(info, 0, alloc_size);
+		info->size = size;
+	}
+	return info;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+							 struct bpfilter_table_info *info,
+							 __u32 size_ents, __u32 num_ents)
+{
+	unsigned int num_hooks = hweight32(tbl->valid_hooks);
+	struct bpfilter_ipt_standard *tgts;
+	struct bpfilter_ipt_error *term;
+	struct bpfilter_ipt_entry *ent;
+	unsigned int mask, offset, h, i;
 
-	info->num_entries = num_hooks + 1;
-	info->size = size;
+	info->num_entries = num_ents + num_hooks + 1;
 
-	tgts = (struct bpfilter_ipt_standard *) (info + 1);
-	term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+	ent  = (struct bpfilter_ipt_entry *)(info + 1);
+	tgts = (struct bpfilter_ipt_standard *)((u8 *)ent + size_ents);
+	term = (struct bpfilter_ipt_error *)(tgts + num_hooks);
 
 	mask = tbl->valid_hooks;
 	offset = 0;
 	h = 0;
 	i = 0;
-	dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
 	while (mask) {
 		struct bpfilter_ipt_standard *t;
 
@@ -55,7 +71,6 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
 			BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
 		t->target.target.u.kernel.target =
 			bpfilter_target_get_by_name(t->target.target.u.user.name);
-		dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
 		if (!t->target.target.u.kernel.target)
 			goto out_fail;
 
@@ -67,14 +82,10 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
 	*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
 	term->target.target.u.kernel.target =
 		bpfilter_target_get_by_name(term->target.target.u.user.name);
-	dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
-	if (!term->target.target.u.kernel.target)
-		goto out_fail;
-
-	dprintf(debug_fd, "info %p\n", info);
-	return info;
-
+	if (!term->target.target.u.kernel.target) {
 out_fail:
-	free(info);
-	return NULL;
+		free(info);
+		return NULL;
+	}
+	return info;
 }
diff --git a/net/bpfilter/gen.c b/net/bpfilter/gen.c
new file mode 100644
index 0000000..8e08561
--- /dev/null
+++ b/net/bpfilter/gen.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_link.h>
+#include <linux/rtnetlink.h>
+#include <linux/bpf.h>
+typedef __u16 __bitwise __sum16; /* hack */
+#include <linux/ip.h>
+
+#include <arpa/inet.h>
+
+#include "bpfilter_mod.h"
+
+unsigned int if_nametoindex(const char *ifname);
+
+static inline __u64 bpf_ptr_to_u64(const void *ptr)
+{
+	return (__u64)(unsigned long)ptr;
+}
+
+static int bpf_prog_load(enum bpf_prog_type type,
+			 const struct bpf_insn *insns,
+			 unsigned int insn_num,
+			 __u32 offload_ifindex)
+{
+	union bpf_attr attr = {};
+
+	attr.prog_type		= type;
+	attr.insns		= bpf_ptr_to_u64(insns);
+	attr.insn_cnt		= insn_num;
+	attr.license		= bpf_ptr_to_u64("GPL");
+	attr.prog_ifindex	= offload_ifindex;
+
+	return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
+static int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* started nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+static int bpfilter_load_dev(struct bpfilter_gen_ctx *ctx)
+{
+	u32 xdp_flags = 0;
+
+	if (ctx->offloaded)
+		xdp_flags |= XDP_FLAGS_HW_MODE;
+	return bpf_set_link_xdp_fd(ctx->ifindex, ctx->fd, xdp_flags);
+}
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx)
+{
+	unsigned int len_max = BPF_MAXINSNS;
+
+	memset(ctx, 0, sizeof(*ctx));
+	ctx->img = calloc(len_max, sizeof(struct bpf_insn));
+	if (!ctx->img)
+		return -ENOMEM;
+	ctx->len_max = len_max;
+	ctx->fd = -1;
+	ctx->default_verdict = XDP_PASS;
+
+	return 0;
+}
+
+#define EMIT(x)						\
+	do {						\
+		if (ctx->len_cur + 1 > ctx->len_max)	\
+			return -ENOMEM;			\
+		ctx->img[ctx->len_cur++] = x;		\
+	} while (0)
+
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx)
+{
+	EMIT(BPF_MOV64_REG(BPF_REG_9, BPF_REG_1));
+	EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_9,
+			 offsetof(struct xdp_md, data)));
+	EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_9,
+			 offsetof(struct xdp_md, data_end)));
+	EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, ETH_HLEN));
+	EMIT(BPF_JMP_REG(BPF_JLE, BPF_REG_1, BPF_REG_3, 2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx)
+{
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+static int bpfilter_gen_check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF	"\xff\xff\xff\xff"
+	static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+	static const __u8 mask0[IFNAMSIZ] = { };
+	int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+	if (strlen(ent->out_iface) > 0)
+		return -ENOTSUPP;
+	if (memcmp(ent->in_iface_mask, mask1, ones) ||
+	    memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+		return -ENOTSUPP;
+	if ((ent->src_mask != 0 && ent->src_mask != 0xffffffff) ||
+	    (ent->dst_mask != 0 && ent->dst_mask != 0xffffffff))
+		return -ENOTSUPP;
+
+	return 0;
+}
+
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+			struct bpfilter_ipt_ip *ent, int verdict)
+{
+	u32 match_xdp = verdict == -1 ? XDP_DROP : XDP_PASS;
+	int ret, ifindex, match_state = 0;
+
+	/* convention R1: tmp, R2: data, R3: data_end, R9: xdp_buff */
+	ret = bpfilter_gen_check_entry(ent);
+	if (ret < 0)
+		return ret;
+	if (ent->src_mask == 0 && ent->dst_mask == 0)
+		return 0;
+
+	ifindex = if_nametoindex(ent->in_iface);
+	if (!ifindex)
+		return 0;
+	if (ctx->ifindex && ctx->ifindex != ifindex)
+		return -ENOTSUPP;
+
+	ctx->ifindex = ifindex;
+	match_state = !!ent->src_mask + !!ent->dst_mask;
+
+	EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_5, 0));
+	EMIT(BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1,
+			 offsetof(struct ethhdr, h_proto)));
+	EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, htons(ETH_P_IP),
+			 3 + match_state * 3));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+			   sizeof(struct ethhdr) + sizeof(struct iphdr)));
+	EMIT(BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1 + match_state * 3));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -(int)sizeof(struct iphdr)));
+	if (ent->src_mask) {
+		EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				 offsetof(struct iphdr, saddr)));
+		EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->src, 1));
+		EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+	}
+	if (ent->dst_mask) {
+		EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				 offsetof(struct iphdr, daddr)));
+		EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->dst, 1));
+		EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+	}
+	EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_5, match_state, 2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, match_xdp));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx)
+{
+	int ret;
+
+	ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+			    ctx->len_cur, ctx->ifindex);
+	if (ret > 0)
+		ctx->offloaded = true;
+	if (ret < 0)
+		ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+				    ctx->len_cur, 0);
+	if (ret > 0) {
+		ctx->fd = ret;
+		ret = bpfilter_load_dev(ctx);
+	}
+
+	return ret < 0 ? ret : 0;
+}
+
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx)
+{
+	free(ctx->img);
+	close(ctx->fd);
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
index 699f3f6..14e621a 100644
--- a/net/bpfilter/init.c
+++ b/net/bpfilter/init.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
+
+#include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
 static struct bpfilter_table filter_table_ipv4 = {
@@ -22,12 +24,13 @@ int bpfilter_ipv4_init(void)
 	if (err)
 		return err;
 
-	info = bpfilter_ipv4_table_ctor(t);
+	info = bpfilter_ipv4_table_alloc(t, 0);
+	if (!info)
+		return -ENOMEM;
+	info = bpfilter_ipv4_table_finalize(t, info, 0, 0);
 	if (!info)
 		return -ENOMEM;
-
 	t->info = info;
-
 	return bpfilter_table_add(&filter_table_ipv4);
 }
 
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
index 43687da..26ad12a 100644
--- a/net/bpfilter/sockopt.c
+++ b/net/bpfilter/sockopt.c
@@ -1,10 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
 #include <string.h>
 #include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
+/* TODO: Get all of this in here properly done in encoding/decoding layer. */
 static int fetch_name(void *addr, int len, char *name, int name_len)
 {
 	if (copy_from_user(name, addr, name_len))
@@ -55,12 +59,17 @@ int bpfilter_get_info(void *addr, int len)
 	return err;
 }
 
-static int copy_target(struct bpfilter_standard_target *ut,
-		       struct bpfilter_standard_target *kt)
+static int target_u2k(struct bpfilter_standard_target *kt)
 {
-	struct bpfilter_target *tgt;
-	int sz;
+	kt->target.u.kernel.target =
+		bpfilter_target_get_by_name(kt->target.u.user.name);
+	return kt->target.u.kernel.target ? 0 : -EINVAL;
+}
 
+static int target_k2u(struct bpfilter_standard_target *ut,
+		      struct bpfilter_standard_target *kt)
+{
+	struct bpfilter_target *tgt;
 
 	if (put_user(kt->target.u.target_size,
 		     &ut->target.u.target_size))
@@ -69,12 +78,9 @@ static int copy_target(struct bpfilter_standard_target *ut,
 	tgt = kt->target.u.kernel.target;
 	if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
 		return -EFAULT;
-
 	if (put_user(tgt->rev, &ut->target.u.user.revision))
 		return -EFAULT;
-
-	sz = tgt->size;
-	if (copy_to_user(ut->target.data, kt->target.data, sz))
+	if (copy_to_user(ut->target.data, kt->target.data, tgt->size))
 		return -EFAULT;
 
 	return 0;
@@ -84,30 +90,25 @@ static int do_get_entries(void *up,
 			  struct bpfilter_table *tbl,
 			  struct bpfilter_table_info *info)
 {
-	unsigned int total_size = info->size;
 	const struct bpfilter_ipt_entry *ent;
+	unsigned int total_size = info->size;
+	void *base = info->entries;
 	unsigned int off;
-	void *base;
-
-	base = info->entries;
 
 	for (off = 0; off < total_size; off += ent->next_offset) {
-		struct bpfilter_xt_counters *cntrs;
 		struct bpfilter_standard_target *tgt;
+		struct bpfilter_xt_counters *cntrs;
 
 		ent = base + off;
 		if (copy_to_user(up + off, ent, sizeof(*ent)))
 			return -EFAULT;
-
-		/* XXX Just clear counters for now. XXX */
+		/* XXX: Just clear counters for now. */
 		cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
 		if (put_user(0, &cntrs->packet_cnt) ||
 		    put_user(0, &cntrs->byte_cnt))
 			return -EINVAL;
-
-		tgt = (void *) ent + ent->target_offset;
-		dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
-		if (copy_target(up + off + ent->target_offset, tgt))
+		tgt = (void *)ent + ent->target_offset;
+		if (target_k2u(up + off + ent->target_offset, tgt))
 			return -EFAULT;
 	}
 	return 0;
@@ -123,31 +124,113 @@ int bpfilter_get_entries(void *cmd, int len)
 
 	if (len < sizeof(struct bpfilter_ipt_get_entries))
 		return -EINVAL;
-
 	if (copy_from_user(&req, cmd, sizeof(req)))
 		return -EFAULT;
-
 	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
 	if (!tbl)
 		return -ENOENT;
-
 	info = tbl->info;
 	if (!info) {
 		err = -ENOENT;
 		goto out_put;
 	}
-
 	if (info->size != req.size) {
 		err = -EINVAL;
 		goto out_put;
 	}
-
 	err = do_get_entries(uptr->entries, tbl, info);
-	dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
-
 out_put:
 	bpfilter_table_put(tbl);
+	return err;
+}
 
+static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
+			  struct bpfilter_table *tbl)
+{
+	unsigned int total_size = req->size;
+	struct bpfilter_table_info *info;
+	struct bpfilter_ipt_entry *ent;
+	struct bpfilter_gen_ctx ctx;
+	unsigned int off, sents = 0, ents = 0;
+	int ret;
+
+	ret = bpfilter_gen_init(&ctx);
+	if (ret < 0)
+		return ret;
+	ret = bpfilter_gen_prologue(&ctx);
+	if (ret < 0)
+		return ret;
+	info = bpfilter_ipv4_table_alloc(tbl, total_size);
+	if (!info)
+		return -ENOMEM;
+	if (copy_from_user(&info->entries[0], base, req->size)) {
+		free(info);
+		return -EFAULT;
+	}
+	base = &info->entries[0];
+	for (off = 0; off < total_size; off += ent->next_offset) {
+		struct bpfilter_standard_target *tgt;
+		ent = base + off;
+		ents++;
+		sents += ent->next_offset;
+		tgt = (void *) ent + ent->target_offset;
+		target_u2k(tgt);
+		ret = bpfilter_gen_append(&ctx, &ent->ip, tgt->verdict);
+		if (ret < 0)
+			goto err;
+	}
+	info->num_entries = ents;
+	info->size = sents;
+	memcpy(info->hook_entry, req->hook_entry, sizeof(info->hook_entry));
+	memcpy(info->underflow, req->underflow, sizeof(info->underflow));
+	ret = bpfilter_gen_epilogue(&ctx);
+	if (ret < 0)
+		goto err;
+	ret = bpfilter_gen_commit(&ctx);
+	if (ret < 0)
+		goto err;
+	free(tbl->info);
+	tbl->info = info;
+	bpfilter_gen_destroy(&ctx);
+	dprintf(debug_fd, "offloaded %u\n", ctx.offloaded);
+	return ret;
+err:
+	free(info);
+	return ret;
+}
+
+int bpfilter_set_replace(void *cmd, int len)
+{
+	struct bpfilter_ipt_replace *uptr = cmd;
+	struct bpfilter_ipt_replace req;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len < sizeof(req))
+		return -EINVAL;
+	if (copy_from_user(&req, cmd, sizeof(req)))
+		return -EFAULT;
+	if (req.num_counters >= INT_MAX / sizeof(struct bpfilter_xt_counters))
+		return -ENOMEM;
+	if (req.num_counters == 0)
+		return -EINVAL;
+	req.name[sizeof(req.name) - 1] = 0;
+	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+	if (!tbl)
+		return -ENOENT;
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+	err = do_set_replace(&req, uptr->entries, tbl);
+out_put:
+	bpfilter_table_put(tbl);
 	return err;
 }
 
+int bpfilter_set_add_counters(void *cmd, int len)
+{
+	return 0;
+}
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
index 9a96599..e0dab28 100644
--- a/net/bpfilter/tables.c
+++ b/net/bpfilter/tables.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
 #include <string.h>
+
+#include <sys/socket.h>
+
 #include <linux/hashtable.h>
+
 #include "bpfilter_mod.h"
 
 static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
index eac5e8a..0a00bc28 100644
--- a/net/bpfilter/tgts.c
+++ b/net/bpfilter/tgts.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
 struct bpfilter_target std_tgt = {
diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
index 26e544f..159a64580 100644
--- a/net/ipv4/bpfilter/sockopt.c
+++ b/net/ipv4/bpfilter/sockopt.c
@@ -7,15 +7,17 @@ struct sock;
 
 extern struct wait_queue_head bpfilter_get_cmd_wq;
 extern struct wait_queue_head bpfilter_reply_wq;
+
 extern bool bpfilter_get_cmd_ready;
 extern bool bpfilter_reply_ready;
-extern struct bpfilter_get_cmd bpfilter_get_cmd_mbox;
-extern struct bpfilter_reply bpfilter_reply_mbox;
+
+extern struct bpf_mbox_request bpfilter_get_cmd_mbox;
+extern struct bpf_mbox_reply bpfilter_reply_mbox;
 
 bool loaded = false;
 
-int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
-			    unsigned int optlen)
+int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
+			  unsigned int optlen, int kind)
 {
 	int err;
 
@@ -26,17 +28,29 @@ int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
 //			return err;
 		loaded = true;
 	}
+
+	bpfilter_get_cmd_mbox.subsys = BPF_MBOX_SUBSYS_BPFILTER;
+	bpfilter_get_cmd_mbox.kind = kind;
 	bpfilter_get_cmd_mbox.pid = current->pid;
 	bpfilter_get_cmd_mbox.cmd = optname;
 	bpfilter_get_cmd_mbox.addr = (long) optval;
 	bpfilter_get_cmd_mbox.len = optlen;
 	bpfilter_get_cmd_ready = true;
+
 	wake_up(&bpfilter_get_cmd_wq);
 	wait_event_killable(bpfilter_reply_wq, bpfilter_reply_ready);
 	bpfilter_reply_ready = false;
+
 	return bpfilter_reply_mbox.status;
 }
 
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+			    unsigned int optlen)
+{
+	return bpfilter_mbox_request(sk, optname, optval, optlen,
+				     BPF_MBOX_KIND_SET);
+}
+
 int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
 			    int __user *optlen)
 {
@@ -45,5 +59,6 @@ int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
 	if (get_user(len, optlen))
 		return -EFAULT;
 
-	return bpfilter_ip_set_sockopt(sk, optname, optval, len);
+	return bpfilter_mbox_request(sk, optname, optval, len,
+				     BPF_MBOX_KIND_GET);
 }
-- 
2.9.5

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
                   ` (3 preceding siblings ...)
  2018-02-16 13:40 ` [PATCH RFC 4/4] bpf: rough bpfilter codegen example hack Daniel Borkmann
@ 2018-02-16 14:57 ` Florian Westphal
  2018-02-16 16:14   ` Florian Westphal
                     ` (2 more replies)
  2018-02-17 12:11 ` Harald Welte
  5 siblings, 3 replies; 18+ messages in thread
From: Florian Westphal @ 2018-02-16 14:57 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, netfilter-devel, davem, alexei.starovoitov

Daniel Borkmann <daniel@iogearbox.net> wrote:
> This is a very rough and early proof of concept that implements bpfilter.

[..]

> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
> into HW for free for Netronome NFP SmartNICs that are already capable of
> offloading BPF since we can reuse all existing BPF infrastructure as the
> back end. The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.
>
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
>
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).

Several questions spinning at the moment, I will probably come up with
more:
1. Does this still attach the binary blob to the 'normal' iptables
   hooks?
2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
   programs being different in nature (e.g. changed by different entities)?
3. What happens if the rule can't be translated (yet)?
4. Do you plan to reimplement connection tracking in userspace?
   If no, how will the bpf program interact with it?
   [ same question applies to ipv6 exthdr traversal, ip defragmentation
   and the like ].

I will probably have a quadrillion of followup questions, sorry :-/

> Also, such architecture makes the kernel/user boundary very precise,
> meaning requests can be handled and BPF translated in control plane part
> in user space with its own user memory etc, while minimal data plane
> bits are in kernel. It would also allow to remove old xtables modules
> at some point from the kernel while keeping functionality in place.

This is what we tried with nftables :-/

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
@ 2018-02-16 16:14   ` Florian Westphal
  2018-02-16 20:44     ` Daniel Borkmann
  2018-02-16 22:33     ` David Miller
  2018-02-16 16:53   ` Daniel Borkmann
  2018-02-16 22:32   ` David Miller
  2 siblings, 2 replies; 18+ messages in thread
From: Florian Westphal @ 2018-02-16 16:14 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Daniel Borkmann, netdev, netfilter-devel, davem,
	alexei.starovoitov

Florian Westphal <fw@strlen.de> wrote:
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> Several questions spinning at the moment, I will probably come up with
> more:

... and here there are some more ...

One of the many pain points of xtables design is the assumption of 'used
only by sysadmin'.

This has not been true for a very long time, so by now iptables has
this userspace lock (yes, it's a fugly workaround) to serialize concurrent
iptables invocations in userspace.
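
A minimal sketch of such a userspace lock, assuming an advisory flock()
on a well-known file. The path below is a hypothetical one chosen for the
example, not the file iptables itself uses, and the helper names are
illustrative only:

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file for this sketch only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "demo-xtables.lock")


@contextmanager
def ruleset_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock for the duration of the with-block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks while another holder exists
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def try_lock(path=LOCK_PATH):
    """Return True if the lock could be taken without blocking."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # somebody else is modifying the ruleset
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

The lock is purely advisory: it only serializes tools that agree to take
it, so a program talking to the setsockopt interface directly is not
covered.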

AFAIU the translate-in-userspace design now brings back the old problem
of different tools overwriting each other's iptables rules.

Another question -- am I correct in that each rule manipulation would
incur a 'recompilation'?  Or are there different mini programs chained
together?

One of the nftables advantages is that, since the rule representation in
the kernel is a black box from the userspace point of view, the kernel
can announce add/delete of rules or elements from nftables sets.

Any particular reason why translating iptables rather than nftables
(it should be possible to monitor the nftables changes that are
 announced by kernel and act on those)?

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 16:14   ` Florian Westphal
@ 2018-02-16 20:44     ` Daniel Borkmann
  2018-02-17 12:33       ` Harald Welte
  2018-02-17 19:18       ` Florian Westphal
  2018-02-16 22:33     ` David Miller
  1 sibling, 2 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 20:44 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, netfilter-devel, davem, alexei.starovoitov

Hi Florian,

On 02/16/2018 05:14 PM, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
>> Daniel Borkmann <daniel@iogearbox.net> wrote:
>> Several questions spinning at the moment, I will probably come up with
>> more:
> 
> ... and here there are some more ...
> 
> One of the many pain points of xtables design is the assumption of 'used
> only by sysadmin'.
> 
> This has not been true for a very long time, so by now iptables has
> this userspace lock (yes, its fugly workaround) to serialize concurrent
> iptables invocations in userspace.
> 
> AFAIU the translate-in-userspace design now brings back the old problem
> of different tools overwriting each others iptables rules.

Right, so the behavior would need to be adapted to be exactly the same,
given all the requests go into kernel space first via the usual uapis,
I don't think there would be anything in the way of keeping that as is.

> Another question -- am i correct in that each rule manipulation would
> incur a 'recompilation'?  Or are there different mini programs chained
> together?

Right now in the PoC yes, basically it regenerates the program on the fly
in gen.c when walking the struct bpfilter_ipt_ip's and appends the entries
to the program, but it doesn't have to be that way. There are multiple
options to allow for a partial code generation, e.g. via chaining tail
call arrays or directly via BPF to BPF calls eventually. A few changes would
be needed on the BPF side, but it can be done; additionally, various
optimization passes could be performed during the code generation phase while
keeping the given constraints, in order to speed up getting to a verdict.
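
The difference between full regeneration and chained per-rule programs can
be sketched roughly as follows. This is a plain Python model, not BPF:
compile_rule() stands in for code generation, and the Monolithic and
Chained classes are hypothetical names used only for this illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Packet = Dict[str, object]
Prog = Callable[[Packet], Optional[str]]


def compile_rule(match: Dict[str, object], verdict: str) -> Prog:
    """Stand-in for code generation: turn one rule into one callable."""
    def prog(pkt: Packet) -> Optional[str]:
        hit = all(pkt.get(k) == v for k, v in match.items())
        return verdict if hit else None
    return prog


def run(progs: List[Prog], pkt: Packet, default: str = "ACCEPT") -> str:
    """Walk the programs in order; the first one with a verdict wins."""
    for prog in progs:
        verdict = prog(pkt)
        if verdict is not None:
            return verdict
    return default


class Monolithic:
    """Regenerates code for every rule whenever one rule is added."""

    def __init__(self) -> None:
        self.rules: List[Tuple[Dict[str, object], str]] = []
        self.progs: List[Prog] = []
        self.compiles = 0

    def append(self, match: Dict[str, object], verdict: str) -> None:
        self.rules.append((match, verdict))
        self.progs = [compile_rule(m, v) for m, v in self.rules]
        self.compiles += len(self.rules)


class Chained:
    """Keeps one slot per rule; adding a rule only generates that slot."""

    def __init__(self) -> None:
        self.progs: List[Prog] = []
        self.compiles = 0

    def append(self, match: Dict[str, object], verdict: str) -> None:
        self.progs.append(compile_rule(match, verdict))
        self.compiles += 1
```

Both variants reach the same verdicts; what differs is how much code has
to be generated per change, which grows with table size in the first case
and stays constant in the second.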

> One of the nftables advantages is that (since rule representation in
> kernel is black-box from userspace point of view) is that the kernel
> can announce add/delete of rules or elements from nftables sets.
> 
> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
>  announced by kernel and act on those)?

Yeah, correct, this should be possible as well. We started out with the
iptables part in the demo as the majority of bigger infrastructure projects
all still rely heavily on it (e.g. docker, k8s to just name two big ones).
Usually they have their requests to iptables baked into their code directly
which probably won't change any time soon, so thought was that they could
benefit initially from it once there would be sufficient coverage.

Thanks,
Daniel

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 20:44     ` Daniel Borkmann
@ 2018-02-17 12:33       ` Harald Welte
  2018-02-17 19:18       ` Florian Westphal
  1 sibling, 0 replies; 18+ messages in thread
From: Harald Welte @ 2018-02-17 12:33 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Florian Westphal, netdev, netfilter-devel, davem,
	alexei.starovoitov

Hi Daniel,

On Fri, Feb 16, 2018 at 09:44:01PM +0100, Daniel Borkmann wrote:
> We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

docker is exec'ing the iptables command line program.  So one could simply
offer a syntactically compatible userspace replacement that does the compilation
in userspace and avoid the iptables->libiptc->setsockopt->userspace roundtrip
and the associated changes to the kernel module loader you introduced.

kubernetes is using iptables-restore, which is part of iptables and
again has the same syntax.  However, it avoids the per-rule fork+exec
overhead, which is why the netfilter project has been recommending it to
be used in such situations.

Do you have a list of known projects that use the legacy sockopt-based
iptables uapi directly, without using code from the iptables.git
codebase (e.g. libiptc, iptables or iptables-restore)?  IMHO only
those projects would benefit from the approach you have taken vs. an
approach that simply offers a compatible commandline syntax.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

If the binary offers the same syntax (it could even be a fork/version
of the iptables codebase, only using the parsing without the existing
backend generating the rules), the same goal could be achieved.

The above of course assumes that you have a 100% functional replacement
(for 100% of the features that your use cases use) underneath the
"iptables command syntax" compatibility.  But you need that in both
cases, whether you use the existing userspace api or not.

Regards,
	Harald
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 20:44     ` Daniel Borkmann
  2018-02-17 12:33       ` Harald Welte
@ 2018-02-17 19:18       ` Florian Westphal
  1 sibling, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2018-02-17 19:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Florian Westphal, netdev, netfilter-devel, davem,
	alexei.starovoitov

Daniel Borkmann <daniel@iogearbox.net> wrote:
> Hi Florian,
> 
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal <fw@strlen.de> wrote:
> >> Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> > 
> > ... and here there are some more ...
> > 
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> > 
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> > 
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
> 
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff.  This isn't solvable.  At least that's what I tried to say here.
This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables other
daemons are not aware of it.

If someone deletes a rule added by $software it won't learn that either.

The "solutions" in place now (periodic reloads, 'is my rule still in
place' checks, etc.) are not desirable long-term.
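
For contrast, a rough model of an interface where a single party mediates
every change and can therefore announce it to interested listeners, so no
polling is needed. MediatedRuleset and its methods are hypothetical names
for this sketch and do not correspond to any kernel or nftables API:

```python
from typing import Callable, List, Tuple

Event = Tuple[str, str, str]  # (operation, rule, owner)


class MediatedRuleset:
    """All changes go through one object, which announces each of them."""

    def __init__(self) -> None:
        self._rules: List[str] = []
        self._listeners: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._listeners.append(callback)

    def add(self, rule: str, owner: str) -> None:
        self._rules.append(rule)
        self._announce(("add", rule, owner))

    def delete(self, rule: str, owner: str) -> None:
        self._rules.remove(rule)
        self._announce(("delete", rule, owner))

    def rules(self) -> List[str]:
        return list(self._rules)

    def _announce(self, event: Event) -> None:
        # Every subscriber learns about the change as it happens.
        for callback in self._listeners:
            callback(event)
```

A daemon that subscribed would see both its own additions and a deletion
made by someone else, instead of having to re-check the table periodically.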

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations
for all matches and targets xtables currently has (almost 100, I would
guess from a quick glance).

Some of the more crazy ones also have external user visible interfaces
outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> > 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete ip/ip6tables binaries on my
fedora27 laptop and replaced them with symlinks to
'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' to use netlink) and
one translation issue (-m rpfilter, which can be translated in current
upstream version) this works out of the box, the translator uses
nftables api to kernel (so kernel doesn't even know which program is
talking...), 'nft monitor' displays the rules being added, and
'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will
stop working once you add nft-based rules that use features that cannot
be expressed in xtables syntax (it will throw an error message similar
to 'you are using nftables features not available in xtables, please use
nft'), for instance verdict maps, sets and the like.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.
The more extreme cases are not covered because we were reluctant to
provide an equivalent in nftables (-m time comes to mind, which was always a
PITA because the kernel has no notion of timezone or DST transitions,
leading to 'magic' mismatches when the timezone changes...).
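
A small worked example of that kind of mismatch, under the assumption that
a local time window is converted to UTC once, when the rule is installed.
Fixed offsets are used for CET/CEST to keep the sketch self-contained; the
function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

CET = timezone(timedelta(hours=1), "CET")    # winter offset
CEST = timezone(timedelta(hours=2), "CEST")  # summer offset


def to_utc_window(start: time, end: time, tz: timezone, day: date):
    """Convert a local time window to UTC clock times with a fixed offset."""
    def conv(t: time) -> time:
        local = datetime.combine(day, t, tzinfo=tz)
        return local.astimezone(timezone.utc).time()
    return conv(start), conv(end)


def matches(window, now: datetime) -> bool:
    """What a timezone-unaware matcher sees: only the UTC clock time."""
    start, end = window
    return start <= now.astimezone(timezone.utc).time() < end


# Rule "09:00-17:00 local time", installed in January while CET applies.
window = to_utc_window(time(9), time(17), CET, date(2018, 1, 15))

winter_morning = datetime(2018, 1, 15, 9, 30, tzinfo=CET)
summer_morning = datetime(2018, 7, 15, 9, 30, tzinfo=CEST)
summer_evening = datetime(2018, 7, 15, 17, 30, tzinfo=CEST)

print(matches(window, winter_morning))  # inside the window, as intended
print(matches(window, summer_morning))  # 07:30 UTC: no longer matches
print(matches(window, summer_evening))  # 15:30 UTC: matches after hours
```

After the DST switch the stored window is off by one hour in local terms,
even though nobody touched the rule.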

I could expand on more problem cases but none of them are too
important I think.

If you'd like to have more ebpf users in the kernel, then there is at
least one use case where ebpf could be very attractive for nftables
(matching dynamic headers and the like).  This would be a new
feature and would need changes on nftables userspace side
as well (we don't have syntax/grammar to represent this in either
nft or iptables).

In most basic form, it would be nftables replacement for '-m string'
(and perhaps also -m bpf to some degree, depends on how it would be
 realized).

We can discuss more if there is interest, but I think it
would be more suitable for conference/face to face discussion.

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 16:14   ` Florian Westphal
  2018-02-16 20:44     ` Daniel Borkmann
@ 2018-02-16 22:33     ` David Miller
  2018-02-17 12:21       ` Harald Welte
  2018-02-17 20:10       ` Florian Westphal
  1 sibling, 2 replies; 18+ messages in thread
From: David Miller @ 2018-02-16 22:33 UTC (permalink / raw)
  To: fw; +Cc: daniel, netdev, netfilter-devel, alexei.starovoitov

From: Florian Westphal <fw@strlen.de>
Date: Fri, 16 Feb 2018 17:14:08 +0100

> Any particular reason why translating iptables rather than nftables
> (it should be possible to monitor the nftables changes that are
>  announced by kernel and act on those)?

As Daniel said, iptables is by far the most deployed of the two
technologies.  Therefore it provides the largest environment for
testing and coverage.

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 22:33     ` David Miller
@ 2018-02-17 12:21       ` Harald Welte
  2018-02-17 20:10       ` Florian Westphal
  1 sibling, 0 replies; 18+ messages in thread
From: Harald Welte @ 2018-02-17 12:21 UTC (permalink / raw)
  To: David Miller; +Cc: fw, daniel, netdev, netfilter-devel, alexei.starovoitov

Hi David,

On Fri, Feb 16, 2018 at 05:33:54PM -0500, David Miller wrote:
> From: Florian Westphal <fw@strlen.de>
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

As I outlined earlier, this way you are perpetuating the architectural
mistakes and constraints that were created ~ 18 years ago without any
benefit from the lessons learned ever since.  In netfilter, we already
wanted to replace it as early as 2006 (AFAIR) with nfnetlink based
pkttables (which never materialized).

I would strongly suggest to focus on nftables (or even some other way of
configuration / userspace interaction) to ensure that the iptables
userspace interface can at some point be phased out eventually.  Like we
did with ipchains before, and before that with ipfwadm.

By making a new implementation dependent on the oldest interface you are
perpetuating it.  Sure, one can go that way, but I would suggest this to
be a *very* carefully weighed decision after a detailed
analysis/discussion.

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 22:33     ` David Miller
  2018-02-17 12:21       ` Harald Welte
@ 2018-02-17 20:10       ` Florian Westphal
  2018-02-17 22:38         ` Florian Westphal
  1 sibling, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2018-02-17 20:10 UTC (permalink / raw)
  To: David Miller; +Cc: fw, daniel, netdev, netfilter-devel, alexei.starovoitov

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Fri, 16 Feb 2018 17:14:08 +0100
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking old blob format comes with
lots of limitations that were meant to be resolved with a netlink based
interface which places kernel in a position to mediate all transactions
to the rule database (which isn't fixable with old setsockopt format).

As all programs call iptables(-restore) or variants, translation to nftables
can be done in userspace, so the API spoken is nfnetlink.
Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
        chain INPUT {
                type filter hook input priority 0; policy accept;
                ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
                ip saddr 192.168.0.0/16 tcp dport 22 counter accept
                iifname "eth0" ct state related,established counter accept
                ip protocol icmp counter packets 0 bytes 0 accept
                counter packets 0 bytes 0 jump REJECT_LOG
        }

        chain FORWARD {
                type filter hook forward priority 0; policy accept;
        }

        chain OUTPUT {
                type filter hook output priority 0; policy accept;
        }

        chain REJECT_LOG {
                iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
                counter packets 0 bytes 0 drop
        }
}

and, while the 'iptables' rules were added, nft monitor in a different terminal:
nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; }
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)
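
The mapping visible in the transcripts above can be illustrated with a toy
translator. This is only a sketch: it handles a handful of options
(-A, -i, -s, -p, --dport, -j), always targets "ip filter", and is not how
xtables-compat-multi or iptables-translate are implemented. The function
name translate() is hypothetical.

```python
import shlex

SUPPORTED = ("-A", "-i", "-s", "-p", "--dport", "-j")


def translate(cmdline: str) -> str:
    """Translate a very small subset of iptables syntax to an nft command."""
    args = shlex.split(cmdline)
    if args and args[0] == "iptables":
        args = args[1:]

    opts = {}
    it = iter(args)
    for flag in it:
        if flag not in SUPPORTED:
            raise ValueError("unsupported option: " + flag)
        value = next(it, None)
        if value is None:
            raise ValueError("missing argument for " + flag)
        opts[flag] = value
    if "-A" not in opts or "-j" not in opts:
        raise ValueError("need both -A CHAIN and -j TARGET")

    parts = []
    if "-i" in opts:
        parts.append('iifname "%s"' % opts["-i"])
    if "-s" in opts:
        parts.append("ip saddr " + opts["-s"])
    proto = opts.get("-p")
    if "--dport" in opts:
        if proto not in ("tcp", "udp"):
            raise ValueError("--dport needs -p tcp or -p udp")
        # iptables writes port ranges as 22:80, nft as 22-80
        parts.append("%s dport %s" % (proto, opts["--dport"].replace(":", "-")))
    elif proto is not None:
        parts.append("ip protocol " + proto)

    parts.append("counter")
    target = opts["-j"]
    if target in ("ACCEPT", "DROP"):
        parts.append(target.lower())
    else:
        parts.append("jump " + target)  # user-defined chain
    return "add rule ip filter %s %s" % (opts["-A"], " ".join(parts))
```

Anything outside that subset, such as -m conntrack or -m limit, raises
ValueError here, which mirrors the question of what to do with rules that
cannot be translated yet.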

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would
not rm /sbin/iptables and replace it with a link to xtables-compat-multi just yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously
can't translate this, same for ipsec sa/policy matching -- but this isn't
impossible to resolve).

Hopefully this does show that at least some commonly used features work
and that we've come a long way to make seamless nftables transition happen.

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-17 20:10       ` Florian Westphal
@ 2018-02-17 22:38         ` Florian Westphal
  0 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2018-02-17 22:38 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, daniel, netdev, netfilter-devel, alexei.starovoitov

Florian Westphal <fw@strlen.de> wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Florian Westphal <fw@strlen.de>
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> > 
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > >  announced by kernel and act on those)?
> > 
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies.  Therefore it provides the largest environment for
> > testing and coverage.
> 
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
> 
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
> 
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"

to be fair, for these two I had to use
$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)

Reason is that the 'iptables-translate' part nowadays has way more
translations available (nft gained many features since the
iptables-compat layer was added).

If given appropriate priority, however, it should be pretty
trivial to make the 'translate' descriptions available in
the 'direct' version; we already have a function in libnftables
to execute/run a command directly from a buffer, so this would
not even need fork/execve overhead (although I don't think
it's a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we obviously
> can't translate this, same for ipsec sa/policy matching -- but this isn't
> impossible to resolve).

I am working on some poc code for the sa/policy thing now.

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
  2018-02-16 16:14   ` Florian Westphal
@ 2018-02-16 16:53   ` Daniel Borkmann
  2018-02-16 22:32   ` David Miller
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-02-16 16:53 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, netfilter-devel, davem, alexei.starovoitov

Hi Florian,

thanks for your feedback! More inline:

On 02/16/2018 03:57 PM, Florian Westphal wrote:
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>> This is a very rough and early proof of concept that implements bpfilter.
> 
> [..]
> 
>> Also, as a benefit from such design, we get BPF JIT compilation on x86_64,
>> arm64, ppc64, sparc64, mips64, s390x and arm32, but also rule offloading
>> into HW for free for Netronome NFP SmartNICs that are already capable of
>> offloading BPF since we can reuse all existing BPF infrastructure as the
>> back end. The user space iptables binary issuing rule addition or dumps was
>> left as-is, thus at some point any binaries against iptables uapi kernel
>> interface could transparently be supported in such manner in long term.
>>
>> As rule translation can potentially become very complex, this is performed
>> entirely in user space. In order to ease deployment, request_module() code
>> is extended to allow user mode helpers to be invoked. Idea is that user mode
>> helpers are built as part of the kernel build and installed as traditional
>> kernel modules with .ko file extension into distro specified location,
>> such that from a distribution point of view, they are no different than
>> regular kernel modules. Thus, allow request_module() logic to load such
>> user mode helper (umh) binaries via:
>>
>>   request_module("foo") ->
>>     call_umh("modprobe foo") ->
>>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>>         call_umh(struct file)
>>
>> Such approach enables kernel to delegate functionality traditionally done
>> by kernel modules into user space processes (either root or !root) and
>> reduces security attack surface of such new code, meaning in case of
>> potential bugs only the umh would crash but not the kernel. Another
>> advantage coming with that would be that bpfilter.ko can be debugged and
>> tested out of user space as well (e.g. opening the possibility to run
>> all clang sanitizers, fuzzers or test suites for checking translation).
> 
> Several questions spinning at the moment, I will probably come up with
> more:

Sure, no problem at all. It's an early RFC, so purpose is to get a
discussion going on such potential approach.

> 1. Does this still attach the binary blob to the 'normal' iptables
>    hooks?

Yeah, so thought would be to keep the user land tooling functional as
is w/o having to recompile binaries, thus this would also need to attach
for the existing hooks in order to keep semantics working. As a benefit
in addition we can also reuse all the rest of the infrastructure to utilize
things like XDP for iptables in the background, there is definitely
flexibility on this side thus users could eventually benefit from this
transparently and don't need to know that 'bpfilter' exists and is
translating in the background. I realize taking this path is a long term
undertaking that we would need to tackle as a community, not just one or
two individuals, if we decide to go in this direction.

> 2. If yes, do you see issues wrt. 'iptables' and 'bpfilter' attached
> programs being different in nature (e.g. changed by different entities)?

There could certainly be multiple options, e.g. a fall-through with state
transfer once a request cannot be handled yet or a sysctl with iptables
being the default handler and an option to switch to bpfilter for letting
it handle requests for that time being.

> 3. What happens if the rule can't be translated (yet?)

(See above.)

> 4. Do you plan to reimplement connection tracking in userspace?

One option could be to have a generic, skb-less connection tracker in kernel
that can be reused from the various hooks it would need to handle, potentially
that would also be able to get offloaded into HW as another benefit coming
out from that.

> If no, how will the bpf program interact with it?
> [ same question applies to ipv6 exthdr traversal, ip defragmentation
> and the like ].

The v6 exthdr traversal could be realized natively via BPF which should
make the parsing more robust at the same time than having it somewhere
inside a helper in kernel directly; bounded loops in BPF would help as
well on that front, similarly for defrag this could be handled by the prog
although here we would need additional infra to queue the packets and then
recirculate.
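
To illustrate what a bounded extension header walk could look like, here is
a sketch in plain Python rather than BPF. It assumes the payload starts
right after the fixed 40-byte IPv6 header, only knows the hop-by-hop,
routing, destination options and fragment headers, and uses an arbitrary
bound (MAX_STEPS) picked for the example:

```python
from typing import Optional, Tuple

HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60
VARIABLE_LEN = {HOP_BY_HOP, ROUTING, DEST_OPTS}
MAX_STEPS = 8  # arbitrary loop bound for this sketch


def find_upper_layer(payload: bytes, first_nexthdr: int) -> Optional[Tuple[int, int]]:
    """Return (protocol, offset) of the upper-layer header, or None.

    None means the chain was truncated or did not end within MAX_STEPS
    loop iterations.
    """
    nexthdr, off = first_nexthdr, 0
    for _ in range(MAX_STEPS):
        if nexthdr in VARIABLE_LEN:
            if off + 2 > len(payload):
                return None  # cannot even read the length byte
            # Length field counts 8-byte units, not including the first 8.
            hdrlen = (payload[off + 1] + 1) * 8
        elif nexthdr == FRAGMENT:
            hdrlen = 8  # the fragment header has a fixed size
        else:
            return nexthdr, off  # not an extension header we walk over
        if off + hdrlen > len(payload):
            return None  # header claims more bytes than we have
        nexthdr = payload[off]  # first byte of each header: Next Header
        off += hdrlen
    return None
```

Because the loop count is fixed up front, a crafted packet with an endless
header chain cannot make the walk run longer than MAX_STEPS iterations. The
sketch does not handle non-first fragments, where no upper-layer header is
present at all.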

> I will probably have a quadrillion of followup questions, sorry :-/

Definitely, please do!

Thanks,
Daniel

>> Also, such architecture makes the kernel/user boundary very precise,
>> meaning requests can be handled and BPF translated in control plane part
>> in user space with its own user memory etc, while minimal data plane
>> bits are in kernel. It would also allow to remove old xtables modules
>> at some point from the kernel while keeping functionality in place.
> 
> This is what we tried with nftables :-/

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
  2018-02-16 16:14   ` Florian Westphal
  2018-02-16 16:53   ` Daniel Borkmann
@ 2018-02-16 22:32   ` David Miller
  2 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2018-02-16 22:32 UTC (permalink / raw)
  To: fw; +Cc: daniel, netdev, netfilter-devel, alexei.starovoitov

From: Florian Westphal <fw@strlen.de>
Date: Fri, 16 Feb 2018 15:57:27 +0100

> 4. Do you plan to reimplement connection tracking in userspace?
> If no, how will the bpf program interact with it?

The natural way to handle this, as with anything BPF related, is with
appropriate BPF helpers which would be added for this purpose.

* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
                   ` (4 preceding siblings ...)
  2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
@ 2018-02-17 12:11 ` Harald Welte
  2018-02-18  0:35   ` Florian Westphal
  5 siblings, 1 reply; 18+ messages in thread
From: Harald Welte @ 2018-02-17 12:11 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, netfilter-devel, davem, alexei.starovoitov

Hi Daniel,

On Fri, Feb 16, 2018 at 02:40:19PM +0100, Daniel Borkmann wrote:
> This is a very rough and early proof of concept that implements bpfilter.
> The basic idea of bpfilter is that it can process iptables queries and
> translate them in user space into BPF programs which can then get attached
> at various locations. 

Interesting approach.  My first question would be what the goal of all of this
is.  For sure, one can implement many different things, but what is the use
case, and why do it this way?

I see several possible areas of contention:

1) If you aim for a non-feature-complete support of iptables rules, it
   will create confusion to the users.  When users use "iptables", they have
   assumptions on what it will do and how it will behave.  One can of course
   replace / refactor the internal implementation, if the resulting behavior
   is identical.  And that means rules are executed at the same hooks in the stack,
   with functionally identical matches and targets, provide the same
   counter semantics, etc.  But if the behavior is different, and/or the
   provided functionality is different, then why "hide" this new
   filtering technology behind iptables, rather than its own command
   line tool?  Such an alternative tool could share the same command
   line syntax as iptables, or even provide a converter/wrapper, but
   given that it would not be called "iptables" people will implicitly
   have different assumptions about it.

2) Why try to provide compatibility to iptables, when at the same time
   many people have already migrated (or are in the process of
   migrating) to nftables?  By using iptables semantics, structures,
   architecture, you risk perpetuating the design mistakes we made in
   iptables some 18 years ago for another decade or more.  From my POV,
   if one was to do eBPF optimized rule execution, it should be based on
   nftables rather than iptables.  This way you avoid the many
   architectural problems, such as
   * no incremental rule changes but only atomic swap of an entire table
     with all its chains
   * no common/shared rulesets for IPv4 + IPv6, which is very clumsy and
     often worked around with ugly shellscript wrappers in userspace
     which then call both iptables and ip6tables to add a rule to both
     rulesets.

> The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.

See my comments above:  In the netfilter community, we have known for at
least a decade about the many problems of the old iptables userspace
interface.  For many years, a much better replacement has been designed
as part of nftables.

> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. 

That just blew my mind, sorry :)  This goes much beyond
netfilter/iptables, and adds a quite significant new piece of
kernel/userspace infrastructure.  To me, my apologies, it just sounds
like a quite strange hack.  But then, I may lack the vision of how this
might be useful in other contexts.

I'm trying to understand why exactly one would
* use an 18-year-old iptables userspace program with its equally old
  setsockopt based interface between kernel and userspace
* insert an entire table with many chains of rules into the kernel
* eject that ruleset back out into another userspace program which then
  compiles it into an eBPF program
* insert that back into the kernel

To me, this looks like some kind of legacy backwards compatibility
mechanism that one would find in proprietary operating systems, but not
in Linux.  iptables, libiptc etc. are all free software.  The source
code can be edited, and you could just as well have a new version of
iptables and/or libiptc which would pass the ruleset in userspace to
your compiler, which would then insert the resulting eBPF program.

You could even have a LD_PRELOAD wrapper doing the same.  That one
would even work with direct users of the iptables setsockopt interface.

Why add quite comprehensive kernel infrastructure?  What's the motivation
here?

> Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
>     call_umh("modprobe foo") ->
>       sys_finit_module(FD of /lib/modules/.../foo.ko) ->
>         call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise,
> meaning requests can be handled and BPF translated in control plane part
> in user space with its own user memory etc, while minimal data plane
> bits are in kernel. 

I understand that having the compiler in userspace has advantages.
But then, why first send your rules into the kernel and back?

> In the implemented proof of concept we show that simple /32 src/dst IPs
> are translated in such manner. 

Of course this is the first thing that one starts with.  However, as we all
know, iptables was never very good or efficient about 5-tuple matching.
If you want a fast implementation of this, you don't use iptables which
does linear list iteration.  The reason/rationale/use-case of iptables
is its many (I believe more than 100 now?) extensions both in the area
of matches and targets.
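
As a rough illustration of the linear-iteration point (a sketch only, not
code from iptables or from this patch set; Rule, linear_verdict,
build_verdict_map and mapped_verdict are hypothetical names), compare a
first-match walk over exact src/dst rules with a precomputed hash lookup:

```python
from typing import Dict, List, NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical exact-match rule: one /32 source, one /32 destination."""
    src: str
    dst: str
    verdict: str


def linear_verdict(rules: List[Rule], src: str, dst: str,
                   default: str = "ACCEPT") -> str:
    """Walk the rules in order; the first match wins.

    The cost of one lookup grows with the number of rules.
    """
    for rule in rules:
        if rule.src == src and rule.dst == dst:
            return rule.verdict
    return default


def build_verdict_map(rules: List[Rule]) -> Dict[Tuple[str, str], str]:
    """Precompute a (src, dst) -> verdict table.

    setdefault keeps the earliest rule for a key, which preserves
    first-match-wins semantics for duplicate keys.
    """
    table: Dict[Tuple[str, str], str] = {}
    for rule in rules:
        table.setdefault((rule.src, rule.dst), rule.verdict)
    return table


def mapped_verdict(table: Dict[Tuple[str, str], str], src: str, dst: str,
                   default: str = "ACCEPT") -> str:
    """One hash lookup per packet, independent of the rule count."""
    return table.get((src, dst), default)
```

The two lookups agree only because every rule here is an exact match;
prefixes, ranges or extension matches would need a different structure.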

Some of those can be implemented easily in BPF (like recomputing the
checksum or the like).   Some others I would find much more difficult -
particularly if you want to off-load it to the NIC.  They require access
to state that only the kernel has (like 'cgroup' or 'owner' matching).

> In the below example, we show that dumping, loading and offloading of
> one or multiple simple rules work, we show the bpftool XDP dump of the
> generated BPF instruction sequence as well as a simple functional ping
> test to enforce policy in such way.

Could you please clarify why the 'filter' table INPUT chain was used if
you're using XDP?  AFAICT they have completely different semantics.

There is a well-conceived and generally understood notion of where
exactly the filter/INPUT table processing happens.  And that's not as
early as in the NIC, but it's much later in the processing of the
packet.

I believe _if_ one wants to use the approach of "hiding" eBPF behind
iptables, then either

a) the eBPF programs must be executed at the exact same points in the
   stack as the existing hooks of the built-in chains of the
   filter/nat/mangle/raw tables, or

b) you must introduce new 'tables', like an 'xdp' table which then has
   the notion of processing very early in processing, way before the
   normal filter table INPUT processing happens.

> Feedback very welcome!

Thanks.  Despite being a former netfilter core team member, I'm trying
to look at this as neutrally as possible.  So please don't perceive my
comments as overly defensive or the like.

My main points are:

1) What is the goal of this?

2) Why iptables and not nftables?

3) If something looks like existing iptables, it must behave *exactly*
   like existing iptables, otherwise it is prone to break users' security
   in subtle and very dangerous ways.

Looking forward to the ensuing discussion and to other points of view.

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)


* Re: [PATCH RFC 0/4] net: add bpfilter
  2018-02-17 12:11 ` Harald Welte
@ 2018-02-18  0:35   ` Florian Westphal
  0 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2018-02-18  0:35 UTC (permalink / raw)
  To: Harald Welte
  Cc: Daniel Borkmann, netdev, netfilter-devel, davem,
	alexei.starovoitov

Harald Welte <laforge@gnumonks.org> wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>    the notion of processing very early in processing, way before the
>    normal filter table INPUT processing happens.

In nftables, the netdev ingress hook location could be used for this,
but right, iptables has no equivalent.

netdev ingress is interesting from an hw-offload point of view,
unlike all other netfilter hooks it's tied to a specific network interface
rather than owned by the network namespace.

A rule like (yes i am making this up)
limit 10000 byte/s

cannot be offloaded because it affects all packets going through
the system, i.e. you'd need to share state among all nics which i think
won't work :-)

Same goes for any other match/target that somehow contains (global)
state and was added to the 'classic' iptables hook points.
(exception: rule restricts interface via '-i foo').
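
A toy model of why such a limit does not survive per-NIC offload (a sketch
with hypothetical names ByteBudget and admitted_bytes; refill over time is
left out, so it only covers a single one-second window):

```python
from typing import Callable, Iterable, Tuple


class ByteBudget:
    """Byte budget for one window of a 'limit N byte/s' style rule."""

    def __init__(self, limit: int) -> None:
        self.remaining = limit

    def allow(self, nbytes: int) -> bool:
        """Admit the packet if it still fits into the remaining budget."""
        if nbytes <= self.remaining:
            self.remaining -= nbytes
            return True
        return False


def admitted_bytes(packets: Iterable[Tuple[str, int]],
                   budget_for: Callable[[str], ByteBudget]) -> int:
    """Sum the sizes of admitted packets.

    packets is a sequence of (nic name, packet size); budget_for picks the
    budget that is consulted for a given NIC.
    """
    total = 0
    for nic, size in packets:
        if budget_for(nic).allow(size):
            total += size
    return total
```

With one budget shared by all interfaces the system-wide limit holds.  With
an independent budget per NIC, which is what offload without shared state
amounts to, two NICs together admit twice the configured limit.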

Note well: "offloaded != ebpf" in this case.

I see no reasons why ebpf cannot be used in either iptables or
nftables.  How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and
focus on nftables, for many reasons outlined in other replies.

And on how to best make use of ebpf+nftables.

In an ideal world, nftables would have used (e)bpf from the start.
But, well, it's not an ideal world (iirc nft origins are just a bit
too old).

That doesn't mean that we can't leverage ebpf from nftables.
It's just a question of where it makes sense and where it doesn't,
f.e. i see no reason to replace c code with ebpf just 'because you can'.

Speedup?  Good argument.
Feature enhancements that could use ebpf programs? Another good
argument.

I guess there are a lot more.

So I'd like to second Harald's question.

What is the main goal?

For nftables, I believe the most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules,
native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure

But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet
contents

I suspect you see that ebpf might be a fit and/or help us with
all of these things.

So, once I understand what your goals are I might be better able
to see how nftables could fit into the picture, as you can see
I did a lot of guesswork :-)


Thread overview: 18+ messages
2018-02-16 13:40 [PATCH RFC 0/4] net: add bpfilter Daniel Borkmann
2018-02-16 13:40 ` [PATCH RFC 1/4] modules: allow insmod load regular elf binaries Daniel Borkmann
2018-02-16 13:40 ` [PATCH RFC 2/4] bpf: introduce bpfilter commands Daniel Borkmann
2018-02-16 13:40 ` [PATCH RFC 3/4] net: initial bpfilter skeleton Daniel Borkmann
2018-02-16 13:40 ` [PATCH RFC 4/4] bpf: rough bpfilter codegen example hack Daniel Borkmann
2018-02-16 14:57 ` [PATCH RFC 0/4] net: add bpfilter Florian Westphal
2018-02-16 16:14   ` Florian Westphal
2018-02-16 20:44     ` Daniel Borkmann
2018-02-17 12:33       ` Harald Welte
2018-02-17 19:18       ` Florian Westphal
2018-02-16 22:33     ` David Miller
2018-02-17 12:21       ` Harald Welte
2018-02-17 20:10       ` Florian Westphal
2018-02-17 22:38         ` Florian Westphal
2018-02-16 16:53   ` Daniel Borkmann
2018-02-16 22:32   ` David Miller
2018-02-17 12:11 ` Harald Welte
2018-02-18  0:35   ` Florian Westphal
