[PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
@ 2018-03-14  3:39 Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

For our container management we've been using complicated and fragile setup
consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications.
The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable.
Changing apps is not possible and LD_PRELOAD doesn't work
for apps that don't use glibc like java and golang.
BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
  to let bpf prog look and modify 'struct sockaddr' provided
  by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

The main question for Daniel and Dave is what approach to take
with prog types...

In this patch set we introduce 6 new program types to make user
experience easier:
  BPF_PROG_TYPE_CGROUP_INET4_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_BIND,
  BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,

since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
and different prog type for v4 and v6 helps verifier reject such access
at load time.
Similarly bind vs connect are two different prog types too,
since only sys_connect programs can call new bpf_bind() helper.

This approach is very different from tcp-bpf where single
'struct bpf_sock_ops' and single prog type is used for different hooks.
The field checks are done at run-time instead of load time.

I think the approach taken by this patch set is justified,
but we may do better if we extend BPF_PROG_ATTACH cmd
with log_buf + log_size, then we should be able to combine
bind+connect+v4+v6 into single program type.
The idea that at load time the verifier will remember a bitmask
of fields in bpf_sock_addr used by the program and helpers
that program used, then at attach time we can check that
hook is compatible with features used by the program and
report human readable error message back via log_buf.
We cannot do this right now with just EINVAL, since combinations
of errors like 'using user_ip6 field but attaching to v4 hook'
are too high to express as errno.
This would be bigger change. If you folks think it's worth it
we can go with this approach or if you think 6 new prog types
is not too bad, we can leave the patch as-is.
Comments?
Other comments on patches are welcome.

Andrey Ignatov (6):
  bpf: Hooks for sys_bind
  selftests/bpf: Selftest for sys_bind hooks
  net: Introduce __inet_bind() and __inet6_bind
  bpf: Hooks for sys_connect
  selftests/bpf: Selftest for sys_connect hooks
  bpf: Post-hooks for sys_bind

 include/linux/bpf-cgroup.h                    |  68 +++-
 include/linux/bpf_types.h                     |   6 +
 include/linux/filter.h                        |  10 +
 include/net/inet_common.h                     |   2 +
 include/net/ipv6.h                            |   2 +
 include/net/sock.h                            |   3 +
 include/net/udp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  52 ++-
 kernel/bpf/cgroup.c                           |  36 ++
 kernel/bpf/syscall.c                          |  42 ++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 479 ++++++++++++++++++++++-
 net/ipv4/af_inet.c                            |  60 ++-
 net/ipv4/tcp_ipv4.c                           |  16 +
 net/ipv4/udp.c                                |  14 +
 net/ipv6/af_inet6.c                           |  47 ++-
 net/ipv6/tcp_ipv6.c                           |  16 +
 net/ipv6/udp.c                                |  20 +
 tools/include/uapi/linux/bpf.h                |  39 +-
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
 tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
 tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
 tools/testing/selftests/bpf/test_sock_addr.c  | 541 ++++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
 25 files changed, 1580 insertions(+), 53 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
 create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
 create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh

-- 
2.9.5

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14  6:21   ` Eric Dumazet
  2018-03-14 14:37   ` Daniel Borkmann
  2018-03-14  3:39 ` [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks Alexei Starovoitov
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program types `BPF_PROG_TYPE_CGROUP_INET4_BIND` and
`BPF_PROG_TYPE_CGROUP_INET6_BIND` and corresponding attach types
`BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already
existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program types are intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.

The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.

== Implementation notes ==

[1]
Separate prog/attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h |  21 ++++
 include/linux/bpf_types.h  |   2 +
 include/linux/filter.h     |  10 ++
 include/uapi/linux/bpf.h   |  24 +++++
 kernel/bpf/cgroup.c        |  36 +++++++
 kernel/bpf/syscall.c       |  14 +++
 kernel/bpf/verifier.c      |   2 +
 net/core/filter.c          | 242 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         |   7 ++
 net/ipv6/af_inet6.c        |   7 ++
 10 files changed, 365 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8a4566691c8f..dd0cfbddcfbe 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -6,6 +6,7 @@
 #include <uapi/linux/bpf.h>
 
 struct sock;
+struct sockaddr;
 struct cgroup;
 struct sk_buff;
 struct bpf_sock_ops_kern;
@@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
 			       enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type);
+
 int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 				     struct bpf_sock_ops_kern *sock_ops,
 				     enum bpf_attach_type type);
@@ -103,6 +108,20 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) 			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
+	__ret;								       \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
@@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349a3809..eefd877f8e68 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fdb691b520c0..fe469320feab 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
 	return SKF_AD_MAX;
 }
 
+struct bpf_sock_addr_kern {
+	struct sock *sk;
+	struct sockaddr *uaddr;
+	/* Temporary "register" to make indirect stores to nested structures
+	 * defined above. We need three registers to make such a store, but
+	 * only two (src and dst) are available at convert_ctx_access time
+	 */
+	u64 tmp_reg;
+};
+
 struct bpf_sock_ops_kern {
 	struct	sock *sk;
 	u32	op;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a66769e5875..78628a3f3cd8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -133,6 +133,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_CGROUP_INET4_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 };
 
 enum bpf_attach_type {
@@ -143,6 +145,8 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -953,6 +957,26 @@ struct bpf_map_info {
 	__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
  * Some of this fields are in network (bigendian) byte order and may need
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index c1c0b60d3f2f..78ef086a7c2d 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -495,6 +495,42 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
 
 /**
+ * __cgroup_bpf_run_filter_sock_addr() - Run a program on a sock and
+ *                                       provided by user sockaddr
+ * @sk: sock struct that will use sockaddr
+ * @uaddr: sockaddr struct provided by user
+ * @type: The type of program to be exectuted
+ *
+ * socket is expected to be of type INET or INET6.
+ *
+ * This function will return %-EPERM if an attached program is found and
+ * returned value != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type)
+{
+	struct bpf_sock_addr_kern ctx = {
+		.sk = sk,
+		.uaddr = uaddr,
+	};
+	struct cgroup *cgrp;
+	int ret;
+
+	/* Check socket family since not all sockets represent network
+	 * endpoint (e.g. AF_UNIX).
+	 */
+	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
+
+	return ret == 1 ? 0 : -EPERM;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_addr);
+
+/**
  * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
  * @sk: socket to get cgroup from
  * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa3241387..7f86542aa42c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1376,6 +1376,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_SOCK_CREATE:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
 		break;
+	case BPF_CGROUP_INET4_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
+		break;
+	case BPF_CGROUP_INET6_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1431,6 +1437,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_SOCK_CREATE:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
 		break;
+	case BPF_CGROUP_INET4_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
+		break;
+	case BPF_CGROUP_INET6_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1478,6 +1490,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
 	case BPF_CGROUP_INET_SOCK_CREATE:
+	case BPF_CGROUP_INET4_BIND:
+	case BPF_CGROUP_INET6_BIND:
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb79a34359c0..01b54afcb762 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3872,6 +3872,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 33edfa8372fd..78907cf3b42f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3443,6 +3443,20 @@ sock_filter_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+inet_bind_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	/* inet and inet6 sockets are created in a process
+	 * context so there is always a valid uid/gid
+	 */
+	case BPF_FUNC_get_current_uid_gid:
+		return &bpf_get_current_uid_gid_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3900,6 +3914,70 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+static bool __sock_addr_is_valid_access(unsigned short ctx_family, int off,
+					int size, enum bpf_access_type type,
+					struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+	unsigned short requested_family = 0;
+
+	if (off < 0 || off >= sizeof(struct bpf_sock_addr))
+		return false;
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sock_addr, user_ip4):
+		requested_family = AF_INET;
+		/* FALLTHROUGH */
+	case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
+		if (!requested_family)
+			requested_family = AF_INET6;
+		/* Disallow access to IPv6 fields from IPv4 contex and vise
+		 * versa.
+		 */
+		if (requested_family != ctx_family)
+			return false;
+		/* Only narrow read access allowed for now. */
+		if (type == BPF_READ) {
+			bpf_ctx_record_field_size(info, size_default);
+			if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+				return false;
+		} else {
+			if (size != size_default)
+				return false;
+		}
+		break;
+	case bpf_ctx_range(struct bpf_sock_addr, user_port):
+		if (size != size_default)
+			return false;
+		break;
+	default:
+		if (type == BPF_READ) {
+			if (size != size_default)
+				return false;
+		} else {
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static bool sock_addr4_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       struct bpf_insn_access_aux *info)
+{
+	return __sock_addr_is_valid_access(AF_INET, off, size, type, info);
+}
+
+static bool sock_addr6_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       struct bpf_insn_access_aux *info)
+{
+	return __sock_addr_is_valid_access(AF_INET6, off, size, type, info);
+}
+
 static bool sock_ops_is_valid_access(int off, int size,
 				     enum bpf_access_type type,
 				     struct bpf_insn_access_aux *info)
@@ -4415,6 +4493,152 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+/* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of
+ * context Structure, F is Field in context structure that contains a pointer
+ * to Nested Structure of type NS that has the field NF.
+ *
+ * SIZE encodes the load size (BPF_B, BPF_H, etc). It's up to caller to make
+ * sure that SIZE is not greater than actual size of S.F.NF.
+ *
+ * If offset OFF is provided, the load happens from that offset relative to
+ * offset of NF.
+ */
+#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, SIZE, OFF)	       \
+	do {								       \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,     \
+				      si->src_reg, offsetof(S, F));	       \
+		*insn++ = BPF_LDX_MEM(					       \
+			SIZE, si->dst_reg, si->dst_reg,			       \
+			bpf_target_off(NS, NF, FIELD_SIZEOF(NS, NF),	       \
+				       target_size)			       \
+				+ OFF);					       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_NESTED_FIELD(S, NS, F, NF)			       \
+	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
+					     BPF_FIELD_SIZEOF(NS, NF), 0)
+
+/* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to
+ * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation.
+ *
+ * It doesn't support SIZE argument though since narrow stores are not
+ * supported for now.
+ *
+ * In addition it uses Temporary Field TF (member of struct S) as the 3rd
+ * "register" since two registers available in convert_ctx_access are not
+ * enough: we can't override neither SRC, since it contains value to store, nor
+ * DST since it contains pointer to context that may be used by later
+ * instructions. But we need a temporary place to save pointer to nested
+ * structure whose field we want to store to.
+ */
+#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, TF)		       \
+	do {								       \
+		int tmp_reg = BPF_REG_9;				       \
+		if (si->src_reg == tmp_reg || si->dst_reg == tmp_reg)	       \
+			--tmp_reg;					       \
+		if (si->src_reg == tmp_reg || si->dst_reg == tmp_reg)	       \
+			--tmp_reg;					       \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, tmp_reg,	       \
+				      offsetof(S, TF));			       \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), tmp_reg,	       \
+				      si->dst_reg, offsetof(S, F));	       \
+		*insn++ = BPF_STX_MEM(					       \
+			BPF_FIELD_SIZEOF(NS, NF), tmp_reg, si->src_reg,	       \
+			bpf_target_off(NS, NF, FIELD_SIZEOF(NS, NF),	       \
+				       target_size)			       \
+				+ OFF);					       \
+		*insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, si->dst_reg,	       \
+				      offsetof(S, TF));			       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, SIZE, OFF, \
+						      TF)		       \
+	do {								       \
+		if (type == BPF_WRITE) {				       \
+			SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF,    \
+							 TF);		       \
+		} else {						       \
+			SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(		       \
+				S, NS, F, NF, SIZE, OFF);  \
+		}							       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD(S, NS, F, NF, TF)		       \
+	SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(			       \
+		S, NS, F, NF, BPF_FIELD_SIZEOF(NS, NF), 0, TF)
+
+static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock_addr, user_family):
+		SOCK_ADDR_LOAD_NESTED_FIELD(struct bpf_sock_addr_kern,
+					    struct sockaddr, uaddr, sa_family);
+		break;
+
+	case offsetof(struct bpf_sock_addr, user_ip4):
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sockaddr_in, uaddr,
+			sin_addr, BPF_SIZE(si->code), 0, tmp_reg);
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
+		off = si->off;
+		off -= offsetof(struct bpf_sock_addr, user_ip6[0]);
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sockaddr_in6, uaddr,
+			sin6_addr.s6_addr32[0], BPF_SIZE(si->code), off,
+			tmp_reg);
+		break;
+
+	case offsetof(struct bpf_sock_addr, user_port):
+		/* To get port we need to know sa_family first and then treat
+		 * sockaddr as either sockaddr_in or sockaddr_in6.
+		 * Though we can simplify since port field has same offset and
+		 * size in both structures.
+		 * Here we check this invariant and use just one of the
+		 * structures if it's true.
+		 */
+		BUILD_BUG_ON(offsetof(struct sockaddr_in, sin_port) !=
+			     offsetof(struct sockaddr_in6, sin6_port));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sockaddr_in, sin_port) !=
+			     FIELD_SIZEOF(struct sockaddr_in6, sin6_port));
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD(struct bpf_sock_addr_kern,
+						     struct sockaddr_in6, uaddr,
+						     sin6_port, tmp_reg);
+		break;
+
+	case offsetof(struct bpf_sock_addr, family):
+		SOCK_ADDR_LOAD_NESTED_FIELD(struct bpf_sock_addr_kern,
+					    struct sock, sk, sk_family);
+		break;
+
+	case offsetof(struct bpf_sock_addr, type):
+		SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sock, sk,
+			__sk_flags_offset, BPF_W, 0);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
+		break;
+
+	case offsetof(struct bpf_sock_addr, protocol):
+		SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sock, sk,
+			__sk_flags_offset, BPF_W, 0);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
+					SK_FL_PROTO_SHIFT);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				       const struct bpf_insn *si,
 				       struct bpf_insn *insn_buf,
@@ -4849,6 +5073,24 @@ const struct bpf_verifier_ops cg_sock_verifier_ops = {
 const struct bpf_prog_ops cg_sock_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_bind_verifier_ops = {
+	.get_func_proto		= inet_bind_func_proto,
+	.is_valid_access	= sock_addr4_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_bind_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_bind_verifier_ops = {
+	.get_func_proto		= inet_bind_func_proto,
+	.is_valid_access	= sock_addr6_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_bind_prog_ops = {
+};
+
 const struct bpf_verifier_ops sock_ops_verifier_ops = {
 	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e8c7fad8c329..2dec266507dc 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -450,6 +450,13 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (addr_len < sizeof(struct sockaddr_in))
 		goto out;
 
+	/* BPF prog is run before any checks are done so that if the prog
+	 * changes context in a wrong way it will be caught.
+	 */
+	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
+	if (err)
+		goto out;
+
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
 		 * only if s_addr is INADDR_ANY.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index dbbe04018813..fa24e3f06ac6 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -295,6 +295,13 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (addr_len < SIN6_LEN_RFC2133)
 		return -EINVAL;
 
+	/* BPF prog is run before any checks are done so that if the prog
+	 * changes context in a wrong way it will be caught.
+	 */
+	err = BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr);
+	if (err)
+		return err;
+
 	if (addr->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
@ 2018-03-14  6:21   ` Eric Dumazet
  2018-03-14 18:00     ` Alexei Starovoitov
  2018-03-14 14:37   ` Daniel Borkmann
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2018-03-14  6:21 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, netdev, kernel-team



On 03/13/2018 08:39 PM, Alexei Starovoitov wrote:
> From: Andrey Ignatov <rdna@fb.com>
> 
> == The problem ==
> 
> There is a use-case when all processes inside a cgroup should use one
> single IP address on a host that has multiple IP configured.  Those
> processes should use the IP for both ingress and egress, for TCP and UDP
> traffic. So TCP/UDP servers should be bound to that IP to accept
> incoming connections on it, and TCP/UDP clients should make outgoing
> connections from that IP. It should not require changing application
> code since it's often not possible.
> 
> Currently it's solved by intercepting glibc wrappers around syscalls
> such as `bind(2)` and `connect(2)`. It's done by a shared library that
> is preloaded for every process in a cgroup so that whenever TCP/UDP
> server calls `bind(2)`, the library replaces IP in sockaddr before
> passing arguments to syscall. When application calls `connect(2)` the
> library transparently binds the local end of connection to that IP
> (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
> 
> Shared library approach is fragile though, e.g.:
> * some applications clear env vars (incl. `LD_PRELOAD`);
> * `/etc/ld.so.preload` doesn't help since some applications are linked
>    with option `-z nodefaultlib`;
> * other applications don't use glibc and there is nothing to intercept.
> 
> == The solution ==
> 
> The patch provides much more reliable in-kernel solution for the 1st
> part of the problem: binding TCP/UDP servers on desired IP. It does not
> depend on application environment and implementation details (whether
> glibc is used or not).
>


If I understand well,  strace(1) will not show the real (after 
modification by eBPF) IP/port ?

What about selinux and other LSM ?

We have now network namespaces for full isolation. Soon ILA will come.

The argument that it is not convenient (or even possible) to change the 
application or using modern isolation is quite strange, considering the 
added burden/complexity/bloat to the kernel.

The post hook for sys_bind is clearly a failure of the model, since 
releasing the port might already be too late, another thread might fail 
to get it during a non zero time window.
It seems this is exactly the case where a netns would be the correct answer.


If you want to provide an alternate port allocation strategy, better 
provide a correct eBPF hook.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14  6:21   ` Eric Dumazet
@ 2018-03-14 18:00     ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexei Starovoitov, davem, daniel, netdev, kernel-team

On Tue, Mar 13, 2018 at 11:21:08PM -0700, Eric Dumazet wrote:
> 
> If I understand well,  strace(1) will not show the real (after modification
> by eBPF) IP/port ?

correct. Just like it won't show anything after syscall entry, whether
lsm acted, seccomp, etc

> What about selinux and other LSM ?

clearly lsm is not place to do ip/port enforcement for containers.
lsm in general is missing post-bind lsm hook and visibility in cgroups.
This patch set is not about policy, but more about connectivity.
That's why sockaddr rewrite is must have.

> We have now network namespaces for full isolation. Soon ILA will come.

we're already using a form of ila. That's orthogonal to this feature.

> The argument that it is not convenient (or even possible) to change the
> application or using modern isolation is quite strange, considering the

just like any other datacenter there are thousands of third party
applications that we cannot control. Including open source code
written by google. Would golang switch to use glibc? I very much doubt.
Statically linked apps also don't work with ld_preload.

> added burden/complexity/bloat to the kernel.

bloat? that's very odd to hear. bpf is very much anti-bloat technique.
If you were serious with that comment, please argue with tracing folks
who add thousand upon thousand lines of code to the kernel to do
hard coded things while bpf already does all that and more
without any extra kernel code.

> The post hook for sys_bind is clearly a failure of the model, since
> releasing the port might already be too late, another thread might fail to
> get it during a non zero time window.

I suspect commit log wasn't clear. In post-bind hook we don't release
the port, we only fail sys_bind and user space will eventually close
the socket and release the port.
I don't think it's safe to call inet_put_port() here. It is also
racy as you pointed out.

> If you want to provide an alternate port allocation strategy, better provide
> a correct eBPF hook.

right. that's another separate work indepedent from this feature.
port allocation/free from bpf via helper is also necessary, but
for different use case.

> It seems this is exactly the case where a netns would be the correct answer.

Unfortuantely that's not the case. That's what I tried to explain
in the cover letter:
"The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable."
To elaborate more on that:
netns is l2 isolation.
vrf is l3 isolation.
whereas to containerize an application we need to punch connectivity holes
in these layered techniques.
We also considered resurrecting Hannes's afnetns work
and even went as far as designing a new namespace for L4 isolation.
Unfortunately all hierarchical namespace abstraction don't work.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature that you simply cannot do
with hierarchical networking primitives.

btw the per-container policy enforcement of ip+port via these hooks
wasn't our planned feature. It was requested by other folks and
we had to tweak the api a little bit to satisfy ours and theirs requirement.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
  2018-03-14  6:21   ` Eric Dumazet
@ 2018-03-14 14:37   ` Daniel Borkmann
  2018-03-14 14:55     ` Daniel Borkmann
  2018-03-14 18:11     ` Alexei Starovoitov
  1 sibling, 2 replies; 23+ messages in thread
From: Daniel Borkmann @ 2018-03-14 14:37 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: netdev, kernel-team

On 03/14/2018 04:39 AM, Alexei Starovoitov wrote:
[...]
> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
> +	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
> +
> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
> +	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
> +
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
>  ({									       \
>  	int __ret = 0;							       \
> @@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>  
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 19b8349a3809..eefd877f8e68 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index fdb691b520c0..fe469320feab 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
>  	return SKF_AD_MAX;
>  }
>  
> +struct bpf_sock_addr_kern {
> +	struct sock *sk;
> +	struct sockaddr *uaddr;
> +	/* Temporary "register" to make indirect stores to nested structures
> +	 * defined above. We need three registers to make such a store, but
> +	 * only two (src and dst) are available at convert_ctx_access time
> +	 */
> +	u64 tmp_reg;
> +};
> +
>  struct bpf_sock_ops_kern {
>  	struct	sock *sk;
>  	u32	op;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a66769e5875..78628a3f3cd8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_SOCK_OPS,
>  	BPF_PROG_TYPE_SK_SKB,
>  	BPF_PROG_TYPE_CGROUP_DEVICE,
> +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
> +	BPF_PROG_TYPE_CGROUP_INET6_BIND,

Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
confused by the many sock_*/sk_* prog types we have. The attach hook could
still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
storing some prog-type specific void *private_data in prog's aux during
verification could be a way (similarly as you mention) which can later be
retrieved at attach time to reject with -ENOTSUPP or such.

>  };
>  
>  enum bpf_attach_type {
> @@ -143,6 +145,8 @@ enum bpf_attach_type {
>  	BPF_SK_SKB_STREAM_PARSER,
>  	BPF_SK_SKB_STREAM_VERDICT,
>  	BPF_CGROUP_DEVICE,
> +	BPF_CGROUP_INET4_BIND,
> +	BPF_CGROUP_INET6_BIND,

Binding to v4 mapped v6 address would work as well, right? Can't this be
squashed into one attach type as mentioned?

>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -953,6 +957,26 @@ struct bpf_map_info {
>  	__u64 netns_ino;
>  } __attribute__((aligned(8)));
>  
> +/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
> + * by user and intended to be used by socket (e.g. to bind to, depends on
> + * attach attach type).
> + */
> +struct bpf_sock_addr {
> +	__u32 user_family;	/* Allows 4-byte read, but no write. */
> +	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
> +				 * Stored in network byte order.
> +				 */
> +	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
> +				 * Stored in network byte order.
> +				 */
> +	__u32 user_port;	/* Allows 4-byte read and write.
> +				 * Stored in network byte order
> +				 */
> +	__u32 family;		/* Allows 4-byte read, but no write */
> +	__u32 type;		/* Allows 4-byte read, but no write */
> +	__u32 protocol;		/* Allows 4-byte read, but no write */

I recall bind to prefix came up from time to time in BPF context in the sense
to let the app itself be more flexible to bind to BPF prog. Do you see also app
to be able to add a BPF prog into the array itself?

> +};
> +
>  /* User bpf_sock_ops struct to access socket values and specify request ops
>   * and their replies.
>   * Some of this fields are in network (bigendian) byte order and may need
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index c1c0b60d3f2f..78ef086a7c2d 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -495,6 +495,42 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
>  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>  
>  /**
> + * __cgroup_bpf_run_filter_sock_addr() - Run a program on a sock and
> + *                                       provided by user sockaddr
> + * @sk: sock struct that will use sockaddr
> + * @uaddr: sockaddr struct provided by user
> + * @type: The type of program to be exectuted
> + *
> + * socket is expected to be of type INET or INET6.
> + *
> + * This function will return %-EPERM if an attached program is found and
> + * returned value != 1 during execution. In all other cases, 0 is returned.
> + */
> +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> +				      struct sockaddr *uaddr,
> +				      enum bpf_attach_type type)
> +{
> +	struct bpf_sock_addr_kern ctx = {
> +		.sk = sk,
> +		.uaddr = uaddr,
> +	};
> +	struct cgroup *cgrp;
> +	int ret;
> +
> +	/* Check socket family since not all sockets represent network
> +	 * endpoint (e.g. AF_UNIX).
> +	 */
> +	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> +		return 0;
> +
> +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> +
> +	return ret == 1 ? 0 : -EPERM;

Semantics may be a little bit strange, though this would perhaps be at the risk
of the orchestrator(s) (?). Basically when you run through the prog array, then
the last attached program in that array has the final /real/ say to which address
to bind/connect to; all the others decisions can freely be overridden, so this
is dependent on the order the BPF progs how they were attached. I think we don't
have this case for context in multi-prog tracing, cgroup/inet (only filtering)
and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can already
override the sock_ops.reply (and sk_txhash which may be less relevant) field from
the ctx, so whatever one prog is assumed to reply back to the caller, another one
could override it. Wouldn't it make more sense to just have a single prog instead
to avoid this override/ordering issue?

> +}
> +EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_addr);
> +
> +/**
>   * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
>   * @sk: socket to get cgroup from
>   * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e24aa3241387..7f86542aa42c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1376,6 +1376,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	case BPF_CGROUP_INET_SOCK_CREATE:
>  		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
>  		break;
> +	case BPF_CGROUP_INET4_BIND:
> +		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
> +		break;
> +	case BPF_CGROUP_INET6_BIND:
> +		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
> +		break;
>  	case BPF_CGROUP_SOCK_OPS:
>  		ptype = BPF_PROG_TYPE_SOCK_OPS;
>  		break;
> @@ -1431,6 +1437,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  	case BPF_CGROUP_INET_SOCK_CREATE:
>  		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
>  		break;
> +	case BPF_CGROUP_INET4_BIND:
> +		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
> +		break;
> +	case BPF_CGROUP_INET6_BIND:
> +		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
> +		break;
>  	case BPF_CGROUP_SOCK_OPS:
>  		ptype = BPF_PROG_TYPE_SOCK_OPS;
>  		break;
> @@ -1478,6 +1490,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>  	case BPF_CGROUP_INET_INGRESS:
>  	case BPF_CGROUP_INET_EGRESS:
>  	case BPF_CGROUP_INET_SOCK_CREATE:
> +	case BPF_CGROUP_INET4_BIND:
> +	case BPF_CGROUP_INET6_BIND:
>  	case BPF_CGROUP_SOCK_OPS:
>  	case BPF_CGROUP_DEVICE:
>  		break;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index eb79a34359c0..01b54afcb762 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3872,6 +3872,8 @@ static int check_return_code(struct bpf_verifier_env *env)
>  	switch (env->prog->type) {
>  	case BPF_PROG_TYPE_CGROUP_SKB:
>  	case BPF_PROG_TYPE_CGROUP_SOCK:
> +	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
> +	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
>  	case BPF_PROG_TYPE_SOCK_OPS:
>  	case BPF_PROG_TYPE_CGROUP_DEVICE:
>  		break;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14 14:37   ` Daniel Borkmann
@ 2018-03-14 14:55     ` Daniel Borkmann
  2018-03-14 18:11     ` Alexei Starovoitov
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Borkmann @ 2018-03-14 14:55 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: netdev, kernel-team

On 03/14/2018 03:37 PM, Daniel Borkmann wrote:
> On 03/14/2018 04:39 AM, Alexei Starovoitov wrote:
> [...]
>> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
>> +	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
>> +
>> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
>> +	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
>> +
>>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
>>  ({									       \
>>  	int __ret = 0;							       \
>> @@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
>>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
>> +#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
>> +#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>>  
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index 19b8349a3809..eefd877f8e68 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
>>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index fdb691b520c0..fe469320feab 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
>>  	return SKF_AD_MAX;
>>  }
>>  
>> +struct bpf_sock_addr_kern {
>> +	struct sock *sk;
>> +	struct sockaddr *uaddr;
>> +	/* Temporary "register" to make indirect stores to nested structures
>> +	 * defined above. We need three registers to make such a store, but
>> +	 * only two (src and dst) are available at convert_ctx_access time
>> +	 */
>> +	u64 tmp_reg;
>> +};
>> +
>>  struct bpf_sock_ops_kern {
>>  	struct	sock *sk;
>>  	u32	op;
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2a66769e5875..78628a3f3cd8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>  	BPF_PROG_TYPE_SOCK_OPS,
>>  	BPF_PROG_TYPE_SK_SKB,
>>  	BPF_PROG_TYPE_CGROUP_DEVICE,
>> +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
>> +	BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.
> 
>>  };
>>  
>>  enum bpf_attach_type {
>> @@ -143,6 +145,8 @@ enum bpf_attach_type {
>>  	BPF_SK_SKB_STREAM_PARSER,
>>  	BPF_SK_SKB_STREAM_VERDICT,
>>  	BPF_CGROUP_DEVICE,
>> +	BPF_CGROUP_INET4_BIND,
>> +	BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?
> 
>>  	__MAX_BPF_ATTACH_TYPE
>>  };
>>  
>> @@ -953,6 +957,26 @@ struct bpf_map_info {
>>  	__u64 netns_ino;
>>  } __attribute__((aligned(8)));
>>  
>> +/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
>> + * by user and intended to be used by socket (e.g. to bind to, depends on
>> + * attach attach type).
>> + */
>> +struct bpf_sock_addr {
>> +	__u32 user_family;	/* Allows 4-byte read, but no write. */
>> +	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
>> +				 * Stored in network byte order.
>> +				 */
>> +	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
>> +				 * Stored in network byte order.
>> +				 */
>> +	__u32 user_port;	/* Allows 4-byte read and write.
>> +				 * Stored in network byte order
>> +				 */
>> +	__u32 family;		/* Allows 4-byte read, but no write */
>> +	__u32 type;		/* Allows 4-byte read, but no write */
>> +	__u32 protocol;		/* Allows 4-byte read, but no write */
> 
> I recall bind to prefix came up from time to time in BPF context in the sense
> to let the app itself be more flexible to bind to BPF prog. Do you see also app
> to be able to add a BPF prog into the array itself?
> 
>> +};
>> +
>>  /* User bpf_sock_ops struct to access socket values and specify request ops
>>   * and their replies.
>>   * Some of this fields are in network (bigendian) byte order and may need
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index c1c0b60d3f2f..78ef086a7c2d 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -495,6 +495,42 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
>>  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>>  
>>  /**
>> + * __cgroup_bpf_run_filter_sock_addr() - Run a program on a sock and
>> + *                                       provided by user sockaddr
>> + * @sk: sock struct that will use sockaddr
>> + * @uaddr: sockaddr struct provided by user
>> + * @type: The type of program to be exectuted
>> + *
>> + * socket is expected to be of type INET or INET6.
>> + *
>> + * This function will return %-EPERM if an attached program is found and
>> + * returned value != 1 during execution. In all other cases, 0 is returned.
>> + */
>> +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
>> +				      struct sockaddr *uaddr,
>> +				      enum bpf_attach_type type)
>> +{
>> +	struct bpf_sock_addr_kern ctx = {
>> +		.sk = sk,
>> +		.uaddr = uaddr,
>> +	};
>> +	struct cgroup *cgrp;
>> +	int ret;
>> +
>> +	/* Check socket family since not all sockets represent network
>> +	 * endpoint (e.g. AF_UNIX).
>> +	 */
>> +	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
>> +		return 0;
>> +
>> +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
>> +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
>> +
>> +	return ret == 1 ? 0 : -EPERM;
> 
> Semantics may be a little bit strange, though this would perhaps be at the risk
> of the orchestrator(s) (?). Basically when you run through the prog array, then
> the last attached program in that array has the final /real/ say to which address
> to bind/connect to; all the others decisions can freely be overridden, so this
> is dependent on the order the BPF progs how they were attached. I think we don't
> have this case for context in multi-prog tracing, cgroup/inet (only filtering)
> and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can already
> override the sock_ops.reply (and sk_txhash which may be less relevant) field from
> the ctx, so whatever one prog is assumed to reply back to the caller, another one
> could override it. Wouldn't it make more sense to just have a single prog instead
> to avoid this override/ordering issue?

Sigh, scratch that last thought ... lack of coffee, it all depends on where in the
cgroup hierarchy you are to be able to override of course.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14 14:37   ` Daniel Borkmann
  2018-03-14 14:55     ` Daniel Borkmann
@ 2018-03-14 18:11     ` Alexei Starovoitov
  2018-03-14 23:27       ` Daniel Borkmann
  1 sibling, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:11 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, davem, netdev, kernel-team

On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -133,6 +133,8 @@ enum bpf_prog_type {
> >  	BPF_PROG_TYPE_SOCK_OPS,
> >  	BPF_PROG_TYPE_SK_SKB,
> >  	BPF_PROG_TYPE_CGROUP_DEVICE,
> > +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
> > +	BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.

that's exacly what I mentioned in the cover letter,
but we need to extend attach cmd with verifier-like log_buf+log_size.
since simple enotsupp will be impossible to debug.
That's the main question of the RFC.

> >  };
> >  
> >  enum bpf_attach_type {
> > @@ -143,6 +145,8 @@ enum bpf_attach_type {
> >  	BPF_SK_SKB_STREAM_PARSER,
> >  	BPF_SK_SKB_STREAM_VERDICT,
> >  	BPF_CGROUP_DEVICE,
> > +	BPF_CGROUP_INET4_BIND,
> > +	BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?

explained the reasons for this in the cover letter and proposed extension
to attach cmd.

> > +struct bpf_sock_addr {
> > +	__u32 user_family;	/* Allows 4-byte read, but no write. */
> > +	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
> > +				 * Stored in network byte order.
> > +				 */
> > +	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
> > +				 * Stored in network byte order.
> > +				 */
> > +	__u32 user_port;	/* Allows 4-byte read and write.
> > +				 * Stored in network byte order
> > +				 */
> > +	__u32 family;		/* Allows 4-byte read, but no write */
> > +	__u32 type;		/* Allows 4-byte read, but no write */
> > +	__u32 protocol;		/* Allows 4-byte read, but no write */
> 
> I recall bind to prefix came up from time to time in BPF context in the sense
> to let the app itself be more flexible to bind to BPF prog. Do you see also app
> to be able to add a BPF prog into the array itself?

I'm not following. In this case the container management framework
will attach bpf progs to cgroups and apps inside the cgroups will
get their bind/connects overwritten when necessary.

> > +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > +				      struct sockaddr *uaddr,
> > +				      enum bpf_attach_type type)
> > +{
> > +	struct bpf_sock_addr_kern ctx = {
> > +		.sk = sk,
> > +		.uaddr = uaddr,
> > +	};
> > +	struct cgroup *cgrp;
> > +	int ret;
> > +
> > +	/* Check socket family since not all sockets represent network
> > +	 * endpoint (e.g. AF_UNIX).
> > +	 */
> > +	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> > +		return 0;
> > +
> > +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> > +
> > +	return ret == 1 ? 0 : -EPERM;
> 
> Semantics may be a little bit strange, though this would perhaps be at the risk
> of the orchestrator(s) (?). Basically when you run through the prog array, then
> the last attached program in that array has the final /real/ say to which address
> to bind/connect to; all the others decisions can freely be overridden, so this
> is dependent on the order the BPF progs how they were attached. I think we don't
> have this case for context in multi-prog tracing, cgroup/inet (only filtering)
> and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can already
> override the sock_ops.reply (and sk_txhash which may be less relevant) field from
> the ctx, so whatever one prog is assumed to reply back to the caller, another one
> could override it. 

correct. tcp-bpf is in the same boat. When progs override the decision the last
prog in the prog_run_array is effective. Remember that
 * The programs of sub-cgroup are executed first, then programs of
 * this cgroup and then programs of parent cgroup.
so outer cgroup controlled by container management is running last.
If it would want to let children do nested overwrittes it could look at the same
sockaddr memory region and will see what children's prog or children's tasks
did with sockaddr and make approriate decision.

> Wouldn't it make more sense to just have a single prog instead
> to avoid this override/ordering issue?

I don't think there is any ordering issue, but yes, if parent is paranoid
it can install no-override program on the cgroup. Which is default anyway.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14 18:11     ` Alexei Starovoitov
@ 2018-03-14 23:27       ` Daniel Borkmann
  2018-03-15  0:29         ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Borkmann @ 2018-03-14 23:27 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, davem, netdev, kernel-team

On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>>  	BPF_PROG_TYPE_SOCK_OPS,
>>>  	BPF_PROG_TYPE_SK_SKB,
>>>  	BPF_PROG_TYPE_CGROUP_DEVICE,
>>> +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>> +	BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>
>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>> storing some prog-type specific void *private_data in prog's aux during
>> verification could be a way (similarly as you mention) which can later be
>> retrieved at attach time to reject with -ENOTSUPP or such.
> 
> that's exacly what I mentioned in the cover letter,
> but we need to extend attach cmd with verifier-like log_buf+log_size.
> since simple enotsupp will be impossible to debug.

Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
for this case, but it's the usual problem where getting a std error code
is like looking into a crystal ball to figure where it's coming from. I'd see
only couple of other alternatives: distinct error code like ENAVAIL for such
mismatch, or telling the verifier upfront where this is going to be attached
to - same as we do with the dev for offloading or as you did with splitting
the prog types or some sort of notion of sub-prog types; or having a query
API that returns possible/compatible attach types for this loaded prog via
e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
occurs. All nothing really nice, though.

Making verifier-like log_buf + log_size generic meaning not just for the case
of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
cmd, but either that might require adding a BPF related pointer to task struct
for this or any other future BPF feature (maybe not really an option); or to
have some sort of bpf cmd to config and obtain an fd for error queue/log once,
where this can then be retrieved from and used for all cmds generically.

[...]
>>> +struct bpf_sock_addr {
>>> +	__u32 user_family;	/* Allows 4-byte read, but no write. */
>>> +	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
>>> +				 * Stored in network byte order.
>>> +				 */
>>> +	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
>>> +				 * Stored in network byte order.
>>> +				 */
>>> +	__u32 user_port;	/* Allows 4-byte read and write.
>>> +				 * Stored in network byte order
>>> +				 */
>>> +	__u32 family;		/* Allows 4-byte read, but no write */
>>> +	__u32 type;		/* Allows 4-byte read, but no write */
>>> +	__u32 protocol;		/* Allows 4-byte read, but no write */
>>
>> I recall bind to prefix came up from time to time in BPF context in the sense
>> to let the app itself be more flexible to bind to BPF prog. Do you see also app
>> to be able to add a BPF prog into the array itself?
> 
> I'm not following. In this case the container management framework
> will attach bpf progs to cgroups and apps inside the cgroups will
> get their bind/connects overwritten when necessary.

Was mostly just thinking whether it could also cover the use case that was
brought up from time to time e.g.:

  https://www.mail-archive.com/netdev@vger.kernel.org/msg100914.html

Seems like it would potentially be on top of that, plus having an option to
attach from within the app instead of orchestrator.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14 23:27       ` Daniel Borkmann
@ 2018-03-15  0:29         ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-15  0:29 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Alexei Starovoitov, davem, netdev, kernel-team

On 3/14/18 4:27 PM, Daniel Borkmann wrote:
> On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
>> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>>>  	BPF_PROG_TYPE_SOCK_OPS,
>>>>  	BPF_PROG_TYPE_SK_SKB,
>>>>  	BPF_PROG_TYPE_CGROUP_DEVICE,
>>>> +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>>> +	BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>>
>>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>>> storing some prog-type specific void *private_data in prog's aux during
>>> verification could be a way (similarly as you mention) which can later be
>>> retrieved at attach time to reject with -ENOTSUPP or such.
>>
>> that's exacly what I mentioned in the cover letter,
>> but we need to extend attach cmd with verifier-like log_buf+log_size.
>> since simple enotsupp will be impossible to debug.
>
> Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
> for this case, but it's the usual problem where getting a std error code
> is like looking into a crystal ball to figure where it's coming from. I'd see
> only couple of other alternatives: distinct error code like ENAVAIL for such
> mismatch, or telling the verifier upfront where this is going to be attached
> to - same as we do with the dev for offloading or as you did with splitting
> the prog types or some sort of notion of sub-prog types; or having a query
> API that returns possible/compatible attach types for this loaded prog via
> e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
> occurs. All nothing really nice, though.

query after loading isn't great, since possible attach combinations
will be too high for human to understand,
but I like "passing attach_type into prog_load" idea.
That should work and it fits existing prog_ifindex too.
So we'll add '__u32 attach_type' to prog_load cmd.
elf loader would still need to parse section name to
figure out prog type and attach type.
Something like:
SEC("sock_addr/bind_v4") my_prog(struct bpf_sock_addr *ctx)
SEC("sock_addr/connect_v6") my_prog(struct bpf_sock_addr *ctx)
We still need new prog type for bind_v4/bind_v6/connect_v4/connect_v6
hooks with distinct 'struct bpf_sock_addr' context,
since the prog is accessing both sockaddr and sock.
Adding user_ip4, user_ip6 fields to 'struct bpf_sock_ops'
is doable, but it would be too confusing to users, so imo that's
not a good option.

For post-bind hook we probably can reuse 'struct bpf_sock_ops'
and BPF_PROG_TYPE_SOCK_OPS, since there only sock is the context.

> Making verifier-like log_buf + log_size generic meaning not just for the case
> of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
> have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
> cmd, but either that might require adding a BPF related pointer to task struct
> for this or any other future BPF feature (maybe not really an option); or to
> have some sort of bpf cmd to config and obtain an fd for error queue/log once,
> where this can then be retrieved from and used for all cmds generically.

I don't think we want to hold on to error logs in the kernel,
since user may not query it right away or ever.
verifier log is freed right after prog_load cmd is done.

> Seems like it would potentially be on top of that, plus having an option to
> attach from within the app instead of orchestrator.

right. let's worry about it as potential next step.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind Alexei Starovoitov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

Add selftest to work with bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_INET4_BIND` and BPF_PROG_TYPE_CGROUP_INET6_BIND`
programs.

Try to bind(2) on IP:port and apply:
* loads to make sure context can be read correctly, including narrow
  loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by verifier.

All combination from IPv4/IPv6 and TCP/UDP are tested.

Test passes when expected data can be read from context in the
BPF-program, and after the call to bind(2) socket is bound to IP:port
pair that was written by BPF-program to the context.

Example:
  # ./test_sock_addr
  Attached bind4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Attached bind6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  ### SUCCESS

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h               |  24 ++
 tools/testing/selftests/bpf/Makefile         |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.c | 463 +++++++++++++++++++++++++++
 3 files changed, 489 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db6bdc375126..a6af06bb5efb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -133,6 +133,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_CGROUP_INET4_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 };
 
 enum bpf_attach_type {
@@ -143,6 +145,8 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -952,6 +956,26 @@ struct bpf_map_info {
 	__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
  * Some of this fields are in network (bigendian) byte order and may need
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8567a858b789..f319b67fd0f6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -15,7 +15,7 @@ LDLIBS += -lcap -lelf -lrt -lpthread
 
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
-	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
+	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user test_sock_addr
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
@@ -42,6 +42,7 @@ $(TEST_GEN_PROGS): $(BPFOBJ)
 $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
+$(OUTPUT)/test_sock_addr: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
new file mode 100644
index 000000000000..18ea250484dc
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -0,0 +1,463 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include <linux/filter.h>
+
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+
+#define CG_PATH	"/foo"
+
+#define SERV4_IP		"192.168.1.254"
+#define SERV4_REWRITE_IP	"127.0.0.1"
+#define SERV4_PORT		4040
+#define SERV4_REWRITE_PORT	4444
+
+#define SERV6_IP		"face:b00c:1234:5678::abcd"
+#define SERV6_REWRITE_IP	"::1"
+#define SERV6_PORT		6060
+#define SERV6_REWRITE_PORT	6666
+
+#define INET_NTOP_BUF	40
+
+typedef int (*load_fn)(void);
+typedef int (*info_fn)(int, struct sockaddr *, socklen_t *);
+
+struct program {
+	enum bpf_attach_type type;
+	load_fn	loadfn;
+	int fd;
+	const char *name;
+};
+
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+
+static int mk_sockaddr(int domain, const char *ip, unsigned short port,
+		       struct sockaddr *addr, socklen_t addr_len)
+{
+	struct sockaddr_in6 *addr6;
+	struct sockaddr_in *addr4;
+
+	if (domain != AF_INET && domain != AF_INET6) {
+		log_err("Unsupported address family");
+		return -1;
+	}
+
+	memset(addr, 0, addr_len);
+
+	if (domain == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return -1;
+		addr4 = (struct sockaddr_in *)addr;
+		addr4->sin_family = domain;
+		addr4->sin_port = htons(port);
+		if (inet_pton(domain, ip, (void *)&addr4->sin_addr) != 1) {
+			log_err("Invalid IPv4: %s", ip);
+			return -1;
+		}
+	} else if (domain == AF_INET6) {
+		if (addr_len < sizeof(struct sockaddr_in6))
+			return -1;
+		addr6 = (struct sockaddr_in6 *)addr;
+		addr6->sin6_family = domain;
+		addr6->sin6_port = htons(port);
+		if (inet_pton(domain, ip, (void *)&addr6->sin6_addr) != 1) {
+			log_err("Invalid IPv6: %s", ip);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int load_insns(enum bpf_prog_type type, const struct bpf_insn *insns,
+		      size_t insns_cnt, const char *comment)
+{
+	int ret;
+
+	ret = bpf_load_program(type, insns, insns_cnt, "GPL", 0, bpf_log_buf,
+			       BPF_LOG_BUF_SIZE);
+	if (ret < 0) {
+		log_err(">>> Loading %s program error.\n"
+			">>> Output from verifier:\n%s\n-------\n",
+			comment, bpf_log_buf);
+	}
+
+	return ret;
+}
+
+/* [1] These testing programs try to read different context fields, including
+ * narrow loads of different sizes from user_ip4 and user_ip6, and write to
+ * those allowed to be overridden.
+ *
+ * [2] BPF_LD_IMM64 & BPF_JMP_REG are used below whenever there is a need to
+ * compare a register with unsigned 32bit integer. BPF_JMP_IMM can't be used
+ * in such cases since it accepts only _signed_ 32bit integer as IMM
+ * argument. Also note that BPF_LD_IMM64 contains 2 instructions what matters
+ * to count jumps properly.
+ */
+
+static int bind4_prog_load(void)
+{
+	union {
+		uint8_t u4_addr8[4];
+		uint16_t u4_addr16[2];
+		uint32_t u4_addr32;
+	} ip4;
+	struct sockaddr_in addr4_rw;
+
+	if (inet_pton(AF_INET, SERV4_IP, (void *)&ip4) != 1) {
+		log_err("Invalid IPv4: %s", SERV4_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET, SERV4_REWRITE_IP, SERV4_REWRITE_PORT,
+			(struct sockaddr *)&addr4_rw, sizeof(addr4_rw)) == -1)
+		return -1;
+
+	/* See [1]. */
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET, 16),
+
+		/*     (sk.type == SOCK_DGRAM || sk.type == SOCK_STREAM) && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, type)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_DGRAM, 1),
+		BPF_JMP_A(1),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_STREAM, 12),
+
+		/*     1st_byte_of_user_ip4 == expected && */
+		BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip4.u4_addr8[0], 10),
+
+		/*     1st_half_of_user_ip4 == expected && */
+		BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip4.u4_addr16[0], 8),
+
+		/*     whole_user_ip4 == expected) { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+		BPF_LD_IMM64(BPF_REG_8, ip4.u4_addr32), /* See [2]. */
+		BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_8, 4),
+
+		/*      user_ip4 = addr4_rw.sin_addr */
+		BPF_MOV32_IMM(BPF_REG_7, addr4_rw.sin_addr.s_addr),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+
+		/*      user_port = addr4_rw.sin_port */
+		BPF_MOV32_IMM(BPF_REG_7, addr4_rw.sin_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(BPF_PROG_TYPE_CGROUP_INET4_BIND, insns,
+			  sizeof(insns) / sizeof(struct bpf_insn),
+			  "bind() for AF_INET");
+}
+
+static int bind6_prog_load(void)
+{
+	struct sockaddr_in6 addr6_rw;
+	struct in6_addr ip6;
+
+	if (inet_pton(AF_INET6, SERV6_IP, (void *)&ip6) != 1) {
+		log_err("Invalid IPv6: %s", SERV6_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET6, SERV6_REWRITE_IP, SERV6_REWRITE_PORT,
+			(struct sockaddr *)&addr6_rw, sizeof(addr6_rw)) == -1)
+		return -1;
+
+	/* See [1]. */
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET6 && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET6, 18),
+
+		/*            5th_byte_of_user_ip6 == expected && */
+		BPF_LDX_MEM(BPF_B, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[1])),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip6.s6_addr[4], 16),
+
+		/*            3rd_half_of_user_ip6 == expected && */
+		BPF_LDX_MEM(BPF_H, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[1])),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, ip6.s6_addr16[2], 14),
+
+		/*            last_word_of_user_ip6 == expected) { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, user_ip6[3])),
+		BPF_LD_IMM64(BPF_REG_8, ip6.s6_addr32[3]),  /* See [2]. */
+		BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_8, 10),
+
+
+#define STORE_IPV6_WORD(N)						       \
+		BPF_MOV32_IMM(BPF_REG_7, addr6_rw.sin6_addr.s6_addr32[N]),     \
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,		       \
+			    offsetof(struct bpf_sock_addr, user_ip6[N]))
+
+		/*      user_ip6 = addr6_rw.sin6_addr */
+		STORE_IPV6_WORD(0),
+		STORE_IPV6_WORD(1),
+		STORE_IPV6_WORD(2),
+		STORE_IPV6_WORD(3),
+
+		/*      user_port = addr6_rw.sin6_port */
+		BPF_MOV32_IMM(BPF_REG_7, addr6_rw.sin6_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(BPF_PROG_TYPE_CGROUP_INET6_BIND, insns,
+			  sizeof(insns) / sizeof(struct bpf_insn),
+			  "bind() for AF_INET6");
+}
+
+static void print_ip_port(int sockfd, info_fn fn, const char *fmt)
+{
+	char addr_buf[INET_NTOP_BUF];
+	struct sockaddr_storage addr;
+	struct sockaddr_in6 *addr6;
+	struct sockaddr_in *addr4;
+	socklen_t addr_len;
+	unsigned short port;
+	void *nip;
+
+	addr_len = sizeof(struct sockaddr_storage);
+	memset(&addr, 0, addr_len);
+
+	if (fn(sockfd, (struct sockaddr *)&addr, (socklen_t *)&addr_len) == 0) {
+		if (addr.ss_family == AF_INET) {
+			addr4 = (struct sockaddr_in *)&addr;
+			nip = (void *)&addr4->sin_addr;
+			port = ntohs(addr4->sin_port);
+		} else if (addr.ss_family == AF_INET6) {
+			addr6 = (struct sockaddr_in6 *)&addr;
+			nip = (void *)&addr6->sin6_addr;
+			port = ntohs(addr6->sin6_port);
+		} else {
+			return;
+		}
+		const char *addr_str =
+			inet_ntop(addr.ss_family, nip, addr_buf, INET_NTOP_BUF);
+		printf(fmt, addr_str ? addr_str : "??", port);
+	}
+}
+
+static void print_local_ip_port(int sockfd, const char *fmt)
+{
+	print_ip_port(sockfd, getsockname, fmt);
+}
+
+static int start_server(int type, const struct sockaddr_storage *addr,
+			socklen_t addr_len)
+{
+
+	int fd;
+
+	fd = socket(addr->ss_family, type, 0);
+	if (fd == -1) {
+		log_err("Failed to create server socket");
+		goto out;
+	}
+
+	if (bind(fd, (const struct sockaddr *)addr, addr_len) == -1) {
+		log_err("Failed to bind server socket");
+		goto close_out;
+	}
+
+	if (type == SOCK_STREAM) {
+		if (listen(fd, 128) == -1) {
+			log_err("Failed to listen on server socket");
+			goto close_out;
+		}
+	}
+
+	print_local_ip_port(fd, "\t   Actual: bind(%s, %d)\n");
+
+	goto out;
+close_out:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static void print_test_case_num(int domain, int type)
+{
+	static int test_num;
+
+	printf("Test case #%d (%s/%s):\n", ++test_num,
+	       (domain == AF_INET ? "IPv4" :
+		domain == AF_INET6 ? "IPv6" :
+		"unknown_domain"),
+	       (type == SOCK_STREAM ? "TCP" :
+		type == SOCK_DGRAM ? "UDP" :
+		"unknown_type"));
+}
+
+static int run_test_case(int domain, int type, const char *ip,
+			 unsigned short port)
+{
+	struct sockaddr_storage addr;
+	socklen_t addr_len = sizeof(addr);
+	int servfd = -1;
+	int err = 0;
+
+	print_test_case_num(domain, type);
+
+	if (mk_sockaddr(domain, ip, port, (struct sockaddr *)&addr,
+			addr_len) == -1)
+		return -1;
+
+	printf("\tRequested: bind(%s, %d) ..\n", ip, port);
+	servfd = start_server(type, &addr, addr_len);
+	if (servfd == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close(servfd);
+	return err;
+}
+
+static void close_progs_fds(struct program *progs, size_t prog_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < prog_cnt; ++i) {
+		close(progs[i].fd);
+		progs[i].fd = -1;
+	}
+}
+
+static int load_and_attach_progs(int cgfd, struct program *progs,
+				 size_t prog_cnt)
+{
+	size_t i;
+
+	for (i = 0; i < prog_cnt; ++i) {
+		progs[i].fd = progs[i].loadfn();
+		if (progs[i].fd == -1) {
+			log_err("Failed to load program %s", progs[i].name);
+			goto err;
+		}
+		if (bpf_prog_attach(progs[i].fd, cgfd, progs[i].type,
+				    BPF_F_ALLOW_OVERRIDE) == -1) {
+			log_err("Failed to attach program %s", progs[i].name);
+			goto err;
+		}
+		printf("Attached %s program.\n", progs[i].name);
+	}
+
+	return 0;
+err:
+	close_progs_fds(progs, prog_cnt);
+	return -1;
+}
+
+static int run_domain_test(int domain, int cgfd, struct program *progs,
+			   size_t prog_cnt, const char *ip, unsigned short port)
+{
+	int err = 0;
+
+	if (load_and_attach_progs(cgfd, progs, prog_cnt) == -1)
+		goto err;
+
+	if (run_test_case(domain, SOCK_STREAM, ip, port) == -1)
+		goto err;
+
+	if (run_test_case(domain, SOCK_DGRAM, ip, port) == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close_progs_fds(progs, prog_cnt);
+	return err;
+}
+
+static int run_test(void)
+{
+	size_t inet6_prog_cnt;
+	size_t inet_prog_cnt;
+	int cgfd = -1;
+	int err = 0;
+
+	struct program inet6_progs[] = {
+		{BPF_CGROUP_INET6_BIND, bind6_prog_load, -1, "bind6"},
+	};
+	inet6_prog_cnt = sizeof(inet6_progs) / sizeof(struct program);
+
+	struct program inet_progs[] = {
+		{BPF_CGROUP_INET4_BIND, bind4_prog_load, -1, "bind4"},
+	};
+	inet_prog_cnt = sizeof(inet_progs) / sizeof(struct program);
+
+	if (setup_cgroup_environment())
+		goto err;
+
+	cgfd = create_and_get_cgroup(CG_PATH);
+	if (!cgfd)
+		goto err;
+
+	if (join_cgroup(CG_PATH))
+		goto err;
+
+	if (run_domain_test(AF_INET, cgfd, inet_progs, inet_prog_cnt, SERV4_IP,
+			    SERV4_PORT) == -1)
+		goto err;
+
+	if (run_domain_test(AF_INET6, cgfd, inet6_progs, inet6_prog_cnt,
+			    SERV6_IP, SERV6_PORT) == -1)
+		goto err;
+
+	goto out;
+err:
+	err = -1;
+out:
+	close(cgfd);
+	cleanup_cgroup_environment();
+	printf(err ? "### FAIL\n" : "### SUCCESS\n");
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	return run_test();
+}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect Alexei Starovoitov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

Refactor `bind()` code to make it ready to be called from BPF helper
function `bpf_bind()` (will be added soon). Implementation of
`inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These function can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.

New functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only w/o checking
`inet_sock.bind_address_no_port` field. It'll allow to bind local end of
a connection to desired IP in `bpf_bind()` w/o changing
`bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
can return an error and we'd need to restore original value of
`bind_address_no_port` in that case if we changed this before calling to
the helper.

`with_lock` specifies whether to lock socket when working with `struct
sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
behavior is preserved. But it will be set to `false` for `bpf_bind()`
use-case. The reason is all call-sites, where `bpf_bind()` will be
called, already hold that socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/net/inet_common.h |  2 ++
 include/net/ipv6.h        |  2 ++
 net/ipv4/af_inet.c        | 39 ++++++++++++++++++++++++---------------
 net/ipv6/af_inet6.c       | 37 ++++++++++++++++++++++++-------------
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 500f81375200..384b90c62c0b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr,
 		 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 50a6f0ddb878..2e5fedc56e59 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);
 
 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
 		  int peer);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2dec266507dc..e203a39d6988 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -432,30 +432,37 @@ EXPORT_SYMBOL(inet_release);
 
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
-	unsigned short snum;
-	int chk_addr_ret;
-	u32 tb_id = RT_TABLE_LOCAL;
 	int err;
 
 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
-		err = sk->sk_prot->bind(sk, uaddr, addr_len);
-		goto out;
+		return sk->sk_prot->bind(sk, uaddr, addr_len);
 	}
-	err = -EINVAL;
 	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
+		return -EINVAL;
 
 	/* BPF prog is run before any checks are done so that if the prog
 	 * changes context in a wrong way it will be caught.
 	 */
 	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
 	if (err)
-		goto out;
+		return err;
+
+	return __inet_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet_bind);
+
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	u32 tb_id = RT_TABLE_LOCAL;
+	int err;
 
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
@@ -499,7 +506,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	 *      would be illegal to use them (multicast/broadcast) in
 	 *      which case the sending device address is used.
 	 */
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	err = -EINVAL;
@@ -511,7 +519,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
 		err = -EADDRINUSE;
@@ -528,11 +537,11 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	sk_dst_reset(sk);
 	err = 0;
 out_release_sock:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 out:
 	return err;
 }
-EXPORT_SYMBOL(inet_bind);
 
 int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index fa24e3f06ac6..13110bee5c14 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -277,15 +277,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 /* bind for INET6 API */
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct net *net = sock_net(sk);
-	__be32 v4addr = 0;
-	unsigned short snum;
-	bool saved_ipv6only;
-	int addr_type = 0;
 	int err = 0;
 
 	/* If the socket has its own bind function then use it. */
@@ -302,11 +294,28 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		return err;
 
+	return __inet6_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet6_bind);
+
+int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct net *net = sock_net(sk);
+	__be32 v4addr = 0;
+	unsigned short snum;
+	bool saved_ipv6only;
+	int addr_type = 0;
+	int err = 0;
+
 	if (addr->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
 	addr_type = ipv6_addr_type(&addr->sin6_addr);
-	if ((addr_type & IPV6_ADDR_MULTICAST) && sock->type == SOCK_STREAM)
+	if ((addr_type & IPV6_ADDR_MULTICAST) && sk->sk_type == SOCK_STREAM)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
@@ -314,7 +323,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	if (sk->sk_state != TCP_CLOSE || inet->inet_num) {
@@ -402,7 +412,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		sk->sk_ipv6only = 1;
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		sk->sk_ipv6only = saved_ipv6only;
 		inet_reset_saddr(sk);
@@ -418,13 +429,13 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = 0;
 	inet->inet_daddr = 0;
 out:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 	return err;
 out_unlock:
 	rcu_read_unlock();
 	goto out;
 }
-EXPORT_SYMBOL(inet6_bind);
 
 int inet6_release(struct socket *sock)
 {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2018-03-14  3:39 ` [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks Alexei Starovoitov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.

It adds new program types `BPF_PROG_TYPE_CGROUP_INET4_CONNECT` and
`BPF_PROG_TYPE_CGROUP_INET6_CONNECT` and corresponding attach types that
can be used to override both source and destination of a connection at
connect(2) time.

Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate program types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.

Program types for sys_bind itself can't be reused for sys_connect either
since there is a difference in allowed helpers to call: `bpf_bind()` is
useful to call from sys_connect hooks, but doesn't make sense in
sys_bind hooks.

== Implementation notes ==

The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.

bpf_bind()` sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.

`bpf_bind()` sets `with_lock` to `false` when calling to `__inet_bind()`
and `__inet6_bind()` since all call-sites, where `bpf_bind()` is called,
already hold socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h | 31 +++++++++++++++++++++
 include/linux/bpf_types.h  |  2 ++
 include/net/sock.h         |  3 +++
 include/net/udp.h          |  1 +
 include/uapi/linux/bpf.h   | 15 ++++++++++-
 kernel/bpf/syscall.c       | 14 ++++++++++
 kernel/bpf/verifier.c      |  2 ++
 net/core/filter.c          | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         | 13 +++++++++
 net/ipv4/tcp_ipv4.c        | 16 +++++++++++
 net/ipv4/udp.c             | 14 ++++++++++
 net/ipv6/tcp_ipv6.c        | 16 +++++++++++
 net/ipv6/udp.c             | 20 ++++++++++++++
 13 files changed, 213 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index dd0cfbddcfbe..6b5c25ef1482 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -116,12 +116,38 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type)			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)	{					       \
+		lock_sock(sk);						       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
+		release_sock(sk);					       \
+	}								       \
+	__ret;								       \
+})
+
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
 	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
 
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
 	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \
+					    sk->sk_prot->pre_connect)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
@@ -151,11 +177,16 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index eefd877f8e68..52a571827b9f 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_CONNECT, cg_inet4_connect)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_CONNECT, cg_inet6_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..997259e0ecae 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,6 +1026,9 @@ static inline void sk_prot_clear_nulls(struct sock *sk, int size)
 struct proto {
 	void			(*close)(struct sock *sk,
 					long timeout);
+	int			(*pre_connect)(struct sock *sk,
+					struct sockaddr *uaddr,
+					int addr_len);
 	int			(*connect)(struct sock *sk,
 					struct sockaddr *uaddr,
 					int addr_len);
diff --git a/include/net/udp.h b/include/net/udp.h
index 850a8e581cce..0676b272f6ac 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -273,6 +273,7 @@ void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int udp_init_sock(struct sock *sk);
+int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 78628a3f3cd8..441a674f385a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -135,6 +135,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_CGROUP_INET4_BIND,
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
+	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
 };
 
 enum bpf_attach_type {
@@ -147,6 +149,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_DEVICE,
 	BPF_CGROUP_INET4_BIND,
 	BPF_CGROUP_INET6_BIND,
+	BPF_CGROUP_INET4_CONNECT,
+	BPF_CGROUP_INET6_CONNECT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -700,6 +704,14 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_bind(ctx, addr, addr_len)
+ *     Bind socket to address. Only binding to IP is supported, no port can be
+ *     set in addr.
+ *     @ctx: pointer to context of type bpf_sock_addr
+ *     @addr: pointer to struct sockaddr to bind socket to
+ *     @addr_len: length of sockaddr structure
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -761,7 +773,8 @@ union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(bind),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7f86542aa42c..145de3332e32 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1382,6 +1382,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
+		break;
+	case BPF_CGROUP_INET6_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_CONNECT;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1443,6 +1449,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
+		break;
+	case BPF_CGROUP_INET6_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_CONNECT;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1492,6 +1504,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_SOCK_CREATE:
 	case BPF_CGROUP_INET4_BIND:
 	case BPF_CGROUP_INET6_BIND:
+	case BPF_CGROUP_INET4_CONNECT:
+	case BPF_CGROUP_INET6_CONNECT:
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 01b54afcb762..cda7830a2c1b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3874,6 +3874,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET4_CONNECT:
+	case BPF_PROG_TYPE_CGROUP_INET6_CONNECT:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 78907cf3b42f..916195b86a23 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -33,6 +33,7 @@
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/gfp.h>
+#include <net/inet_common.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/netlink.h>
@@ -3400,6 +3401,43 @@ static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_bind, struct bpf_sock_addr_kern *, ctx, struct sockaddr *, addr,
+	   int, addr_len)
+{
+	struct sock *sk = ctx->sk;
+	int err;
+
+	/* Binding to port can be expensive so it's prohibited in the helper.
+	 * Only binding to IP is supported.
+	 */
+
+	err = -EINVAL;
+	if (addr->sa_family == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return err;
+		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
+			return err;
+		return __inet_bind(sk, addr, addr_len, true, false);
+	} else if (addr->sa_family == AF_INET6) {
+		if (addr_len < SIN6_LEN_RFC2133)
+			return err;
+		if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0))
+			return err;
+		return __inet6_bind(sk, addr, addr_len, true, false);
+	}
+
+	return -EAFNOSUPPORT;
+}
+
+static const struct bpf_func_proto bpf_bind_proto = {
+	.func		= bpf_bind,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3457,6 +3495,17 @@ inet_bind_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+inet_connect_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_bind:
+		return &bpf_bind_proto;
+	default:
+		return inet_bind_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -5091,6 +5140,24 @@ const struct bpf_verifier_ops cg_inet6_bind_verifier_ops = {
 const struct bpf_prog_ops cg_inet6_bind_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_connect_verifier_ops = {
+	.get_func_proto		= inet_connect_func_proto,
+	.is_valid_access	= sock_addr4_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_connect_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_connect_verifier_ops = {
+	.get_func_proto		= inet_connect_func_proto,
+	.is_valid_access	= sock_addr6_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_connect_prog_ops = {
+};
+
 const struct bpf_verifier_ops sock_ops_verifier_ops = {
 	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e203a39d6988..488fe26ac8e5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -547,12 +547,19 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
 {
 	struct sock *sk = sock->sk;
+	int err;
 
 	if (addr_len < sizeof(uaddr->sa_family))
 		return -EINVAL;
 	if (uaddr->sa_family == AF_UNSPEC)
 		return sk->sk_prot->disconnect(sk, flags);
 
+	if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
+		err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
+		if (err)
+			return err;
+	}
+
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
 	return sk->sk_prot->connect(sk, uaddr, addr_len);
@@ -633,6 +640,12 @@ int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 		if (sk->sk_state != TCP_CLOSE)
 			goto out;
 
+		if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
+			err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
+			if (err)
+				goto out;
+		}
+
 		err = sk->sk_prot->connect(sk, uaddr, addr_len);
 		if (err < 0)
 			goto out;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c6aec2643e8..3c11d992d784 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -140,6 +140,21 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_unique);
 
+static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			      int addr_len)
+{
+	/* This check is replicated from tcp_v4_connect() and intended to
+	 * prevent BPF program called below from accessing bytes that are out
+	 * of the bound specified by user in addr_len.
+	 */
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	sock_owned_by_me(sk);
+
+	return BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr);
+}
+
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
@@ -2409,6 +2424,7 @@ struct proto tcp_prot = {
 	.name			= "TCP",
 	.owner			= THIS_MODULE,
 	.close			= tcp_close,
+	.pre_connect		= tcp_v4_pre_connect,
 	.connect		= tcp_v4_connect,
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..0cbf66deed6f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1664,6 +1664,19 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	goto try_again;
 }
 
+int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	/* This check is replicated from __ip4_datagram_connect() and
+	 * intended to prevent BPF program called below from accessing bytes
+	 * that are out of the bound specified by user in addr_len.
+	 */
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	return BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr);
+}
+EXPORT_SYMBOL(udp_pre_connect);
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -2536,6 +2549,7 @@ struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
+	.pre_connect	   = udp_pre_connect,
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5425d7b100ee..6469b741cf5a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -117,6 +117,21 @@ static u32 tcp_v6_init_ts_off(const struct net *net, const struct sk_buff *skb)
 				   ipv6_hdr(skb)->saddr.s6_addr32);
 }
 
+static int tcp_v6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			      int addr_len)
+{
+	/* This check is replicated from tcp_v6_connect() and intended to
+	 * prevent BPF program called below from accessing bytes that are out
+	 * of the bound specified by user in addr_len.
+	 */
+	if (addr_len < SIN6_LEN_RFC2133)
+		return -EINVAL;
+
+	sock_owned_by_me(sk);
+
+	return BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr);
+}
+
 static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 			  int addr_len)
 {
@@ -1925,6 +1940,7 @@ struct proto tcpv6_prot = {
 	.name			= "TCPv6",
 	.owner			= THIS_MODULE,
 	.close			= tcp_close,
+	.pre_connect		= tcp_v6_pre_connect,
 	.connect		= tcp_v6_connect,
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0e6f50..636904ca63ba 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -957,6 +957,25 @@ static void udp_v6_flush_pending_frames(struct sock *sk)
 	}
 }
 
+static int udpv6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			     int addr_len)
+{
+	/* The following checks are replicated from __ip6_datagram_connect()
+	 * and intended to prevent BPF program called below from accessing
+	 * bytes that are out of the bound specified by user in addr_len.
+	 */
+	if (uaddr->sa_family == AF_INET) {
+		if (__ipv6_only_sock(sk))
+			return -EAFNOSUPPORT;
+		return udp_pre_connect(sk, uaddr, addr_len);
+	}
+
+	if (addr_len < SIN6_LEN_RFC2133)
+		return -EINVAL;
+
+	return BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr);
+}
+
 /**
  *	udp6_hwcsum_outgoing  -  handle outgoing HW checksumming
  *	@sk:	socket we are sending on
@@ -1512,6 +1531,7 @@ struct proto udpv6_prot = {
 	.name		   = "UDPv6",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
+	.pre_connect	   = udpv6_pre_connect,
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2018-03-14  3:39 ` [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14  3:39 ` [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind Alexei Starovoitov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT
attach types.

Try to connect(2) to specified IP:port and test that:
* remote IP:port pair is overridden;
* local end of connection is bound to specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Example:
  # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
  [1] 271
  # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
  Wait for testing IPv4/IPv6 to become available ... OK
  Attached bind4 program.
  Attached connect4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 45780)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 44708)
  Attached bind6 program.
  Attached connect6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) .
             Actual: connect(::1, 6666) from (::6, 37510)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) .
             Actual: connect(::1, 6666) from (::6, 51749)
  ### SUCCESS
  # egrep 'connect\(.*AF_INET' connect.trace | \
      egrep -vw 'htons\(1025\)' | fold -b -s -w 72
  295   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  295   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  295   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  295   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  # fg
  tcpdump -pn -i lo -w connect.pcap 2> /dev/null
  # tcpdump -r connect.pcap -n tcp | cut -c 1-72
  reading from file connect.pcap, link-type EN10MB (Ethernet)
  17:20:03.346047 IP 127.0.0.4.45780 > 127.0.0.1.4444: Flags [S], seq 2460
  17:20:03.346084 IP 127.0.0.1.4444 > 127.0.0.4.45780: Flags [S.], seq 320
  17:20:03.346110 IP 127.0.0.4.45780 > 127.0.0.1.4444: Flags [.], ack 1, w
  17:20:03.347218 IP 127.0.0.1.4444 > 127.0.0.4.45780: Flags [R.], seq 1,
  17:20:03.356698 IP6 ::6.37510 > ::1.6666: Flags [S], seq 2155424486, win
  17:20:03.356733 IP6 ::1.6666 > ::6.37510: Flags [S.], seq 1308562754, ac
  17:20:03.356752 IP6 ::6.37510 > ::1.6666: Flags [.], ack 1, win 342, opt
  17:20:03.357639 IP6 ::1.6666 > ::6.37510: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h                | 15 +++++-
 tools/testing/selftests/bpf/Makefile          |  5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
 tools/testing/selftests/bpf/connect4_prog.c   | 45 ++++++++++++++++
 tools/testing/selftests/bpf/connect6_prog.c   | 61 +++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.c  | 78 +++++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.sh | 57 ++++++++++++++++++++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
 create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
 create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a6af06bb5efb..11e1b633808a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -135,6 +135,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_CGROUP_INET4_BIND,
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
+	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
 };
 
 enum bpf_attach_type {
@@ -147,6 +149,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_DEVICE,
 	BPF_CGROUP_INET4_BIND,
 	BPF_CGROUP_INET6_BIND,
+	BPF_CGROUP_INET4_CONNECT,
+	BPF_CGROUP_INET6_CONNECT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -700,6 +704,14 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_bind(ctx, addr, addr_len)
+ *     Bind socket to address. Only binding to IP is supported, no port can be
+ *     set in addr.
+ *     @ctx: pointer to context of type bpf_sock_addr
+ *     @addr: pointer to struct sockaddr to bind socket to
+ *     @addr_len: length of sockaddr structure
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -761,7 +773,8 @@ union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(bind),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f319b67fd0f6..a3f8d40647f2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -21,14 +21,15 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
-	sample_map_ret0.o test_tcpbpf_kern.o
+	sample_map_ret0.o test_tcpbpf_kern.o connect4_prog.o connect6_prog.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
 	test_libbpf.sh \
 	test_xdp_redirect.sh \
 	test_xdp_meta.sh \
-	test_offload.py
+	test_offload.py \
+	test_sock_addr.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index dde2c11d7771..edf4971554e2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -86,6 +86,8 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
 	(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
 	(void *) BPF_FUNC_override_return;
+static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
+	(void *) BPF_FUNC_bind;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/connect4_prog.c b/tools/testing/selftests/bpf/connect4_prog.c
new file mode 100644
index 000000000000..d6507e504f91
--- /dev/null
+++ b/tools/testing/selftests/bpf/connect4_prog.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <string.h>
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC_REWRITE_IP4		0x7f000004U
+#define DST_REWRITE_IP4		0x7f000001U
+#define DST_REWRITE_PORT4	4444
+
+int _version SEC("version") = 1;
+
+SEC("connect4")
+int connect_v4_prog(struct bpf_sock_addr *ctx)
+{
+	struct sockaddr_in sa;
+
+	/* Rewrite destination. */
+	ctx->user_ip4 = bpf_htonl(DST_REWRITE_IP4);
+	ctx->user_port = bpf_htons(DST_REWRITE_PORT4);
+
+	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
+		///* Rewrite source. */
+		memset(&sa, 0, sizeof(sa));
+
+		sa.sin_family = AF_INET;
+		sa.sin_port = bpf_htons(0);
+		sa.sin_addr.s_addr = bpf_htonl(SRC_REWRITE_IP4);
+
+		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+			return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/connect6_prog.c b/tools/testing/selftests/bpf/connect6_prog.c
new file mode 100644
index 000000000000..553b2f630c88
--- /dev/null
+++ b/tools/testing/selftests/bpf/connect6_prog.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <string.h>
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC_REWRITE_IP6_0	0
+#define SRC_REWRITE_IP6_1	0
+#define SRC_REWRITE_IP6_2	0
+#define SRC_REWRITE_IP6_3	6
+
+#define DST_REWRITE_IP6_0	0
+#define DST_REWRITE_IP6_1	0
+#define DST_REWRITE_IP6_2	0
+#define DST_REWRITE_IP6_3	1
+
+#define DST_REWRITE_PORT6	6666
+
+int _version SEC("version") = 1;
+
+SEC("connect6")
+int connect_v6_prog(struct bpf_sock_addr *ctx)
+{
+	struct sockaddr_in6 sa;
+
+	/* Rewrite destination. */
+	ctx->user_ip6[0] = bpf_htonl(DST_REWRITE_IP6_0);
+	ctx->user_ip6[1] = bpf_htonl(DST_REWRITE_IP6_1);
+	ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
+	ctx->user_ip6[3] = bpf_htonl(DST_REWRITE_IP6_3);
+
+	ctx->user_port = bpf_htons(DST_REWRITE_PORT6);
+
+	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
+		/* Rewrite source. */
+		memset(&sa, 0, sizeof(sa));
+
+		sa.sin6_family = AF_INET6;
+		sa.sin6_port = bpf_htons(0);
+
+		sa.sin6_addr.s6_addr32[0] = bpf_htonl(SRC_REWRITE_IP6_0);
+		sa.sin6_addr.s6_addr32[1] = bpf_htonl(SRC_REWRITE_IP6_1);
+		sa.sin6_addr.s6_addr32[2] = bpf_htonl(SRC_REWRITE_IP6_2);
+		sa.sin6_addr.s6_addr32[3] = bpf_htonl(SRC_REWRITE_IP6_3);
+
+		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+			return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index 18ea250484dc..d4a7ba1242ab 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -12,10 +12,13 @@
 #include <linux/filter.h>
 
 #include <bpf/bpf.h>
+#include <bpf/libbpf.h>
 
 #include "cgroup_helpers.h"
 
 #define CG_PATH	"/foo"
+#define CONNECT4_PROG_PATH	"./connect4_prog.o"
+#define CONNECT6_PROG_PATH	"./connect6_prog.o"
 
 #define SERV4_IP		"192.168.1.254"
 #define SERV4_REWRITE_IP	"127.0.0.1"
@@ -245,6 +248,31 @@ static int bind6_prog_load(void)
 			  "bind() for AF_INET6");
 }
 
+static int connect_prog_load_path(const char *path, enum bpf_prog_type type)
+{
+	struct bpf_object *obj;
+	int prog_fd;
+
+	if (bpf_prog_load(path, type, &obj, &prog_fd)) {
+		log_err(">>> Loading %s program error.\n", path);
+		return -1;
+	}
+
+	return prog_fd;
+}
+
+static int connect4_prog_load(void)
+{
+	return connect_prog_load_path(CONNECT4_PROG_PATH,
+				      BPF_PROG_TYPE_CGROUP_INET4_CONNECT);
+}
+
+static int connect6_prog_load(void)
+{
+	return connect_prog_load_path(CONNECT6_PROG_PATH,
+				      BPF_PROG_TYPE_CGROUP_INET6_CONNECT);
+}
+
 static void print_ip_port(int sockfd, info_fn fn, const char *fmt)
 {
 	char addr_buf[INET_NTOP_BUF];
@@ -281,6 +309,11 @@ static void print_local_ip_port(int sockfd, const char *fmt)
 	print_ip_port(sockfd, getsockname, fmt);
 }
 
+static void print_remote_ip_port(int sockfd, const char *fmt)
+{
+	print_ip_port(sockfd, getpeername, fmt);
+}
+
 static int start_server(int type, const struct sockaddr_storage *addr,
 			socklen_t addr_len)
 {
@@ -315,6 +348,39 @@ static int start_server(int type, const struct sockaddr_storage *addr,
 	return fd;
 }
 
+static int connect_to_server(int type, const struct sockaddr_storage *addr,
+			     socklen_t addr_len)
+{
+	int domain;
+	int fd;
+
+	domain = addr->ss_family;
+
+	if (domain != AF_INET && domain != AF_INET6) {
+		log_err("Unsupported address family");
+		return -1;
+	}
+
+	fd = socket(domain, type, 0);
+	if (fd == -1) {
+		log_err("Failed to creating client socket");
+		return -1;
+	}
+
+	if (connect(fd, (const struct sockaddr *)addr, addr_len) == -1) {
+		log_err("Fail to connect to server");
+		goto err;
+	}
+
+	print_remote_ip_port(fd, "\t   Actual: connect(%s, %d)");
+	print_local_ip_port(fd, " from (%s, %d)\n");
+
+	return 0;
+err:
+	close(fd);
+	return -1;
+}
+
 static void print_test_case_num(int domain, int type)
 {
 	static int test_num;
@@ -347,6 +413,10 @@ static int run_test_case(int domain, int type, const char *ip,
 	if (servfd == -1)
 		goto err;
 
+	printf("\tRequested: connect(%s, %d) from (*, *) ..\n", ip, port);
+	if (connect_to_server(type, &addr, addr_len))
+		goto err;
+
 	goto out;
 err:
 	err = -1;
@@ -421,11 +491,13 @@ static int run_test(void)
 
 	struct program inet6_progs[] = {
 		{BPF_CGROUP_INET6_BIND, bind6_prog_load, -1, "bind6"},
+		{BPF_CGROUP_INET6_CONNECT, connect6_prog_load, -1, "connect6"},
 	};
 	inet6_prog_cnt = sizeof(inet6_progs) / sizeof(struct program);
 
 	struct program inet_progs[] = {
 		{BPF_CGROUP_INET4_BIND, bind4_prog_load, -1, "bind4"},
+		{BPF_CGROUP_INET4_CONNECT, connect4_prog_load, -1, "connect4"},
 	};
 	inet_prog_cnt = sizeof(inet_progs) / sizeof(struct program);
 
@@ -459,5 +531,11 @@ static int run_test(void)
 
 int main(int argc, char **argv)
 {
+	if (argc < 2) {
+		fprintf(stderr,
+			"%s has to be run via %s.sh. Skip direct run.\n",
+			argv[0], argv[0]);
+		exit(0);
+	}
 	return run_test();
 }
diff --git a/tools/testing/selftests/bpf/test_sock_addr.sh b/tools/testing/selftests/bpf/test_sock_addr.sh
new file mode 100755
index 000000000000..c6e1dcf992c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_addr.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+
+set -eu
+
+ping_once()
+{
+	ping -q -c 1 -W 1 ${1%%/*} >/dev/null 2>&1
+}
+
+wait_for_ip()
+{
+	local _i
+	echo -n "Wait for testing IPv4/IPv6 to become available "
+	for _i in $(seq ${MAX_PING_TRIES}); do
+		echo -n "."
+		if ping_once ${TEST_IPv4} && ping_once ${TEST_IPv6}; then
+			echo " OK"
+			return
+		fi
+	done
+	echo 1>&2 "ERROR: Timeout waiting for test IP to become available."
+	exit 1
+}
+
+setup()
+{
+	# Create testing interfaces not to interfere with current environment.
+	ip link add dev ${TEST_IF} type veth peer name ${TEST_IF_PEER}
+	ip link set ${TEST_IF} up
+	ip link set ${TEST_IF_PEER} up
+
+	ip -4 addr add ${TEST_IPv4} dev ${TEST_IF}
+	ip -6 addr add ${TEST_IPv6} dev ${TEST_IF}
+	wait_for_ip
+}
+
+cleanup()
+{
+	ip link del ${TEST_IF} 2>/dev/null || :
+	ip link del ${TEST_IF_PEER} 2>/dev/null || :
+}
+
+main()
+{
+	trap cleanup EXIT 2 3 6 15
+	setup
+	./test_sock_addr setup_done
+}
+
+BASENAME=$(basename $0 .sh)
+TEST_IF="${BASENAME}1"
+TEST_IF_PEER="${BASENAME}2"
+TEST_IPv4="127.0.0.4/8"
+TEST_IPv6="::6/128"
+MAX_PING_TRIES=5
+
+main
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2018-03-14  3:39 ` [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks Alexei Starovoitov
@ 2018-03-14  3:39 ` Alexei Starovoitov
  2018-03-14 17:13 ` [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks David Ahern
  2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
  7 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team

From: Andrey Ignatov <rdna@fb.com>

"Post-hooks" are hooks that are called right before returning from
sys_bind. At this time IP and port are already allocated and no further
changes to `struct sock` can happen before returning from sys_bind but
BPF program has a chance to inspect the socket and change sys_bind
result.

Specifically it can e.g. inspect what port was allocated and if it
doesn't satisfy some policy, BPF program can force sys_bind to release
that port and return an error to user.

Another example of usage is recording the IP:port pair to some map to
use it in later calls to sys_connect. E.g. if some TCP server inside
cgroup was bound to some IP:port and then some TCP client inside same
cgroup is trying to connect to 127.0.0.1:port then BPF hook for
sys_connect can override the destination and connect application to
IP:port instead of 127.0.0.1:port. That helps forcing all applications
inside a cgroup to use desired IP and not break those applications if
they user e.g. localhost to communicate between each other.

== Implementation details ==

Post-hooks are implemented as two new prog types
`BPF_PROG_TYPE_CGROUP_INET4_POST_BIND` and
`BPF_PROG_TYPE_CGROUP_INET6_POST_BIND` and corresponding attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`.

Separate prog types for IPv4 and IPv6 are introduced to avoid access to
IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
`inet6_bind()` since those fields might not make sense in such cases.

`BPF_PROG_TYPE_CGROUP_SOCK` prog type is not reused because it provides
write access to some `struct sock` fields, but socket must not be
changed in post-hooks for sys_bind.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h |  16 ++++-
 include/linux/bpf_types.h  |   2 +
 include/uapi/linux/bpf.h   |  13 ++++
 kernel/bpf/syscall.c       |  14 ++++
 kernel/bpf/verifier.c      |   2 +
 net/core/filter.c          | 170 ++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/af_inet.c         |   3 +-
 net/ipv6/af_inet6.c        |   3 +-
 8 files changed, 202 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 6b5c25ef1482..693c542632e3 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -98,16 +98,24 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+#define BPF_CGROUP_RUN_SK_PROG(sk, type)				       \
 ({									       \
 	int __ret = 0;							       \
 	if (cgroup_bpf_enabled) {					       \
-		__ret = __cgroup_bpf_run_filter_sk(sk,			       \
-						 BPF_CGROUP_INET_SOCK_CREATE); \
+		__ret = __cgroup_bpf_run_filter_sk(sk, type);		       \
 	}								       \
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET_SOCK_CREATE)
+
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET4_POST_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET6_POST_BIND)
+
 #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) 			       \
 ({									       \
 	int __ret = 0;							       \
@@ -183,6 +191,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 52a571827b9f..23a97978b544 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_POST_BIND, cg_inet4_post_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_POST_BIND, cg_inet6_post_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_CONNECT, cg_inet4_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_CONNECT, cg_inet6_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 441a674f385a..7dcc75a65a97 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -137,6 +137,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
 	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,
 };
 
 enum bpf_attach_type {
@@ -151,6 +153,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET6_BIND,
 	BPF_CGROUP_INET4_CONNECT,
 	BPF_CGROUP_INET6_CONNECT,
+	BPF_CGROUP_INET4_POST_BIND,
+	BPF_CGROUP_INET6_POST_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -903,6 +907,15 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
+	__u32 src_ip4;		/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_port;		/* Allows 4-byte read.
+				 * Stored in network byte order
+				 */
 };
 
 #define XDP_PACKET_HEADROOM 256
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 145de3332e32..2eb941dacbc5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1382,6 +1382,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_POST_BIND;
+		break;
+	case BPF_CGROUP_INET6_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_POST_BIND;
+		break;
 	case BPF_CGROUP_INET4_CONNECT:
 		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
 		break;
@@ -1449,6 +1455,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_POST_BIND;
+		break;
+	case BPF_CGROUP_INET6_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_POST_BIND;
+		break;
 	case BPF_CGROUP_INET4_CONNECT:
 		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
 		break;
@@ -1504,6 +1516,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_SOCK_CREATE:
 	case BPF_CGROUP_INET4_BIND:
 	case BPF_CGROUP_INET6_BIND:
+	case BPF_CGROUP_INET4_POST_BIND:
+	case BPF_CGROUP_INET6_POST_BIND:
 	case BPF_CGROUP_INET4_CONNECT:
 	case BPF_CGROUP_INET6_CONNECT:
 	case BPF_CGROUP_SOCK_OPS:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cda7830a2c1b..84faec85fe3e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3874,6 +3874,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET4_POST_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET6_POST_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET4_CONNECT:
 	case BPF_PROG_TYPE_CGROUP_INET6_CONNECT:
 	case BPF_PROG_TYPE_SOCK_OPS:
diff --git a/net/core/filter.c b/net/core/filter.c
index 916195b86a23..e27196248c10 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3840,6 +3840,62 @@ static bool sock_filter_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool __sock_is_valid_access(unsigned short ctx_family, int off, int size,
+				   enum bpf_access_type type,
+				   struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+	unsigned short requested_family = 0;
+
+	if (off < 0 || off >= sizeof(struct bpf_sock))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sock, src_ip4):
+		requested_family = AF_INET;
+		/* FALLTHROUGH */
+	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
+		if (!requested_family)
+			requested_family = AF_INET6;
+		/* Disallow access to IPv6 fields from IPv4 contex and vise
+		 * versa.
+		 */
+		if (requested_family != ctx_family)
+			return false;
+		bpf_ctx_record_field_size(info, size_default);
+		if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+			return false;
+		break;
+	case bpf_ctx_range(struct bpf_sock, family):
+	case bpf_ctx_range(struct bpf_sock, type):
+	case bpf_ctx_range(struct bpf_sock, protocol):
+	case bpf_ctx_range(struct bpf_sock, src_port):
+		if (size != size_default)
+			return false;
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+static bool sock4_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	return __sock_is_valid_access(AF_INET, off, size, type, info);
+}
+
+static bool sock6_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	return __sock_is_valid_access(AF_INET6, off, size, type, info);
+}
+
 static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write,
 				const struct bpf_prog *prog, int drop_verdict)
 {
@@ -4406,6 +4462,40 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 __sock_convert_ctx_access(enum bpf_access_type type,
+				     const struct bpf_insn *si,
+				     struct bpf_insn *insn_buf,
+				     struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock, family):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, sk_family));
+		break;
+
+	case offsetof(struct bpf_sock, type):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, __sk_flags_offset));
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
+		break;
+
+	case offsetof(struct bpf_sock, protocol):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, __sk_flags_offset));
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
+					SK_FL_PROTO_SHIFT);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
 					  const struct bpf_insn *si,
 					  struct bpf_insn *insn_buf,
@@ -4447,26 +4537,56 @@ static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
 				      offsetof(struct sock, sk_priority));
 		break;
 
-	case offsetof(struct bpf_sock, family):
-		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
+	default:
+		return __sock_convert_ctx_access(type, si, insn_buf, prog,
+						 target_size);
+	}
 
-		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, sk_family));
-		break;
+	return insn - insn_buf;
+}
 
-	case offsetof(struct bpf_sock, type):
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, __sk_flags_offset));
-		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
-		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
-		break;
+static u32 sock_convert_ctx_access(enum bpf_access_type type,
+				   const struct bpf_insn *si,
+				   struct bpf_insn *insn_buf,
+				   struct bpf_prog *prog,
+				   u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
 
-	case offsetof(struct bpf_sock, protocol):
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, __sk_flags_offset));
-		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
-		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_PROTO_SHIFT);
+	switch (si->off) {
+	case offsetof(struct bpf_sock, src_ip4):
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_rcv_saddr,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_rcv_saddr),
+				       target_size));
 		break;
+	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
+		off = si->off;
+		off -= offsetof(struct bpf_sock, src_ip6[0]);
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(
+				struct sock_common,
+				skc_v6_rcv_saddr.s6_addr32[0],
+				FIELD_SIZEOF(struct sock_common,
+					     skc_v6_rcv_saddr.s6_addr32[0]),
+				target_size) + off);
+		break;
+	case offsetof(struct bpf_sock, src_port):
+		*insn++ = BPF_LDX_MEM(
+			BPF_FIELD_SIZEOF(struct sock_common, skc_num),
+			si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_num,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_num),
+				       target_size));
+		break;
+	default:
+		return __sock_convert_ctx_access(type, si, insn_buf, prog,
+						 target_size);
 	}
 
 	return insn - insn_buf;
@@ -5122,6 +5242,24 @@ const struct bpf_verifier_ops cg_sock_verifier_ops = {
 const struct bpf_prog_ops cg_sock_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_post_bind_verifier_ops = {
+	.get_func_proto		= sock_filter_func_proto,
+	.is_valid_access	= sock4_is_valid_access,
+	.convert_ctx_access	= sock_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_post_bind_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_post_bind_verifier_ops = {
+	.get_func_proto		= sock_filter_func_proto,
+	.is_valid_access	= sock6_is_valid_access,
+	.convert_ctx_access	= sock_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_post_bind_prog_ops = {
+};
+
 const struct bpf_verifier_ops cg_inet4_bind_verifier_ops = {
 	.get_func_proto		= inet_bind_func_proto,
 	.is_valid_access	= sock_addr4_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 488fe26ac8e5..28e2e7fdd5b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -521,7 +521,8 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 	/* Make sure we are allowed to bind here. */
 	if ((snum || !(inet->bind_address_no_port ||
 		       force_bind_address_no_port)) &&
-	    sk->sk_prot->get_port(sk, snum)) {
+	    (sk->sk_prot->get_port(sk, snum) ||
+	     BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk))) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
 		err = -EADDRINUSE;
 		goto out_release_sock;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 13110bee5c14..473cc55a3a7d 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -414,7 +414,8 @@ int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 	/* Make sure we are allowed to bind here. */
 	if ((snum || !(inet->bind_address_no_port ||
 		       force_bind_address_no_port)) &&
-	    sk->sk_prot->get_port(sk, snum)) {
+	    (sk->sk_prot->get_port(sk, snum) ||
+	     BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk))) {
 		sk->sk_ipv6only = saved_ipv6only;
 		inet_reset_saddr(sk);
 		err = -EADDRINUSE;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2018-03-14  3:39 ` [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind Alexei Starovoitov
@ 2018-03-14 17:13 ` David Ahern
  2018-03-14 18:00   ` Alexei Starovoitov
  2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
  7 siblings, 1 reply; 23+ messages in thread
From: David Ahern @ 2018-03-14 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, netdev, kernel-team

On 3/13/18 8:39 PM, Alexei Starovoitov wrote:
> For our container management we've been using complicated and fragile setup
> consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> all containerized applications.
> The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable.

Why does VRF and the cgroup option to bind sockets to the VRF not solve
this problem for you? The VRF limits the source address choices.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
  2018-03-14 17:13 ` [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks David Ahern
@ 2018-03-14 18:00   ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:00 UTC (permalink / raw)
  To: David Ahern; +Cc: Alexei Starovoitov, davem, daniel, netdev, kernel-team

On Wed, Mar 14, 2018 at 10:13:22AM -0700, David Ahern wrote:
> On 3/13/18 8:39 PM, Alexei Starovoitov wrote:
> > For our container management we've been using complicated and fragile setup
> > consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> > all containerized applications.
> > The setup involves per-container IPs, policy, etc, so traditional
> > network-only solutions that involve VRFs, netns, acls are not applicable.
> 
> Why does VRF and the cgroup option to bind sockets to the VRF not solve
> this problem for you? The VRF limits the source address choices.

answered in reply to Eric. Pls follow up there if it's still not clear.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
  2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2018-03-14 17:13 ` [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks David Ahern
@ 2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
  2018-03-14 18:01   ` Alexei Starovoitov
  7 siblings, 1 reply; 23+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2018-03-14 17:22 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, daniel, linux-netdev, kernel-team

On Tue, Mar 13, 2018 at 8:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> For our container management we've been using complicated and fragile setup
> consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> all containerized applications.
> The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable.
You can keep the policies per cgroup but move the ip from cgroup to
net-ns and then none of these ebpf hacks are required since cgroup and
namespaces are orthogonal you can use cgroups in conjunction with
namespaces.

> Changing apps is not possible and LD_PRELOAD doesn't work
> for apps that don't use glibc like java and golang.
> BPF+cgroup looks to be the best solution for this problem.
> Hence we introduce 3 hooks:
> - at entry into sys_bind and sys_connect
>   to let bpf prog look and modify 'struct sockaddr' provided
>   by user space and fail bind/connect when appropriate
> - post sys_bind after port is allocated
>
> The approach works great and has zero overhead for anyone who doesn't
> use it and very low overhead when deployed.
>
> The main question for Daniel and Dave is what approach to take
> with prog types...
>
> In this patch set we introduce 6 new program types to make user
> experience easier:
>   BPF_PROG_TYPE_CGROUP_INET4_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_BIND,
>   BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,
>
> since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
> and different prog type for v4 and v6 helps verifier reject such access
> at load time.
> Similarly bind vs connect are two different prog types too,
> since only sys_connect programs can call new bpf_bind() helper.
>
> This approach is very different from tcp-bpf where single
> 'struct bpf_sock_ops' and single prog type is used for different hooks.
> The field checks are done at run-time instead of load time.
>
> I think the approach taken by this patch set is justified,
> but we may do better if we extend BPF_PROG_ATTACH cmd
> with log_buf + log_size, then we should be able to combine
> bind+connect+v4+v6 into single program type.
> The idea that at load time the verifier will remember a bitmask
> of fields in bpf_sock_addr used by the program and helpers
> that program used, then at attach time we can check that
> hook is compatible with features used by the program and
> report human readable error message back via log_buf.
> We cannot do this right now with just EINVAL, since combinations
> of errors like 'using user_ip6 field but attaching to v4 hook'
> are too high to express as errno.
> This would be bigger change. If you folks think it's worth it
> we can go with this approach or if you think 6 new prog types
> is not too bad, we can leave the patch as-is.
> Comments?
> Other comments on patches are welcome.
>
> Andrey Ignatov (6):
>   bpf: Hooks for sys_bind
>   selftests/bpf: Selftest for sys_bind hooks
>   net: Introduce __inet_bind() and __inet6_bind
>   bpf: Hooks for sys_connect
>   selftests/bpf: Selftest for sys_connect hooks
>   bpf: Post-hooks for sys_bind
>
>  include/linux/bpf-cgroup.h                    |  68 +++-
>  include/linux/bpf_types.h                     |   6 +
>  include/linux/filter.h                        |  10 +
>  include/net/inet_common.h                     |   2 +
>  include/net/ipv6.h                            |   2 +
>  include/net/sock.h                            |   3 +
>  include/net/udp.h                             |   1 +
>  include/uapi/linux/bpf.h                      |  52 ++-
>  kernel/bpf/cgroup.c                           |  36 ++
>  kernel/bpf/syscall.c                          |  42 ++
>  kernel/bpf/verifier.c                         |   6 +
>  net/core/filter.c                             | 479 ++++++++++++++++++++++-
>  net/ipv4/af_inet.c                            |  60 ++-
>  net/ipv4/tcp_ipv4.c                           |  16 +
>  net/ipv4/udp.c                                |  14 +
>  net/ipv6/af_inet6.c                           |  47 ++-
>  net/ipv6/tcp_ipv6.c                           |  16 +
>  net/ipv6/udp.c                                |  20 +
>  tools/include/uapi/linux/bpf.h                |  39 +-
>  tools/testing/selftests/bpf/Makefile          |   8 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
>  tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
>  tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
>  tools/testing/selftests/bpf/test_sock_addr.c  | 541 ++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
>  25 files changed, 1580 insertions(+), 53 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
>  create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
>  create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
>  create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh
>
> --
> 2.9.5
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
  2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
@ 2018-03-14 18:01   ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:01 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: Alexei Starovoitov, David Miller, daniel, linux-netdev,
	kernel-team

On Wed, Mar 14, 2018 at 10:22:03AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Tue, Mar 13, 2018 at 8:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> > For our container management we've been using complicated and fragile setup
> > consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> > all containerized applications.
> > The setup involves per-container IPs, policy, etc, so traditional
> > network-only solutions that involve VRFs, netns, acls are not applicable.
> You can keep the policies per cgroup but move the ip from cgroup to
> net-ns and then none of these ebpf hacks are required since cgroup and
> namespaces are orthogonal you can use cgroups in conjunction with
> namespaces.

answered in reply to Eric. Pls follow up there if it's still not clear.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
@ 2018-03-14 18:41 Alexei Starovoitov
  2018-03-15  0:17 ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team

On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortuantely that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately all hierarchical namespace abstraction don't work.
> To run an application inside cgroup container that was not written
> with containers in mind we have to make an illusion of running
> in non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite is providing
> connectivity and load balancing feature that you simply cannot do
> with hierarchical networking primitives.

have to explain this a bit further...
We also considered hacking these 'connectivity holes' in
netns and/or vrf, but that would be real layering violation,
since clean l2, l3 abstraction would suddenly support
something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches
through the layers and connects l2 abstraction of netns
at l3 layer, this is not something kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.
Instead bpf programs at bind/connect time _help_
applications discover and connect to each other.
All containers are running in init_nens and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do. The whole system is
clean from network point of view.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-14 18:41 [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
@ 2018-03-15  0:17 ` Eric Dumazet
  2018-03-15  3:37   ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2018-03-15  0:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team



On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortuantely that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstraction don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
> 
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_nens and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.


We apparently missed something when deploying ipvlan and one netns per
container/job

Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.

It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead ?

This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-15  0:17 ` Eric Dumazet
@ 2018-03-15  3:37   ` Alexei Starovoitov
  2018-03-15 14:14     ` Jiri Benc
  2018-03-15 16:22     ` Mahesh Bandewar (महेश बंडेवार)
  0 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-15  3:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team

On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >>> It seems this is exactly the case where a netns would be the correct answer.
> >>
> >> Unfortuantely that's not the case. That's what I tried to explain
> >> in the cover letter:
> >> "The setup involves per-container IPs, policy, etc, so traditional
> >> network-only solutions that involve VRFs, netns, acls are not applicable."
> >> To elaborate more on that:
> >> netns is l2 isolation.
> >> vrf is l3 isolation.
> >> whereas to containerize an application we need to punch connectivity holes
> >> in these layered techniques.
> >> We also considered resurrecting Hannes's afnetns work
> >> and even went as far as designing a new namespace for L4 isolation.
> >> Unfortunately all hierarchical namespace abstraction don't work.
> >> To run an application inside cgroup container that was not written
> >> with containers in mind we have to make an illusion of running
> >> in non-containerized environment.
> >> In some cases we remember the port and container id in the post-bind hook
> >> in a bpf map and when some other task in a different container is trying
> >> to connect to a service we need to know where this service is running.
> >> It can be remote and can be local. Both client and service may or may not
> >> be written with containers in mind and this sockaddr rewrite is providing
> >> connectivity and load balancing feature that you simply cannot do
> >> with hierarchical networking primitives.
> > 
> > have to explain this a bit further...
> > We also considered hacking these 'connectivity holes' in
> > netns and/or vrf, but that would be real layering violation,
> > since clean l2, l3 abstraction would suddenly support
> > something that breaks through the layers.
> > Just like many consider ipvlan a bad hack that punches
> > through the layers and connects l2 abstraction of netns
> > at l3 layer, this is not something kernel should ever do.
> > We really didn't want another ipvlan-like hack in the kernel.
> > Instead bpf programs at bind/connect time _help_
> > applications discover and connect to each other.
> > All containers are running in init_nens and there are no vrfs.
> > After bind/connect the normal fib/neighbor core networking
> > logic works as it should always do. The whole system is
> > clean from network point of view.
> 
> 
> We apparently missed something when deploying ipvlan and one netns per
> container/job

Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.
I couldn't find the complete link. This one mentions some of the issues:
https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
Since ipvlan works for you, great, but it's clearly a layering violation.
ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
To me it's a definition 'punch connectivity hole' in L2 abstraction.
In normal L2 setup of netns+veth the traffic from one netns should
have went into another netns via full L2. ipvlan cheats by giving
L3 connectivity. It's not clean to me. There are still neighbour
tables in netnses that are duplicated.
Because netns is L2 there is full requeuing for traffic across netnses.
I guess google doesn't prioritize container to container traffic
while outside into netns via ipvlan works ok similar to bond, but
imo it's cheating too.
imo afnetns would have been much better alternative for your
use case without ipvlan pitfalls, but as you said ipvlan already
in the tree and afnetns is not.
With afnetns early demux would have worked not only for traffic from
the network, but for traffic across afnetns-es.

> I find netns isolation very clean, powerful, and it is there already.

netns+veth is a clean abstraction, but netns+ipvlan is imo not.
imo VRF is another clean L3 abstraction. Yet some folks tried
to do VRF-like things with netns.
David Ahern wrote nice blog about issues with that.
I suspect VRF also could have worked for google use case
and would have been easier to use than netns+ipvlan.
But since ipvlan works for you in the current shape, great,
I'm not going to argue further.
Let's agree to disagree on cleanliness of the solution.

> It also works with UDP just fine. Are you considering adding a hook
> later for sendmsg() (unconnected socket or not), or do you want to use
> the existing one in ip_finish_output(), adding per-packet overhead ?

Currently that's indeed the case. Existing cgroup-bpf hooks
at ip_finish_output work for many use cases, but per-packet overhead
is bad. With bind/connect hooks we avoid that overhead for
good traffic (which is tcp and connected udp). We still need
to solve it for unconnected udp. Rough idea is to do similar
sockaddr rewrite/drop in unconnected part of udp_sendmsg.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-15  3:37   ` Alexei Starovoitov
@ 2018-03-15 14:14     ` Jiri Benc
  2018-03-15 16:22     ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 0 replies; 23+ messages in thread
From: Jiri Benc @ 2018-03-15 14:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Alexei Starovoitov, David S. Miller,
	Daniel Borkmann, Network Development, Kernel Team

On Wed, 14 Mar 2018 20:37:00 -0700, Alexei Starovoitov wrote:
> Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.
> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html

ipvlan improved a lot since that time :-) And Paolo has recently fixed
the remaining issues we were aware of.

 Jiri

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
  2018-03-15  3:37   ` Alexei Starovoitov
  2018-03-15 14:14     ` Jiri Benc
@ 2018-03-15 16:22     ` Mahesh Bandewar (महेश बंडेवार)
  1 sibling, 0 replies; 23+ messages in thread
From: Mahesh Bandewar (महेश बंडेवार) @ 2018-03-15 16:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Alexei Starovoitov, David S. Miller,
	Daniel Borkmann, Network Development, Kernel Team

On Wed, Mar 14, 2018 at 8:37 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Mar 14, 2018 at 05:17:54PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
>> > On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
>> > <alexei.starovoitov@gmail.com> wrote:
>> >>
>> >>> It seems this is exactly the case where a netns would be the correct answer.
>> >>
>> >> Unfortuantely that's not the case. That's what I tried to explain
>> >> in the cover letter:
>> >> "The setup involves per-container IPs, policy, etc, so traditional
>> >> network-only solutions that involve VRFs, netns, acls are not applicable."
>> >> To elaborate more on that:
>> >> netns is l2 isolation.
>> >> vrf is l3 isolation.
>> >> whereas to containerize an application we need to punch connectivity holes
>> >> in these layered techniques.
>> >> We also considered resurrecting Hannes's afnetns work
>> >> and even went as far as designing a new namespace for L4 isolation.
>> >> Unfortunately all hierarchical namespace abstraction don't work.
>> >> To run an application inside cgroup container that was not written
>> >> with containers in mind we have to make an illusion of running
>> >> in non-containerized environment.
>> >> In some cases we remember the port and container id in the post-bind hook
>> >> in a bpf map and when some other task in a different container is trying
>> >> to connect to a service we need to know where this service is running.
>> >> It can be remote and can be local. Both client and service may or may not
>> >> be written with containers in mind and this sockaddr rewrite is providing
>> >> connectivity and load balancing feature that you simply cannot do
>> >> with hierarchical networking primitives.
>> >
>> > have to explain this a bit further...
>> > We also considered hacking these 'connectivity holes' in
>> > netns and/or vrf, but that would be real layering violation,
>> > since clean l2, l3 abstraction would suddenly support
>> > something that breaks through the layers.
>> > Just like many consider ipvlan a bad hack that punches
>> > through the layers and connects l2 abstraction of netns
>> > at l3 layer, this is not something kernel should ever do.
>> > We really didn't want another ipvlan-like hack in the kernel.
>> > Instead bpf programs at bind/connect time _help_
>> > applications discover and connect to each other.
>> > All containers are running in init_nens and there are no vrfs.
>> > After bind/connect the normal fib/neighbor core networking
>> > logic works as it should always do. The whole system is
>> > clean from network point of view.
>>
>>
>> We apparently missed something when deploying ipvlan and one netns per
>> container/job
>
> Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.

I had a long discussion with Hanness and there are two pending  issues
(discounting minor bug fixes / improvement). (a) the
multicast-group-membership and (b) early demux.
multicast group membership is just a matter of putting some code there
to fix it. While early-demux is little harder without violating
isolation boundaries. To me isolation is critical / important and if
we find a right solution that doesn't violate isolation, we'd fix it.

> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html
> Since ipvlan works for you, great, but it's clearly a layering violation.
> ipvlan connects L2 namespaces via L3 by doing its own fib lookups.
> To me it's a definition 'punch connectivity hole' in L2 abstraction.
> In normal L2 setup of netns+veth the traffic from one netns should
> have went into another netns via full L2. ipvlan cheats by giving
> L3 connectivity. It's not clean to me.

IPvlan supports three different modes and you have mixed all of them
while explaining your understanding of IPvlan. Probably one needs to
digest all these modes and evaluate them in the context of their use
case. Well, I'm not even going to attempt to explain the differences,
if you were serious you could have figured it out.

There are lots of use cases and people use it in interesting ways.
Each  case can be better handled by using either VRF, or macvlan, or
IPvlan or whatever is out there. It would be childish to say one
use-case is better than others as these are *different* use cases. All
these solutions come with their own caveats and you choose what you
can live with. Well, you can always improve and I can see Redhat folks
are doing it and I appreciate their efforts.

Like I said there are several different ways to make this work with
namespaces in much cleaner way and IPvlan does not need to be
involved. However adding another eBPF hook just because we can in a
hackish way is *not* the right way. Especially when a problem has
already been solved (with namespace) these 2000 lines dont deserve to
be in kernel. eBPF is a good tool and there is a thin line between
using it appropriately and misusing it. I don't want to argue and we
can agree to disagree!


> There are still neighbour
> tables in netnses that are duplicated.
> Because netns is L2 there is full requeuing for traffic across netnses.
> I guess google doesn't prioritize container to container traffic
> while outside into netns via ipvlan works ok similar to bond, but
> imo it's cheating too.
> imo afnetns would have been much better alternative for your
> use case without ipvlan pitfalls, but as you said ipvlan already
> in the tree and afnetns is not.
> With afnetns early demux would have worked not only for traffic from
> the network, but for traffic across afnetns-es.
>
Isolation is a critical piece of our puzzle and none of the
suggestions you have given solve it. cgroups clearly don't! However,
those could be good solutions in some other use-cases.

>> I find netns isolation very clean, powerful, and it is there already.
>
> netns+veth is a clean abstraction, but netns+ipvlan is imo not.
> imo VRF is another clean L3 abstraction. Yet some folks tried
> to do VRF-like things with netns.
> David Ahern wrote nice blog about issues with that.
> I suspect VRF also could have worked for google use case
> and would have been easier to use than netns+ipvlan.
> But since ipvlan works for you in the current shape, great,
> I'm not going to argue further.
> Let's agree to disagree on cleanliness of the solution.
>
>> It also works with UDP just fine. Are you considering adding a hook
>> later for sendmsg() (unconnected socket or not), or do you want to use
>> the existing one in ip_finish_output(), adding per-packet overhead ?
>
> Currently that's indeed the case. Existing cgroup-bpf hooks
> at ip_finish_output work for many use cases, but per-packet overhead
> is bad. With bind/connect hooks we avoid that overhead for
> good traffic (which is tcp and connected udp). We still need
> to solve it for unconnected udp. Rough idea is to do similar
> sockaddr rewrite/drop in unconnected part of udp_sendmsg.
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-03-15 16:22 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
2018-03-14  6:21   ` Eric Dumazet
2018-03-14 18:00     ` Alexei Starovoitov
2018-03-14 14:37   ` Daniel Borkmann
2018-03-14 14:55     ` Daniel Borkmann
2018-03-14 18:11     ` Alexei Starovoitov
2018-03-14 23:27       ` Daniel Borkmann
2018-03-15  0:29         ` Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind Alexei Starovoitov
2018-03-14 17:13 ` [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks David Ahern
2018-03-14 18:00   ` Alexei Starovoitov
2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
2018-03-14 18:01   ` Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2018-03-14 18:41 [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
2018-03-15  0:17 ` Eric Dumazet
2018-03-15  3:37   ` Alexei Starovoitov
2018-03-15 14:14     ` Jiri Benc
2018-03-15 16:22     ` Mahesh Bandewar (महेश बंडेवार)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).