* [RFC PATCH 0/5] Add eBPF hooks for cgroups
@ 2016-08-17 14:00 Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 1/5] bpf: add new prog type for cgroup socket filtering Daniel Mack
` (5 more replies)
0 siblings, 6 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
This patch set allows eBPF programs for network filtering and
accounting to be attached to cgroups, so that they apply to all sockets
of all tasks placed in that cgroup. The logic is also designed to be
extendable to other cgroup-based eBPF use cases.
In short, the patch set adds the following:
* A new eBPF program type BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
which is identical to BPF_PROG_TYPE_SOCKET_FILTER for now
* CONFIG_CGROUP_BPF, which guards every new detail affected by this
patch set
* Two pointers to struct cgroup to store eBPF programs for socket
filtering
* Two new bpf(2) syscall commands, BPF_PROG_ATTACH and BPF_PROG_DETACH,
which are kept generic at the kernel API level, in the sense that they
can be reused to attach programs to entities other than cgroups. All
that is passed in to the kernel is a file descriptor (see the usage
sketch after this list).
* A hook in the networking ingress path to run these programs
* A userspace program that demonstrates how to use that new feature
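To give an idea of the resulting API, attaching an ingress program to a
cgroup roughly boils down to the following (abridged sketch of the sample
in patch 5/5; prog_fd is an fd obtained via BPF_PROG_LOAD, and the cgroup
path is made up):
    int cg_fd = open("/sys/fs/cgroup/test", O_DIRECTORY | O_RDONLY);
    union bpf_attr attr = {
        .target_fd     = cg_fd,
        .attach_bpf_fd = prog_fd,
        .attach_type   = BPF_ATTACH_TYPE_CGROUP_INGRESS,
    };
    syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));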
I'd appreciate some feedback on this. Pablo has some remaining concerns
about this approach, and I'd like to continue the discussion we had
off-list in the light of this patchset.
Also, I currently lack an idea for a good place to hook up the egress
filter. If anyone has a good idea, I'd be happy to hear about it.
Thanks,
Daniel
Daniel Mack (5):
bpf: add new prog type for cgroup socket filtering
cgroup: add bpf_{e,in}gress pointers
bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
net: filter: run cgroup eBPF programs
samples: bpf: add userspace example for attaching eBPF programs to
cgroups
include/linux/cgroup-defs.h | 6 ++
include/uapi/linux/bpf.h | 15 +++++
init/Kconfig | 8 +++
kernel/bpf/syscall.c | 132 ++++++++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 1 +
kernel/cgroup.c | 9 +++
net/core/filter.c | 50 +++++++++++++++
samples/bpf/Makefile | 2 +
samples/bpf/libbpf.c | 21 +++++++
samples/bpf/libbpf.h | 3 +
samples/bpf/test_cgrp2_attach.c | 136 ++++++++++++++++++++++++++++++++++++++++
11 files changed, 383 insertions(+)
create mode 100644 samples/bpf/test_cgrp2_attach.c
--
2.5.5
^ permalink raw reply [flat|nested] 39+ messages in thread
* [RFC PATCH 1/5] bpf: add new prog type for cgroup socket filtering
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
@ 2016-08-17 14:00 ` Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers Daniel Mack
` (4 subsequent siblings)
5 siblings, 0 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in
terms of checks during the verification process. It may access the skb as
well.
Programs of this type will be attached to cgroups for network filtering
and accounting.
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
include/uapi/linux/bpf.h | 1 +
kernel/bpf/verifier.c | 1 +
net/core/filter.c | 6 ++++++
3 files changed, 8 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da218fe..913b147 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -95,6 +95,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+ BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
};
#define BPF_PSEUDO_MAP_FD 1
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f72f23b..3f3164e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1777,6 +1777,7 @@ static bool may_access_skb(enum bpf_prog_type type)
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+ case BPF_PROG_TYPE_CGROUP_SOCKET_FILTER:
return true;
default:
return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index 5708999..c5d8332 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2733,12 +2733,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
.type = BPF_PROG_TYPE_XDP,
};
+static struct bpf_prog_type_list cg_sk_filter_type __read_mostly = {
+ .ops = &sk_filter_ops,
+ .type = BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+};
+
static int __init register_sk_filter_ops(void)
{
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+ bpf_register_prog_type(&cg_sk_filter_type);
return 0;
}
--
2.5.5
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 1/5] bpf: add new prog type for cgroup socket filtering Daniel Mack
@ 2016-08-17 14:00 ` Daniel Mack
2016-08-17 14:10 ` Tejun Heo
2016-08-17 17:50 ` Alexei Starovoitov
2016-08-17 14:00 ` [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
` (3 subsequent siblings)
5 siblings, 2 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
Add two pointers for eBPF programs to struct cgroup. These will be used
to store programs for ingress and egress for accounting and filtering.
This new feature is guarded by CONFIG_CGROUP_BPF.
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
include/linux/cgroup-defs.h | 6 ++++++
init/Kconfig | 8 ++++++++
kernel/cgroup.c | 9 +++++++++
3 files changed, 23 insertions(+)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..cff3560 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -300,6 +300,12 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+#ifdef CONFIG_CGROUP_BPF
+ /* used by the networking layer */
+ struct bpf_prog *bpf_ingress;
+ struct bpf_prog *bpf_egress;
+#endif
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
};
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..8bf2f3e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,14 @@ config CGROUP_PERF
Say N if unsure.
+config CGROUP_BPF
+ bool "Enable eBPF programs in cgroups"
+ depends on BPF_SYSCALL
+ help
+ This option allows cgroups to accommodate eBPF programs that
+ can be used for network traffic filtering and accounting. See
+ Documentation/networking/filter.txt for more information.
+
config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d1c51b7..65e5701 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -62,6 +62,7 @@
#include <linux/proc_ns.h>
#include <linux/nsproxy.h>
#include <linux/file.h>
+#include <linux/bpf.h>
#include <net/sock.h>
/*
@@ -5461,6 +5462,14 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
for_each_css(css, ssid, cgrp)
kill_css(css);
+#ifdef CONFIG_CGROUP_BPF
+ if (cgrp->bpf_ingress)
+ bpf_prog_put(cgrp->bpf_ingress);
+
+ if (cgrp->bpf_egress)
+ bpf_prog_put(cgrp->bpf_egress);
+#endif
+
/*
* Remove @cgrp directory along with the base files. @cgrp has an
* extra ref on its kn.
--
2.5.5
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 1/5] bpf: add new prog type for cgroup socket filtering Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers Daniel Mack
@ 2016-08-17 14:00 ` Daniel Mack
2016-08-17 14:20 ` Tejun Heo
2016-08-17 16:16 ` Eric Dumazet
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
` (2 subsequent siblings)
5 siblings, 2 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH, which allow attaching eBPF programs to a target.
On the API level, the target could be anything that has an fd in
userspace, hence the field in union bpf_attr is called
'target_fd'.
When called with BPF_ATTACH_TYPE_CGROUP_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory. These
are the only use cases implemented by this patch at this point, but
more can be added.
If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program before installing a new one, which would leave a gap
in which no program is installed at all.
The current implementation walks the tree from the passed cgroup up
to the root. If there is any program of the given type installed in
any of the ancestors, the installation is rejected. This is because
programs subject to restrictions should have no way of escaping if
a higher-level cgroup has installed a program already. This restriction
can be revisited at some later point in time.
The API is guarded by CAP_NET_ADMIN right now, which is also something
that can be relaxed in the future.
The new bpf commands will return -EINVAL for !CONFIG_CGROUP_BPF.
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
include/uapi/linux/bpf.h | 14 +++++
kernel/bpf/syscall.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 146 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 913b147..b8b8925 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+ BPF_PROG_ATTACH,
+ BPF_PROG_DETACH,
};
enum bpf_map_type {
@@ -98,6 +100,11 @@ enum bpf_prog_type {
BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
};
+enum bpf_attach_type {
+ BPF_ATTACH_TYPE_CGROUP_INGRESS,
+ BPF_ATTACH_TYPE_CGROUP_EGRESS,
+};
+
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
@@ -141,6 +148,13 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
+
+ struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+ __u32 target_fd; /* container object to attach to */
+ __u32 attach_bpf_fd; /* eBPF program to attach */
+ __u32 attach_type; /* BPF_ATTACH_TYPE_* */
+ __u64 attach_flags;
+ };
} __attribute__((aligned(8)));
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..036465d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,132 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
}
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+ bool is_ingress = false;
+ int err = 0;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ /* Flags are unused for now */
+ if (attr->attach_flags != 0)
+ return -EINVAL;
+
+ switch (attr->attach_type) {
+
+#ifdef CONFIG_CGROUP_BPF
+ case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+ is_ingress = true;
+ /* fall through */
+
+ case BPF_ATTACH_TYPE_CGROUP_EGRESS: {
+ struct bpf_prog *prog, *old_prog, **progp;
+ struct cgroup_subsys_state *pos;
+ struct cgroup *cgrp;
+
+ prog = bpf_prog_get_type(attr->attach_bpf_fd,
+ BPF_PROG_TYPE_CGROUP_SOCKET_FILTER);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ cgrp = cgroup_get_from_fd(attr->target_fd);
+ if (IS_ERR(cgrp)) {
+ err = PTR_ERR(cgrp);
+ bpf_prog_put(prog);
+ return err;
+ }
+
+ /* Reject installation of a program if any ancestor has one. */
+ for (pos = cgrp->self.parent; pos; pos = pos->parent) {
+ struct cgroup *parent;
+
+ css_get(pos);
+ parent = container_of(pos, struct cgroup, self);
+
+ if ((is_ingress && parent->bpf_ingress) ||
+ (!is_ingress && parent->bpf_egress))
+ err = -EEXIST;
+
+ css_put(pos);
+ }
+
+ if (err < 0) {
+ bpf_prog_put(prog);
+ return err;
+ }
+
+ progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
+
+ rcu_read_lock();
+ old_prog = rcu_dereference(*progp);
+ rcu_assign_pointer(*progp, prog);
+
+ if (old_prog)
+ bpf_prog_put(old_prog);
+
+ rcu_read_unlock();
+ cgroup_put(cgrp);
+
+ break;
+ }
+#endif /* CONFIG_CGROUP_BPF */
+
+ default:
+ return -EINVAL;
+ }
+
+ return err;
+}
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+ int err = 0;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ switch (attr->attach_type) {
+
+#ifdef CONFIG_CGROUP_BPF
+ case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+ case BPF_ATTACH_TYPE_CGROUP_EGRESS: {
+ struct bpf_prog *prog, **progp;
+ struct cgroup *cgrp;
+
+ cgrp = cgroup_get_from_fd(attr->target_fd);
+ if (IS_ERR(cgrp))
+ return PTR_ERR(cgrp);
+
+ progp = attr->attach_type == BPF_ATTACH_TYPE_CGROUP_INGRESS ?
+ &cgrp->bpf_ingress :
+ &cgrp->bpf_egress;
+
+ rcu_read_lock();
+ prog = rcu_dereference(*progp);
+
+ if (prog) {
+ rcu_assign_pointer(*progp, NULL);
+ bpf_prog_put(prog);
+ } else {
+ err = -ENOENT;
+ }
+
+ rcu_read_unlock();
+ cgroup_put(cgrp);
+
+ break;
+ }
+#endif /* CONFIG_CGROUP_BPF */
+
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
@@ -888,6 +1014,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+ case BPF_PROG_ATTACH:
+ err = bpf_prog_attach(&attr);
+ break;
+ case BPF_PROG_DETACH:
+ err = bpf_prog_detach(&attr);
+ break;
default:
err = -EINVAL;
break;
--
2.5.5
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
` (2 preceding siblings ...)
2016-08-17 14:00 ` [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
@ 2016-08-17 14:00 ` Daniel Mack
2016-08-17 14:23 ` Tejun Heo
` (2 more replies)
2016-08-17 14:00 ` [RFC PATCH 5/5] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
2016-08-19 9:19 ` [RFC PATCH 0/5] Add eBPF hooks for cgroups Pablo Neira Ayuso
5 siblings, 3 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
receiving socket has an eBPF program installed, run it from
sk_filter_trim_cap().
eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop it. The programs have access to
the full skb, including the MAC headers.
This patch only implements the call site for ingress packets.
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index c5d8332..a1dd94b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -52,6 +52,44 @@
#include <net/dst.h>
#include <net/sock_reuseport.h>
+#ifdef CONFIG_CGROUP_BPF
+static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
+ enum bpf_attach_type type)
+{
+ struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
+ struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+ struct bpf_prog *prog;
+ int ret = 0;
+
+ rcu_read_lock();
+
+ switch (type) {
+ case BPF_ATTACH_TYPE_CGROUP_EGRESS:
+ prog = rcu_dereference(cgrp->bpf_egress);
+ break;
+ case BPF_ATTACH_TYPE_CGROUP_INGRESS:
+ prog = rcu_dereference(cgrp->bpf_ingress);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ ret = -EINVAL;
+ break;
+ }
+
+ if (prog) {
+ unsigned int offset = skb->data - skb_mac_header(skb);
+
+ __skb_push(skb, offset);
+ ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
+ __skb_pull(skb, offset);
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+#endif /* !CONFIG_CGROUP_BPF */
+
/**
* sk_filter_trim_cap - run a packet through a socket filter
* @sk: sock associated with &sk_buff
@@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
+#ifdef CONFIG_CGROUP_BPF
+ err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
+ if (err)
+ return err;
+#endif
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
--
2.5.5
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFC PATCH 5/5] samples: bpf: add userspace example for attaching eBPF programs to cgroups
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
` (3 preceding siblings ...)
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
@ 2016-08-17 14:00 ` Daniel Mack
2016-08-19 9:19 ` [RFC PATCH 0/5] Add eBPF hooks for cgroups Pablo Neira Ayuso
5 siblings, 0 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:00 UTC (permalink / raw)
To: htejun, daniel, ast; +Cc: davem, kafai, fw, pablo, harald, netdev, Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:
* Create an arraymap in the kernel with 4-byte keys and 8-byte values
* Load eBPF program
The eBPF program accesses the map passed in to store two pieces of
information. The number of invocations of the program, which maps
to the number of packets received, is stored to key 0. Key 1 is
incremented on each iteration by the number of bytes stored in
the skb.
* Detach any eBPF program previously attached to the cgroup
* Attach the new program to the cgroup using BPF_PROG_ATTACH
* Every second, read map[0] and map[1] to see how many bytes and
packets were seen on any socket of tasks in the given cgroup.
libbpf gained two new wrappers for the new syscall commands.
Signed-off-by: Daniel Mack <daniel@zonque.org>
---
samples/bpf/Makefile | 2 +
samples/bpf/libbpf.c | 21 +++++++
samples/bpf/libbpf.h | 3 +
samples/bpf/test_cgrp2_attach.c | 136 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 162 insertions(+)
create mode 100644 samples/bpf/test_cgrp2_attach.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 90ebf7d..2f1820c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
hostprogs-y += map_perf_test
hostprogs-y += test_overhead
hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
hostprogs-y += xdp1
hostprogs-y += xdp2
@@ -46,6 +47,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
# reuse xdp1 source intentionally
xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+ union bpf_attr attr = {
+ .target_fd = target_fd,
+ .attach_bpf_fd = prog_fd,
+ .attach_type = type,
+ };
+
+ return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+ union bpf_attr attr = {
+ .target_fd = target_fd,
+ .attach_type = type,
+ };
+
+ return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
int bpf_obj_pin(int fd, const char *pathname)
{
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 364582b..f973241 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int insn_len,
const char *license, int kern_version);
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
int bpf_obj_pin(int fd, const char *pathname);
int bpf_obj_get(const char *pathname);
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 0000000..9ec18cc
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,136 @@
+/* eBPF example program:
+ *
+ * - Creates an arraymap in the kernel with 4-byte keys and 8-byte values
+ *
+ * - Loads eBPF program
+ *
+ * The eBPF program accesses the map passed in to store two pieces of
+ * information. The number of invocations of the program, which maps
+ * to the number of packets received, is stored to key 0. Key 1 is
+ * incremented on each iteration by the number of bytes stored in
+ * the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ * packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <string.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+
+#include "libbpf.h"
+
+enum {
+ MAP_KEY_PACKETS,
+ MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd)
+{
+ struct bpf_insn prog[] = {
+ BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* save r6 so it's not clobbered by BPF_CALL */
+
+ BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_PACKETS), /* r0 = 0 */
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+ BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* load map fd to r1 */
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+
+ BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+ BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
+ BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+ BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = 1 */
+ BPF_EXIT_INSN(),
+ };
+
+ return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
+ prog, sizeof(prog), "GPL", 0);
+}
+
+int main(int argc, char **argv)
+{
+ int cg_fd, map_fd, prog_fd, key, ret;
+ long long pkt_cnt, byte_cnt;
+
+ if (argc < 2) {
+ printf("Usage: %s <cg-path>\n", argv[0]);
+ return EXIT_FAILURE;
+ }
+
+ cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+ if (cg_fd < 0) {
+ printf("Failed to open cgroup path: '%s'\n",
+ strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY,
+ sizeof(key), sizeof(byte_cnt),
+ 256, 0);
+ if (map_fd < 0) {
+ printf("Failed to create map: '%s'\n", strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ prog_fd = prog_load(map_fd);
+ printf("Output from kernel verifier:\n%s\n-------\n", bpf_log_buf);
+
+ if (prog_fd < 0) {
+ printf("Failed to load prog: '%s'\n", strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ ret = bpf_prog_detach(cg_fd, BPF_ATTACH_TYPE_CGROUP_INGRESS);
+ printf("bpf_prog_detach() returned '%s' (%d)\n",
+ strerror(errno), errno);
+
+ ret = bpf_prog_attach(prog_fd, cg_fd, BPF_ATTACH_TYPE_CGROUP_INGRESS);
+ if (ret < 0) {
+ printf("Failed to attach prog to cgroup: '%s'\n",
+ strerror(errno));
+ return EXIT_FAILURE;
+ }
+
+ while (1) {
+ key = MAP_KEY_PACKETS;
+ assert(bpf_lookup_elem(map_fd, &key, &pkt_cnt) == 0);
+
+ key = MAP_KEY_BYTES;
+ assert(bpf_lookup_elem(map_fd, &key, &byte_cnt) == 0);
+
+ printf("cgroup received %lld packets, %lld bytes\n",
+ pkt_cnt, byte_cnt);
+ sleep(1);
+ }
+
+ return EXIT_SUCCESS;
+}
--
2.5.5
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers
2016-08-17 14:00 ` [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers Daniel Mack
@ 2016-08-17 14:10 ` Tejun Heo
2016-08-17 17:50 ` Alexei Starovoitov
1 sibling, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 14:10 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello,
On Wed, Aug 17, 2016 at 04:00:45PM +0200, Daniel Mack wrote:
> @@ -5461,6 +5462,14 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
> for_each_css(css, ssid, cgrp)
> kill_css(css);
>
> +#ifdef CONFIG_CGROUP_BPF
> + if (cgrp->bpf_ingress)
> + bpf_prog_put(cgrp->bpf_ingress);
> +
> + if (cgrp->bpf_egress)
> + bpf_prog_put(cgrp->bpf_egress);
> +#endif
This most likely isn't the right place, as there can still be sockets
associated with the cgroup and packets flowing. The cgroup release
path in css_release_work_fn() probably is the right place.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:00 ` [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
@ 2016-08-17 14:20 ` Tejun Heo
2016-08-17 14:35 ` Daniel Mack
2016-08-17 16:16 ` Eric Dumazet
1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 14:20 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello, Daniel.
On Wed, Aug 17, 2016 at 04:00:46PM +0200, Daniel Mack wrote:
> The current implementation walks the tree from the passed cgroup up
> to the root. If there is any program of the given type installed in
> any of the ancestors, the installation is rejected. This is because
> programs subject to restrictions should have no way of escaping if
> a higher-level cgroup has installed a program already. This restriction
> can be revisited at some later point in time.
This seems a bit too restrictive to me. Given that an entity which
can install a bpf program can remove any programs anyway, wouldn't it
make more sense to just execute the program on the closest ancestor to
the cgroup? That'd allow selectively overriding programs for specific
subhierarchies while applying generic policies to the rest.
> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
...
> + /* Reject installation of a program if any ancestor has one. */
> + for (pos = cgrp->self.parent; pos; pos = pos->parent) {
> + struct cgroup *parent;
> +
> + css_get(pos);
This refcnting pattern doesn't make sense. If @pos is already
accessible (which it is in this case), getting an extra ref before
accessing it doesn't do anything. This would be necessary iff the
pointer is to be used outside the scope which is protected by the ref
on @cgrp.
> + parent = container_of(pos, struct cgroup, self);
> +
> + if ((is_ingress && parent->bpf_ingress) ||
> + (!is_ingress && parent->bpf_egress))
> + err = -EEXIST;
> +
> + css_put(pos);
> + }
> +
> + if (err < 0) {
> + bpf_prog_put(prog);
> + return err;
> + }
> +
> + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> +
> + rcu_read_lock();
> + old_prog = rcu_dereference(*progp);
> + rcu_assign_pointer(*progp, prog);
Wouldn't it make sense to update the descendants to point to the
programs directly so that there's no need to traverse the tree on each
packet? It's more state to maintain but I think it would make total
sense given that it would reduce per-packet cost.
Otherwise, looks good to me.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
@ 2016-08-17 14:23 ` Tejun Heo
2016-08-17 14:36 ` Daniel Mack
2016-08-17 18:20 ` Alexei Starovoitov
2016-08-21 20:14 ` Sargun Dhillon
2 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 14:23 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> return -ENOMEM;
>
> +#ifdef CONFIG_CGROUP_BPF
> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> + if (err)
> + return err;
> +#endif
Hmm... where does egress hook into?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:20 ` Tejun Heo
@ 2016-08-17 14:35 ` Daniel Mack
2016-08-17 15:06 ` Tejun Heo
2016-08-17 15:08 ` Tejun Heo
0 siblings, 2 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:35 UTC (permalink / raw)
To: Tejun Heo; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hi Tejun,
On 08/17/2016 04:20 PM, Tejun Heo wrote:
> On Wed, Aug 17, 2016 at 04:00:46PM +0200, Daniel Mack wrote:
>> The current implementation walks the tree from the passed cgroup up
>> to the root. If there is any program of the given type installed in
>> any of the ancestors, the installation is rejected. This is because
>> programs subject to restrictions should have no way of escaping if
>> a higher-level cgroup has installed a program already. This restriction
>> can be revisited at some later point in time.
>
> This seems a bit too restrictive to me. Given that an entity which
> can install a bpf program can remove any programs anyway, wouldn't it
> make more sense to just execute the program on the closest ancestor to
> the cgroup? That'd allow selectively overriding programs for specific
> subhierarchies while applying generic policies to the rest.
The idea is to not walk the cgroup tree for each packet, to avoid the cost.
Which also means we don't have to be overly restrictive, yes.
>> +static int bpf_prog_attach(const union bpf_attr *attr)
>> +{
> ...
>> + /* Reject installation of a program if any ancestor has one. */
>> + for (pos = cgrp->self.parent; pos; pos = pos->parent) {
>> + struct cgroup *parent;
>> +
>> + css_get(pos);
>
> This refcnting pattern doesn't make sense. If @pos is already
> accessible (which it is in this case), getting an extra ref before
> accessing it doesn't do anything. This would be necessary iff the
> pointer is to be used outside the scope which is protected by the ref
> on @cgrp.
Right, that was a leftover from an older series. Will drop.
>> + parent = container_of(pos, struct cgroup, self);
>> +
>> + if ((is_ingress && parent->bpf_ingress) ||
>> + (!is_ingress && parent->bpf_egress))
>> + err = -EEXIST;
>> +
>> + css_put(pos);
>> + }
>> +
>> + if (err < 0) {
>> + bpf_prog_put(prog);
>> + return err;
>> + }
>> +
>> + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
>> +
>> + rcu_read_lock();
>> + old_prog = rcu_dereference(*progp);
>> + rcu_assign_pointer(*progp, prog);
>
> Wouldn't it make sense to update the descendants to point to the
> programs directly so that there's no need to traverse the tree on each
> packet? It's more state to maintain but I think it would make total
> sense given that it would reduce per-packet cost.
The idea is to only look at the actual v2 cgroup a task is in, and not
traverse in any way, to keep the per-packet cost at a minimum. That of
course means that in a deeper level of the hierarchy, the programs are
no longer executed and need to be reinstalled after a new cgroup is
created. To bring this more in line with how cgroups usually work, I
guess we should add code to copy over the bpf programs from the ancestor
once a new cgroup is instantiated.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:23 ` Tejun Heo
@ 2016-08-17 14:36 ` Daniel Mack
2016-08-17 14:58 ` Tejun Heo
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 14:36 UTC (permalink / raw)
To: Tejun Heo; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
On 08/17/2016 04:23 PM, Tejun Heo wrote:
> On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
>> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>> if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>> return -ENOMEM;
>>
>> +#ifdef CONFIG_CGROUP_BPF
>> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
>> + if (err)
>> + return err;
>> +#endif
>
> Hmm... where does egress hook into?
As stated in the cover letter, that's an open question on my list.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:36 ` Daniel Mack
@ 2016-08-17 14:58 ` Tejun Heo
0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 14:58 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 04:36:02PM +0200, Daniel Mack wrote:
> On 08/17/2016 04:23 PM, Tejun Heo wrote:
> > On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> >> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> >> if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >> return -ENOMEM;
> >>
> >> +#ifdef CONFIG_CGROUP_BPF
> >> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> >> + if (err)
> >> + return err;
> >> +#endif
> >
> > Hmm... where does egress hook into?
>
> As stated in the cover letter, that's an open question on my list.
lol sorry about that. :)
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:35 ` Daniel Mack
@ 2016-08-17 15:06 ` Tejun Heo
2016-08-17 15:51 ` Daniel Mack
2016-08-17 15:08 ` Tejun Heo
1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 15:06 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello,
On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> > Wouldn't it make sense to update the descendants to point to the
> > programs directly so that there's no need to traverse the tree on each
> > packet? It's more state to maintain but I think it would make total
> > sense given that it would reduce per-packet cost.
>
> The idea is to only look at the actual v2 cgroup a task is in, and not
> traverse in any way, to keep the per-packet cost at a minimum. That of
> course means that in a deeper level of the hierarchy, the programs are
> no longer executed and need to be reinstalled after a new cgroup is
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.
So, yeah, just executing the program from the exact cgroup wouldn't
work because it'd be easy for a delegatee to escape enforced policies
by simply creating subcgroups.
What I'd suggest is keeping two sets of pointers, one set for the
programs explicitly attached to the cgroup and the other for programs
which are to be executed for the cgroup. The two pointers in the
latter set will point to the closest non-null program in its ancestry.
Note that the pointers may point to programs belonging to different
ancestors.
A - B - C
\ D - E
Let's say B has both ingress and egress programs and D only has
egress. E's execution pointers should point to D's egress program and
B's ingress program.
The pointers would need to be updated
1. When a program is installed or removed. For simplicity's sake, we
can just walk the whole hierarchy. If that's too expensive
(shouldn't be), we can find out the closest ancestor where both
programs can be determined and then walk from there.
2. When a new cgroup is created, it should inherit its execution table
from its parent.
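In struct cgroup terms, that could look roughly like this (sketch only;
all field names are made up):
    #ifdef CONFIG_CGROUP_BPF
        /* programs explicitly attached to this cgroup */
        struct bpf_prog *bpf_ingress, *bpf_egress;
        /* closest non-NULL programs in the ancestry, including self */
        struct bpf_prog __rcu *bpf_effective_ingress;
        struct bpf_prog __rcu *bpf_effective_egress;
    #endif
In the example above, E's bpf_effective_egress would point to D's
program and its bpf_effective_ingress to B's.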
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:35 ` Daniel Mack
2016-08-17 15:06 ` Tejun Heo
@ 2016-08-17 15:08 ` Tejun Heo
1 sibling, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 15:08 UTC (permalink / raw)
To: Daniel Mack; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello, again.
Just one addition.
On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.
Please don't copy the actual configuration or program. Every case
where a cgroup controller took this approach turned out pretty badly.
It gets really confusing.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 15:06 ` Tejun Heo
@ 2016-08-17 15:51 ` Daniel Mack
2016-08-17 17:48 ` Alexei Starovoitov
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Mack @ 2016-08-17 15:51 UTC (permalink / raw)
To: Tejun Heo; +Cc: daniel, ast, davem, kafai, fw, pablo, harald, netdev
On 08/17/2016 05:06 PM, Tejun Heo wrote:
> What I'd suggest is keeping two sets of pointers, one set for the
> programs explicitly attached to the cgroup and the other for programs
> which are to be executed for the cgroup. The two pointers in the
> latter set will point to the closest non-null program in its ancestry.
> Note that the pointers may point to programs belonging to different
> ancestors.
>
> A - B - C
> \ D - E
>
> Let's say B has both ingress and egress programs and D only has
> egress. E's execution pointers should point to D's egress program and
> B's ingress program.
>
> The pointers would need to be updated
>
> 1. When a program is installed or removed. For simplicity's sake, we
> can just walk the whole hierarchy. If that's too expensive
> (shouldn't be), we can find out the closest ancestor where both
> programs can be determined and then walk from there.
>
> 2. When a new cgroup is created, it should inherit its execution table
> from its parent.
Ok, yes, that sounds sane to me. Will implement that scheme.
Thanks!
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 14:00 ` [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
2016-08-17 14:20 ` Tejun Heo
@ 2016-08-17 16:16 ` Eric Dumazet
2016-08-17 18:10 ` Alexei Starovoitov
1 sibling, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2016-08-17 16:16 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:
> + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> +
> + rcu_read_lock();
> + old_prog = rcu_dereference(*progp);
> + rcu_assign_pointer(*progp, prog);
> +
> + if (old_prog)
> + bpf_prog_put(old_prog);
> +
> + rcu_read_unlock();
This is a bogus locking strategy.
You do not want to use rcu_read_lock()/rcu_read_unlock() here, but
appropriate writer exclusion (a mutex probably, or a spinlock).
Then use rcu_dereference_protected() instead of rcu_dereference(*progp);
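I.e. something along these lines (untested sketch, assuming a global
mutex serializing attach/detach; the choice of lock is up to you):
    static DEFINE_MUTEX(cgroup_bpf_mutex);
    ...
    mutex_lock(&cgroup_bpf_mutex);
    old_prog = rcu_dereference_protected(*progp,
                    lockdep_is_held(&cgroup_bpf_mutex));
    rcu_assign_pointer(*progp, prog);
    mutex_unlock(&cgroup_bpf_mutex);
    if (old_prog)
        bpf_prog_put(old_prog);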
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 15:51 ` Daniel Mack
@ 2016-08-17 17:48 ` Alexei Starovoitov
0 siblings, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-17 17:48 UTC (permalink / raw)
To: Daniel Mack
Cc: Tejun Heo, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 05:51:44PM +0200, Daniel Mack wrote:
> On 08/17/2016 05:06 PM, Tejun Heo wrote:
> > What I'd suggest is keeping two sets of pointers, one set for the
> > programs explicitly attached to the cgroup and the other for programs
> > which are to be executed for the cgroup. The two pointers in the
> > latter set will point to the closest non-null program in its ancestry.
> > Note that the pointers may point to programs belonging to different
> > ancestors.
> >
> > A - B - C
> > \ D - E
> >
> > Let's say B has both ingress and egress programs and D only has
> > egress. E's execution pointers should point to D's egress program and
> > B's ingress program.
> >
> > The pointers would need to be updated
> >
> > 1. When a program is installed or removed. For simplicity's sake, we
> > can just walk the whole hierarchy. If that's too expensive
> > (shouldn't be), we can find out the closest ancestor where both
> > programs can be determined and then walk from there.
> >
> > 2. When a new cgroup is created, it should inherit its execution table
> > from its parent.
>
> Ok, yes, that sounds sane to me. Will implement that scheme.
That makes sense to me as well. Sounds quite flexible and fast
with only one check per packet per direction.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers
2016-08-17 14:00 ` [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers Daniel Mack
2016-08-17 14:10 ` Tejun Heo
@ 2016-08-17 17:50 ` Alexei Starovoitov
2016-08-17 17:56 ` Tejun Heo
1 sibling, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-17 17:50 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 04:00:45PM +0200, Daniel Mack wrote:
> Add two pointers for eBPF programs to struct cgroup. These will be used
> to store programs for ingress and egress for accounting and filtering.
>
> This new feature is guarded by CONFIG_CGROUP_BPF.
...
> +#ifdef CONFIG_CGROUP_BPF
> + /* used by the networking layer */
> + struct bpf_prog *bpf_ingress;
> + struct bpf_prog *bpf_egress;
> +#endif
...
> +config CGROUP_BPF
> + bool "Enable eBPF programs in cgroups"
> + depends on BPF_SYSCALL
> + help
> + This option allows cgroups to accommodate eBPF programs that
> + can be used for network traffic filtering and accounting. See
> + Documentation/networking/filter.txt for more information.
> +
I think this extra config is unnecessary. It makes the code harder to follow.
Anyone turning on bpf syscall and cgroups should be able to have this feature.
Extra config is imo overkill.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers
2016-08-17 17:50 ` Alexei Starovoitov
@ 2016-08-17 17:56 ` Tejun Heo
0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-08-17 17:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Daniel Mack, daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello,
On Wed, Aug 17, 2016 at 10:50:40AM -0700, Alexei Starovoitov wrote:
> > +config CGROUP_BPF
> > + bool "Enable eBPF programs in cgroups"
> > + depends on BPF_SYSCALL
> > + help
> > + This option allows cgroups to accommodate eBPF programs that
> > + can be used for network traffic filtering and accounting. See
> > + Documentation/networking/filter.txt for more information.
> > +
>
> I think this extra config is unnecessary. It makes the code harder to follow.
> Anyone turning on bpf syscall and cgroups should be able to have this feature.
> Extra config is imo overkill.
Agreed, the added code is pretty small, especially in comparison to
both cgroup and bpf, and the only memory overhead would be four
pointers per cgroup, which shouldn't be noticeable at all.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 16:16 ` Eric Dumazet
@ 2016-08-17 18:10 ` Alexei Starovoitov
2016-08-18 15:17 ` Daniel Mack
0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-17 18:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, pablo, harald,
netdev
On Wed, Aug 17, 2016 at 09:16:02AM -0700, Eric Dumazet wrote:
> On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:
>
> > + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> > +
> > + rcu_read_lock();
> > + old_prog = rcu_dereference(*progp);
> > + rcu_assign_pointer(*progp, prog);
> > +
> > + if (old_prog)
> > + bpf_prog_put(old_prog);
> > +
> > + rcu_read_unlock();
>
>
> This is a bogus locking strategy.
yep. this rcu_lock/unlock won't solve the race between parallel
bpf_prog_attach calls.
please use xchg(), similar to bpf_fd_array_map_update_elem().
Then the prog pointer will be swapped atomically and bpf_prog_put()
will free it via call_rcu.
The reader side in sk_filter_cgroup_bpf() looks correct.
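I.e. roughly (untested sketch):
    old_prog = xchg(progp, prog);
    if (old_prog)
        bpf_prog_put(old_prog); /* freed via call_rcu once last ref is gone */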
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
2016-08-17 14:23 ` Tejun Heo
@ 2016-08-17 18:20 ` Alexei Starovoitov
2016-08-17 18:23 ` Alexei Starovoitov
2016-08-21 20:14 ` Sargun Dhillon
2 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-17 18:20 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> receiving socket has an eBPF program installed, run it from
> sk_filter_trim_cap().
>
> eBPF programs used in this context are expected to either return 1 to
> let the packet pass, or != 1 to drop it. The programs have access to
> the full skb, including the MAC headers.
>
> This patch only implements the call site for ingress packets.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
> net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c5d8332..a1dd94b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -52,6 +52,44 @@
> #include <net/dst.h>
> #include <net/sock_reuseport.h>
>
> +#ifdef CONFIG_CGROUP_BPF
> +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> + enum bpf_attach_type type)
> +{
> + struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> + struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> + struct bpf_prog *prog;
> + int ret = 0;
> +
> + rcu_read_lock();
> +
> + switch (type) {
> + case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> + prog = rcu_dereference(cgrp->bpf_egress);
> + break;
> + case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> + prog = rcu_dereference(cgrp->bpf_ingress);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + ret = -EINVAL;
> + break;
> + }
> +
> + if (prog) {
I really like how in this version of the patches it became
a single load+cmp of per-packet cost when this feature is off.
Please move
+ struct cgroup *cgrp = sock_cgroup_ptr(skcd);
into if (prog) {..}
to make sure it's actually a single load.
The compiler cannot avoid that load when it's placed at the top.
> + unsigned int offset = skb->data - skb_mac_header(skb);
> +
> + __skb_push(skb, offset);
> + ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
that doesn't match the commit log. The above '> 0' makes sense to me though.
If we want to do it for 1 only, we have to define it in uapi/bpf.h
as an action code, so we can extend it to 2, 3 in the future if necessary.
It also has to be bpf_prog_run_save_cb() (as sk_filter_trim_cap() does)
instead of bpf_prog_run_clear_cb().
See commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs")
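I.e. something like this (untested sketch; the enum name and values are
made up):
    /* uapi/linux/bpf.h: verdicts returned by cgroup filter programs */
    enum {
        BPF_CGROUP_FILTER_DROP = 0,
        BPF_CGROUP_FILTER_PASS = 1,
    };
    /* in sk_filter_cgroup_bpf(), keeping cb[] intact for the caller */
    ret = bpf_prog_run_save_cb(prog, skb) == BPF_CGROUP_FILTER_PASS ?
          0 : -EPERM;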
> + __skb_pull(skb, offset);
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +#endif /* !CONFIG_CGROUP_BPF */
> +
> /**
> * sk_filter_trim_cap - run a packet through a socket filter
> * @sk: sock associated with &sk_buff
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> return -ENOMEM;
>
> +#ifdef CONFIG_CGROUP_BPF
> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> + if (err)
> + return err;
> +#endif
> +
> err = security_sock_rcv_skb(sk, skb);
> if (err)
> return err;
> --
> 2.5.5
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 18:20 ` Alexei Starovoitov
@ 2016-08-17 18:23 ` Alexei Starovoitov
0 siblings, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-17 18:23 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 11:20:29AM -0700, Alexei Starovoitov wrote:
> On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> > If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> > receiving socket has an eBPF program installed, run it from
> > sk_filter_trim_cap().
> >
> > eBPF programs used in this context are expected to either return 1 to
> > let the packet pass, or != 1 to drop it. The programs have access to
> > the full skb, including the MAC headers.
> >
> > This patch only implements the call site for ingress packets.
> >
> > Signed-off-by: Daniel Mack <daniel@zonque.org>
> > ---
> > net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index c5d8332..a1dd94b 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -52,6 +52,44 @@
> > #include <net/dst.h>
> > #include <net/sock_reuseport.h>
> >
> > +#ifdef CONFIG_CGROUP_BPF
> > +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> > + enum bpf_attach_type type)
> > +{
> > + struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> > + struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> > + struct bpf_prog *prog;
> > + int ret = 0;
> > +
> > + rcu_read_lock();
> > +
> > + switch (type) {
> > + case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> > + prog = rcu_dereference(cgrp->bpf_egress);
> > + break;
> > + case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> > + prog = rcu_dereference(cgrp->bpf_ingress);
> > + break;
> > + default:
> > + WARN_ON_ONCE(1);
> > + ret = -EINVAL;
> > + break;
> > + }
> > +
> > + if (prog) {
>
> I really like how in this version of the patches it became
> a single load+cmp of per-packet cost when this feature is off.
> Please move
> + struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> into if (prog) {..}
> to make sure it's actually a single load.
> The compiler cannot avoid that load when it's placed at the top.
sorry. brain fart. it is two loads. scratch that.
> > + unsigned int offset = skb->data - skb_mac_header(skb);
> > +
> > + __skb_push(skb, offset);
> > + ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
>
> that doesn't match the commit log. The above '> 0' makes sense to me though.
> If we want to do it for 1 only, we have to define it in uapi/bpf.h
> as an action code, so we can extend it to 2, 3 in the future if necessary.
>
> It also has to be bpf_prog_run_save_cb() (as sk_filter_trim_cap() does)
> instead of bpf_prog_run_clear_cb().
> See commit ff936a04e5f2 ("bpf: fix cb access in socket filter programs")
>
> > + __skb_pull(skb, offset);
> > + }
> > +
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +#endif /* !CONFIG_CGROUP_BPF */
> > +
> > /**
> > * sk_filter_trim_cap - run a packet through a socket filter
> > * @sk: sock associated with &sk_buff
> > @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> > if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> > return -ENOMEM;
> >
> > +#ifdef CONFIG_CGROUP_BPF
> > + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> > + if (err)
> > + return err;
> > +#endif
> > +
> > err = security_sock_rcv_skb(sk, skb);
> > if (err)
> > return err;
> > --
> > 2.5.5
> >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
2016-08-17 18:10 ` Alexei Starovoitov
@ 2016-08-18 15:17 ` Daniel Mack
0 siblings, 0 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-18 15:17 UTC (permalink / raw)
To: Alexei Starovoitov, Eric Dumazet
Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On 08/17/2016 08:10 PM, Alexei Starovoitov wrote:
> On Wed, Aug 17, 2016 at 09:16:02AM -0700, Eric Dumazet wrote:
>> On Wed, 2016-08-17 at 16:00 +0200, Daniel Mack wrote:
>>
>>> + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
>>> +
>>> + rcu_read_lock();
>>> + old_prog = rcu_dereference(*progp);
>>> + rcu_assign_pointer(*progp, prog);
>>> +
>>> + if (old_prog)
>>> + bpf_prog_put(old_prog);
>>> +
>>> + rcu_read_unlock();
>>
>>
>> This is a bogus locking strategy.
>
> yep. this rcu_lock/unlock won't solve the race between parallel
> bpf_prog_attach calls.
> please use xchg(), similar to bpf_fd_array_map_update_elem().
> Then the prog pointer will be swapped atomically and bpf_prog_put()
> will free it via call_rcu.
> The reader side in sk_filter_cgroup_bpf() looks correct.
Thanks! I reworked all the bits I got comments on, and fixed some other
details as well. I'll wait a few days to see what else shakes out of this
thread, and then post again.
FWIW, the current code can be found here:
https://github.com/zonque/linux/commits/cg-bpf-syscall
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
` (4 preceding siblings ...)
2016-08-17 14:00 ` [RFC PATCH 5/5] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
@ 2016-08-19 9:19 ` Pablo Neira Ayuso
2016-08-19 10:35 ` Daniel Mack
2016-08-19 16:01 ` Alexei Starovoitov
5 siblings, 2 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2016-08-19 9:19 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev
Hi Daniel,
On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> I'd appreciate some feedback on this. Pablo has some remaining concerns
> about this approach, and I'd like to continue the discussion we had
> off-list in the light of this patchset.
OK, I'm going to summarize them below:
* This new hook allows us to enforce an *administrative filtering
policy* that must be visible to anyone with CAP_NET_ADMIN. This is
easy to display in nf_tables as you can list the ruleset via the nft
userspace tool. Otherwise, in your approach if a misconfigured
filtering policy causes connectivity problems, I don't see how the
sysadmin is going to have an easy way to troubleshoot what is going on.
* Interaction with other software. As I could read from your patch,
what you propose will detach any previously existing filter. So I
don't see how you can attach multiple filtering policies from
different processes that don't cooperate with each other. In nf_tables
this is easy since they can create their own tables so they keep their
ruleset in separate spaces. If the interaction is not OK, again the
sysadmin can very quickly debug this since the policies would be
visible via nf_tables ruleset listing.
* During the Netfilter Workshop, the main concern about adding this new
socket ingress hook was that it is too specific. However, this new hook
in the network stack looks way more specific, since *it only
works for cgroups*.
So what I'm proposing goes in the direction of using the nf_tables
infrastructure instead:
* Add a new socket family for nf_tables with an input hook at
sk_filter(). This just requires the new netfilter hook there and
the boiler plate code to allow creating tables for this new family.
And then we get access to many of the existing features in
nf_tables for free.
* We can quickly find a verdict on the packet using any combination
of selectors through concatenations and maps in nf_tables. In
nf_tables we can express the policy with a non-linear ruleset. On
top of this, by delaying the nf_reset() calls we can reach the
conntrack information from sk_filter(). That would be useful to skip
evaluating packets that belong to already established flows. Thus, we
incur the performance penalty in classifying only for the first
packet of the flow.
* We can skip the socket egress hook (that you don't know where to place
yet) since you can use the existing local output hook in netfilter that
is available for IPv4 and IPv6.
* This new hook would fit into the existing netfilter set of hooks,
the sysadmin is already familiar with the administrative
infrastructure to define filtering policies in our stack, so adding this
new hook to what we have looks natural to me.
Thanks for your patience on debating this!
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 9:19 ` [RFC PATCH 0/5] Add eBPF hooks for cgroups Pablo Neira Ayuso
@ 2016-08-19 10:35 ` Daniel Mack
2016-08-19 11:20 ` Daniel Borkmann
2016-08-19 16:21 ` Pablo Neira Ayuso
2016-08-19 16:01 ` Alexei Starovoitov
1 sibling, 2 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-19 10:35 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev
Hi Pablo,
On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
>> I'd appreciate some feedback on this. Pablo has some remaining concerns
>> about this approach, and I'd like to continue the discussion we had
>> off-list in the light of this patchset.
>
> OK, I'm going to summarize them here below:
>
> * This new hook
"This" refers to your alternative to my patch set, right?
> allows us to enforce an *administrative filtering
> policy* that must be visible to anyone with CAP_NET_ADMIN. This is
> easy to display in nf_tables as you can list the ruleset via the nft
> userspace tool. Otherwise, in your approach if a misconfigured
> filtering policy causes connectivity problems, I don't see how the
> sysadmin is going to have an easy way to troubleshoot what is going on.
True. That's the downside of bpf.
> * Interaction with other software. As I could read from your patch,
> what you propose will detach any previously existing filter. So I
> don't see how you can attach multiple filtering policies from
> different processes that don't cooperate with each other.
Also true. A cgroup can currently only hold one bpf program for each
direction, and they are supposed to be set from one controlling instance
in the system. However, it is possible to create subcgroups and install
their own programs in them, which will then be effective instead of the one in
the parent. They will, however, replace each other in runtime behavior,
and not be stacked. This is a fundamentally different approach than how
nf_tables works of course.
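As a rough illustration of these semantics, attaching a program from
userspace could look like this (the union bpf_attr field names below
are assumptions made for the sake of the example; only the
BPF_PROG_ATTACH command, the program fd and the cgroup fd are given by
this series):

#include <fcntl.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <sys/syscall.h>

static int prog_attach(int prog_fd, int cgroup_fd, int attach_type)
{
	union bpf_attr attr = {
		.target_fd	= cgroup_fd,	/* assumed field name */
		.attach_bpf_fd	= prog_fd,	/* assumed field name */
		.attach_type	= attach_type,	/* assumed field name */
	};

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}

int main(void)
{
	int prog_fd = -1; /* assume a program loaded via BPF_PROG_LOAD */
	int cg_fd = open("/sys/fs/cgroup/unified/mygroup",
			 O_RDONLY | O_DIRECTORY);

	/* Attaching replaces any program previously attached to this
	 * cgroup for the same direction; attaching in a subcgroup
	 * overrides the parent's program instead of stacking on it. */
	return prog_attach(prog_fd, cg_fd, BPF_ATTACH_TYPE_CGROUP_INGRESS);
}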
> In nf_tables
> this is easy since they can create their own tables so they keep their
> ruleset in separate spaces. If the interaction is not OK, again the
> sysadmin can very quickly debug this since the policies would be
> visible via nf_tables ruleset listing.
True. Debugging would be much easier that way.
> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:
>
> * Add a new socket family for nf_tables with an input hook at
> sk_filter(). This just requires the new netfilter hook there and
> the boilerplate code to allow creating tables for this new family.
> And then we get access to many of the existing features in
> nf_tables for free.
Yes. However, when I proposed more or less exactly that back in
September last year ("NF_INET_LOCAL_SOCKET_IN"), the concern raised by
you and Florian Westphal was that this type of decision making is out of
scope for netfilter, mostly because
a) whether a userspace process is running should not have any influence
on the netfilter behavior (which it does, because the rules are not
processed when the local socket cannot be determined)
b) it is asymmetric, as it only exists for the input path
c) it's a change in the netfilter paradigm, because rules for multicast
receivers are run multiple times (once for each receiving task)
d) it was considered a sledgehammer solution for something that very
few people really need
I still think such a hook would be a good thing to have. As far as
implementation goes, my patch set back then patched each of the
protocols individually (ipv4, ipv6, dccp, sctp), while your idea to hook
into sk_filter sounds much more reasonable.
If the opinions on the previously raised concerns have changed, I'm
happy to revisit.
> * We can quickly find a verdict on the packet using any combination
> of selectors through concatenations and maps in nf_tables. In
> nf_tables we can express the policy with a non-linear ruleset.
That's another interesting detail that was discussed at NFWS, yes. We
need a way to dispatch incoming packets without walking a linear
dispatcher list. In the eBPF approach, that's very easy because the
cgroup is directly associated with the receiving socket, so the lookup
of the effective eBPF programs is really fast.
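(The fast path referred to here, condensed from the code in patch 4 of
this series:)

/* per-packet lookup of the effective ingress program; the cgroup
 * pointer comes straight from the receiving socket */
struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
struct bpf_prog *prog = rcu_dereference(cgrp->bpf_ingress);

if (prog)
	ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;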
If we can achieve similar things with nf_tables and maps, then that
should be applicable as well.
> On
> top of this, by delaying the nf_reset() calls we can reach the
> conntrack information from sk_filter(). That would be useful to skip
> evaluating packets that belong to already established flows. Thus, we
> incur the performance penalty in classifying only for the first
> packet of the flow.
If that's possible, that's an interesting feature, but at least for
accounting, we need to run the rules for all packets, always.
> * We can skip the socket egress hook (that you don't know where to place
> yet) since you can use the existing local output hook in netfilter that
> is available for IPv4 and IPv6.
If asymmetry is not a no-go anymore, that sounds fine to me.
> * This new hook would fit into the existing netfilter set of hooks,
> the sysadmin is already familiar with the administrative
> infrastructure to define filtering policies in our stack, so adding this
> new hook to what we have looks natural to me.
At least for inspecting the rules, this is certainly a benefit. On the
other hand, it's always been a pain to handle competing entities in the
system that both alter netfilter configurations, as ownership of rules
is suddenly not clear anymore.
Another concern I have with cgroup matching in netfilter (at least as
enforced by cgroup v2) is that every such rule has to carry a
char[PATH_MAX] struct member, and the matching is done via that path
string. I guess we need to come up with some solution in that area
that's less expensive here, but that could be solved separately.
So - I don't know. The whole 'eBPF in cgroups' idea was born because
through the discussions over the past months we had on all this, it
became clear to me that netfilter is not the right place for filtering
on local tasks. I agree the solution I am proposing in my patch set has
its downsides, mostly when it comes to transparency to users, but I
considered that acceptable. After all, we have eBPF users all over the
place in the kernel already, and seccomp, for instance, isn't any better
in that regard.
That said, if there is a better solution for the problem, I can as well
ditch my patches. It's ultimately your call anyway I guess :) Do you
have any plans on working on this new netfilter hook or do you want me
to have a look?
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 10:35 ` Daniel Mack
@ 2016-08-19 11:20 ` Daniel Borkmann
2016-08-19 16:31 ` Pablo Neira Ayuso
2016-08-19 16:21 ` Pablo Neira Ayuso
1 sibling, 1 reply; 39+ messages in thread
From: Daniel Borkmann @ 2016-08-19 11:20 UTC (permalink / raw)
To: Daniel Mack, Pablo Neira Ayuso
Cc: htejun, ast, davem, kafai, fw, harald, netdev
On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
[...]
> * During the Netfilter Workshop, the main concern about adding this new socket
Don't really know what was discussed exactly at NFWS, but ...
> ingress hook was that it is too specific. However, this new hook in
> the network stack looks way more specific since *it only
> works for cgroups*.
... why would that be something so overly specific? I don't think that "it
only works for cgroups" would be a narrow use case. While the current
sk_filter() BPF policies are set from application level, it makes sense to
me to have an option for an entity that manages the cgroups to apply an
external policy on the networking side for participating processes as well.
It seems like a useful extension to the current sk_filter() infrastructure
iff we carve out the details properly and generic enough, and besides ...
[...]
On 08/19/2016 12:35 PM, Daniel Mack wrote:
[...]
> So - I don't know. The whole 'eBPF in cgroups' idea was born because
> through the discussions over the past months we had on all this, it
> became clear to me that netfilter is not the right place for filtering
> on local tasks. I agree the solution I am proposing in my patch set has
> its downsides, mostly when it comes to transparency to users, but I
> considered that acceptable. After all, we have eBPF users all over the
> place in the kernel already, and seccomp, for instance, isn't any better
> in that regard.
... since you mention seccomp here as well, it would be another good fit
as a program subtype to apply syscall policies for those participants on
a cgroup level too, f.e. to disallow certain syscalls. It would be quite
similar conceptually. So, fwiw, if this is being designed generic enough,
the use cases would go much broader than that.
Cheers,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 9:19 ` [RFC PATCH 0/5] Add eBPF hooks for cgroups Pablo Neira Ayuso
2016-08-19 10:35 ` Daniel Mack
@ 2016-08-19 16:01 ` Alexei Starovoitov
1 sibling, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2016-08-19 16:01 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
netdev
On Fri, Aug 19, 2016 at 11:19:41AM +0200, Pablo Neira Ayuso wrote:
> Hi Daniel,
>
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> > I'd appreciate some feedback on this. Pablo has some remaining concerns
> > about this approach, and I'd like to continue the discussion we had
> > off-list in the light of this patchset.
>
> OK, I'm going to summarize them here below:
>
> * This new hook allows us to enforce an *administrative filtering
> policy* that must be visible to anyone with CAP_NET_ADMIN. This is
> easy to display in nf_tables as you can list the ruleset via the nft
> userspace tool. Otherwise, in your approach if a misconfigured
> filtering policy causes connectivity problems, I don't see how the
> sysadmin is going to have an easy way to troubleshoot what is going on.
>
> * Interaction with other software. As I could read from your patch,
> what you propose will detach any previously existing filter. So I
> don't see how you can attach multiple filtering policies from
> different processes that don't cooperate with each other. In nf_tables
> this is easy since they can create their own tables so they keep their
> ruleset in separate spaces. If the interaction is not OK, again the
> sysadmin can very quickly debug this since the policies would be
> visible via nf_tables ruleset listing.
>
> * During the Netfilter Workshop, the main concern about adding this new socket
> ingress hook was that it is too specific. However, this new hook in
> the network stack looks way more specific since *it only
> works for cgroups*.
>
> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:
Pablo, if you were proposing to do cgroups+nft as well as cgroups+bpf,
we could have had a much more productive discussion.
You were not participating in the cgroup+bpf design and are now bringing up
bogus points that make no sense to me. That's not helpful.
Please start another cgroups+nft thread and there we can discuss the
ways to do it cleanly without slowing down the stack.
netfilter hooks bloat the stack enough that some people compile them out.
If I were you, I'd focus on improving iptables/nft performance instead
of arguing about their coolness.
> Thanks for your patience on debating this!
I don't think you're sincere.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 10:35 ` Daniel Mack
2016-08-19 11:20 ` Daniel Borkmann
@ 2016-08-19 16:21 ` Pablo Neira Ayuso
2016-08-19 17:07 ` Thomas Graf
1 sibling, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2016-08-19 16:21 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev
Hi Daniel,
On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> Hi Pablo,
>
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> > On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> >> I'd appreciate some feedback on this. Pablo has some remaining concerns
> >> about this approach, and I'd like to continue the discussion we had
> >> off-list in the light of this patchset.
> >
> > OK, I'm going to summarize them here below:
> >
> > * This new hook
>
> "This" refers to your alternative to my patch set, right?
>
> > allows us to enforce an *administrative filtering
> > policy* that must be visible to anyone with CAP_NET_ADMIN. This is
> > easy to display in nf_tables as you can list the ruleset via the nft
> > userspace tool. Otherwise, in your approach if a misconfigured
> > filtering policy causes connectivity problems, I don't see how the
> > sysadmin is going to have an easy way to troubleshoot what is going on.
>
> True. That's the downside of bpf.
>
> > * Interaction with other software. As I could read from your patch,
> > what you propose will detach any previously existing filter. So I
> > don't see how you can attach multiple filtering policies from
> > different processes that don't cooperate with each other.
>
> Also true. A cgroup can currently only hold one bpf program for each
> direction, and they are supposed to be set from one controlling instance
> in the system. However, it is possible to create subcgroups and install
> their own programs in them, which will then be effective instead of the one in
> the parent. They will, however, replace each other in runtime behavior,
> and not be stacked. This is a fundamentally different approach than how
> nf_tables works of course.
I see, this looks problematic indeed, thanks for confirming this.
> > In nf_tables
> > this is easy since they can create their own tables so they keep their
> > ruleset in separate spaces. If the interaction is not OK, again the
> > sysadmin can very quickly debug this since the policies would be
> > visible via nf_tables ruleset listing.
>
> True. Debugging would be much easier that way.
>
> > So what I'm proposing goes in the direction of using the nf_tables
> > infrastructure instead:
> >
> > * Add a new socket family for nf_tables with an input hook at
> > sk_filter(). This just requires the new netfilter hook there and
> > the boilerplate code to allow creating tables for this new family.
> > And then we get access to many of the existing features in
> > nf_tables for free.
>
> Yes. However, when I proposed more or less exactly that back in
> September last year ("NF_INET_LOCAL_SOCKET_IN"), the concern raised by
> you and Florian Westphal was that this type of decision making is out of
> scope for netfilter, mostly because
>
> a) whether a userspace process is running should not have any influence
> on the netfilter behavior (which it does, because the rules are not
> processed when the local socket cannot be determined)
We always have a socket at sk_filter(). This new socket family retains
these specific semantics.
> b) it is asymmetric, as it only exists for the input path
Unless strictly necessary, I would not add another earlier socket
egress hook if the ipv4 and ipv6 LOCAL_OUT hooks are enough for this.
> c) it's a change in the netfilter paradigm, because rules for multicast
> receivers are run multiple times (once for each receiving task)
This is part of the semantics of this new socket family, so the user
is aware of this particular expensive multicast behaviour in this new
family.
> d) it was considered a sledgehammer solution for something that very
> few people really need
I don't see why the current RFC is less heavyweight. I guess this was
true when the "hooks spread all over the transport layer" approach was
discussed, but not for the single sk_filter() hook we're discussing.
> I still think such a hook would be a good thing to have. As far as
> implementation goes, my patch set back then patched each of the
> protocols individually (ipv4, ipv6, dccp, sctp), while your idea to hook
> into sk_filter sounds much more reasonable.
That's it. Instead of having hooks spread all over the place per
protocol, the idea is to add a new socket family with a hook like
NF_SOCKET_INGRESS at sk_filter(). This new family encapsulates the
new specific socket semantics, which differ from the ip, ip6 and inet
family semantics.
> If the opinions on the previously raised concerns have changed, I'm
> happy to revisit.
>
> > * We can quickly find a verdict on the packet using any combination
> > of selectors through concatenations and maps in nf_tables. In
> > nf_tables we can express the policy with a non-linear ruleset.
>
> That's another interesting detail that was discussed at NFWS, yes. We
> need a way to dispatch incoming packets without walking a linear
> dispatcher list. In the eBPF approach, that's very easy because the
> cgroup is directly associated with the receiving socket, so the lookup
> of the effective eBPF programs is really fast.
>
> If we can achieve similar things with nf_tables and maps, then that
> should be applicable as well.
Yes, we can indeed.
> > On top of this, by delaying the nf_reset() calls we can reach the
> > conntrack information from sk_filter(). That would be useful to skip
> > evaluating packets that belong to already established flows. Thus, we
> > incur the performance penalty in classifying only for the first
> > packet of the flow.
>
> If that's possible, that's an interesting feature, but at least for
> accounting, we need to run the rules for all packets, always.
We can build flow tables that look like:
nft add rule x y flow table whatever-name { meta skuid "apache" counter }
Then, we can list its content using @whatever-name as a handle. The
table is dynamically populated. This accounts for all incoming traffic
for apache. We can actually use any concatenation of packet selectors
there.
This flow table thing uses the generic set infrastructure, and more
specifically the rhashtable implementation, so entry lookups scale
well.
> > * We can skip the socket egress hook (that you don't know where to place
> > yet) since you can use the existing local output hook in netfilter that
> > is available for IPv4 and IPv6.
>
> If asymmetry is not a no-go anymore, that sounds fine to me.
Unless strictly necessary, I would not add a new hook if we can make
it with the ip/ip6 local out hook.
> > * This new hook would fit into the existing netfilter set of hooks,
> > the sysadmin is already familiar with the administrative
> > infrastructure to define filtering policies in our stack, so adding this
> > new hook to what we have looks natural to me.
>
> At least for inspecting the rules, this is certainly a benefit. On the
> other hand, it's always been a pain to handle competing entities in the
> system that both alter netfilter configurations, as ownership of rules
> is suddenly not clear anymore.
>
> Another concern I have with cgroup matching in netfilter (at least as
> enforced by cgroup v2) is that every such rule has to carry a
> char[PATH_MAX] struct member, and the matching is done via that path
> string. I guess we need to come up with some solution in that area
> that's less expensive here, but that could be solved separately.
Good.
> So - I don't know. The whole 'eBPF in cgroups' idea was born because
> through the discussions over the past months we had on all this, it
> became clear to me that netfilter is not the right place for filtering
> on local tasks. I agree the solution I am proposing in my patch set has
> its downsides, mostly when it comes to transparency to users, but I
> considered that acceptable. After all, we have eBPF users all over the
> place in the kernel already, and seccomp, for instance, isn't any better
> in that regard.
But seccomp, like socket filters, is requested by the process itself,
so we are not attaching a global policy; instead, this is a per-process
policy. Since the process has attached this filter itself via
setsockopt(), it is aware of what it has done. As I said, this is
different when applied globally.
> That said, if there is a better solution for the problem, I can as well
> ditch my patches. It's ultimately your call anyway I guess :) Do you
> have any plans on working on this new netfilter hook or do you want me
> to have a look?
I would look into this, yes. Thanks.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 11:20 ` Daniel Borkmann
@ 2016-08-19 16:31 ` Pablo Neira Ayuso
2016-08-19 16:37 ` Thomas Graf
0 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2016-08-19 16:31 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Daniel Mack, htejun, ast, davem, kafai, fw, harald, netdev
On Fri, Aug 19, 2016 at 01:20:25PM +0200, Daniel Borkmann wrote:
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> [...]
> > * During the Netfilter Workshop, the main concern about adding this new socket
>
> Don't really know what was discussed exactly at NFWS, but ...
Slides are available here:
http://workshop.netfilter.org/2016/wiki/index.php/File_Cgroup-matches-NFWS2016.html
We missed you, Daniel; I hope you can make it next year ;-)
> > ingress hook was that it is too specific. However, this new hook in
> > the network stack looks way more specific since *it only
> > works for cgroups*.
>
> ... why would that be something so overly specific? I don't think that "it
> only works for cgroups" would be a narrow use case. While the current
> sk_filter() BPF policies are set from application level, it makes sense to
> me to have an option for an entity that manages the cgroups to apply an
> external policy on the networking side for participating processes as well.
> It seems like a useful extension to the current sk_filter() infrastructure
> iff we carve out the details properly and generic enough, and besides ...
This forces anyone filtering socket traffic to go through cgroups,
which makes the networking infrastructure dependent on cgroups for no
reason, instead of simply using the cgroupv1, cgroupv2, or whatever
other information as yet another selector.
> [...]
> On 08/19/2016 12:35 PM, Daniel Mack wrote:
> [...]
> >So - I don't know. The whole 'eBPF in cgroups' idea was born because
> >through the discussions over the past months we had on all this, it
> >became clear to me that netfilter is not the right place for filtering
> >on local tasks. I agree the solution I am proposing in my patch set has
> >its downsides, mostly when it comes to transparency to users, but I
> >considered that acceptable. After all, we have eBPF users all over the
> >place in the kernel already, and seccomp, for instance, isn't any better
> >in that regard.
>
> ... since you mention seccomp here as well, it would be another good fit
> as a program subtype to apply syscall policies for those participants on
> a cgroup level too, f.e. to disallow certain syscalls. It would be quite
> similar conceptually. So, fwiw, if this is being designed generic enough,
> the use cases would go much broader than that.
Why do you need global seccomp policies? The process knows better what
it needs to place in its sandbox, so attaching this from the process
itself makes more sense to me... Anyway, this reminds me of selinux.
Back to my main point, I would not expect that we have to ask sysadmins
to dump BPF bytecode to understand what global policy is being
enforced.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 16:31 ` Pablo Neira Ayuso
@ 2016-08-19 16:37 ` Thomas Graf
0 siblings, 0 replies; 39+ messages in thread
From: Thomas Graf @ 2016-08-19 16:37 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniel Borkmann, Daniel Mack, htejun, ast, davem, kafai, fw,
harald, netdev
On 08/19/16 at 06:31pm, Pablo Neira Ayuso wrote:
> Why do you need global seccomp policies? The process knows better what
> he needs to place in his sandbox, so attaching this from the process
> itself makes more sense to me... Anyway, this reminds me to selinux.
Two different objectives. The point is to sandbox applications and
restrict their capabilities. It's not the application itself but the task
orchestration system around it that manages the policies.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 16:21 ` Pablo Neira Ayuso
@ 2016-08-19 17:07 ` Thomas Graf
2016-08-22 16:06 ` Pablo Neira Ayuso
0 siblings, 1 reply; 39+ messages in thread
From: Thomas Graf @ 2016-08-19 17:07 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
netdev
On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > Also true. A cgroup can currently only hold one bpf program for each
> > direction, and they are supposed to be set from one controlling instance
> > > in the system. However, it is possible to create subcgroups and install
> > > their own programs in them, which will then be effective instead of the one in
> > the parent. They will, however, replace each other in runtime behavior,
> > and not be stacked. This is a fundamentally different approach than how
> > nf_tables works of course.
>
> I see, this looks problematic indeed, thanks for confirming this.
What exactly is problematic? I think the proposed mechanism is very
clean in allowing subcgroups to provide the entire program. This
allows for delegation. Different orchestrators can manage different
cgroups. It's different as Daniel stated. I don't see how this is
problematic though.
You brought up multiple tables which reflect the cumulative approach.
This sometimes works but has its issues as well. Users must be aware
of each other and anticipate what rules other users might inject
before or after their own tables. The very existence of firewalld which
aims at democratizing this collaboration proves this point.
So in that sense I would very much like for both models to be made
available to users. nftables+cgroups for a cumulative approach as
well as BPF+cgroups for the delegation approach. I don't see why the
cgroups based filtering capability should not be made available to both.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
2016-08-17 14:23 ` Tejun Heo
2016-08-17 18:20 ` Alexei Starovoitov
@ 2016-08-21 20:14 ` Sargun Dhillon
2016-08-25 19:37 ` Tejun Heo
2 siblings, 1 reply; 39+ messages in thread
From: Sargun Dhillon @ 2016-08-21 20:14 UTC (permalink / raw)
To: Daniel Mack; +Cc: htejun, daniel, ast, davem, kafai, fw, pablo, harald, netdev
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> If CONFIG_CGROUP_BPF is enabled, and the cgroup associated with the
> receiving socket has eBPF programs installed, run them from
> sk_filter_trim_cap().
>
> eBPF programs used in this context are expected to either return 1 to
> let the packet pass, or != 1 to drop them. The programs have access to
> the full skb, including the MAC headers.
>
> This patch only implements the call site for ingress packets.
>
> Signed-off-by: Daniel Mack <daniel@zonque.org>
> ---
> net/core/filter.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c5d8332..a1dd94b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -52,6 +52,44 @@
> #include <net/dst.h>
> #include <net/sock_reuseport.h>
>
> +#ifdef CONFIG_CGROUP_BPF
> +static int sk_filter_cgroup_bpf(struct sock *sk, struct sk_buff *skb,
> + enum bpf_attach_type type)
> +{
> + struct sock_cgroup_data *skcd = &sk->sk_cgrp_data;
> + struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> + struct bpf_prog *prog;
> + int ret = 0;
> +
> + rcu_read_lock();
> +
> + switch (type) {
> + case BPF_ATTACH_TYPE_CGROUP_EGRESS:
> + prog = rcu_dereference(cgrp->bpf_egress);
> + break;
> + case BPF_ATTACH_TYPE_CGROUP_INGRESS:
> + prog = rcu_dereference(cgrp->bpf_ingress);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + ret = -EINVAL;
> + break;
> + }
> +
> + if (prog) {
> + unsigned int offset = skb->data - skb_mac_header(skb);
> +
> + __skb_push(skb, offset);
> + ret = bpf_prog_run_clear_cb(prog, skb) > 0 ? 0 : -EPERM;
> + __skb_pull(skb, offset);
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +#endif /* !CONFIG_CGROUP_BPF */
> +
> /**
> * sk_filter_trim_cap - run a packet through a socket filter
> * @sk: sock associated with &sk_buff
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
> if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> return -ENOMEM;
>
> +#ifdef CONFIG_CGROUP_BPF
> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> + if (err)
> + return err;
> +#endif
> +
> err = security_sock_rcv_skb(sk, skb);
> if (err)
> return err;
> --
> 2.5.5
>
So, casually looking at this patch, it looks like you're relying on
sock_cgroup_data, which only points to the default hierarchy. If someone uses
net_prio or net_classid, cgroup_sk_alloc_disable is called, and this won't work
anymore.
Any ideas on how to work around that? Does it make sense to add another pointer
to sock_cgroup_data, or at least a warning when allocation is disabled?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-19 17:07 ` Thomas Graf
@ 2016-08-22 16:06 ` Pablo Neira Ayuso
2016-08-22 16:22 ` Daniel Mack
0 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2016-08-22 16:06 UTC (permalink / raw)
To: Thomas Graf
Cc: Daniel Mack, htejun, daniel, ast, davem, kafai, fw, harald,
netdev
Hi Thomas,
On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:
> On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> > On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > > Also true. A cgroup can currently only hold one bpf program for each
> > > direction, and they are supposed to be set from one controlling instance
> > > in the system. However, it is possible to create subcgroups and install
> > > their own programs in them, which will then be effective instead of the one in
> > > the parent. They will, however, replace each other in runtime behavior,
> > > and not be stacked. This is a fundamentally different approach than how
> > > nf_tables works of course.
> >
> > I see, this looks problematic indeed, thanks for confirming this.
>
> What exactly is problematic? I think the proposed mechanism is very
> clean in allowing sub groups to provide the entire program. This
> allows for delegation. Different orchestrators can manage different
> cgroups. It's different as Daniel stated. I don't see how this is
> problematic though.
>
> You brought up multiple tables which reflect the cumulative approach.
> This sometimes works but has its issues as well. Users must be aware
> of each other and anticipate what rules other users might inject
> before or after their own tables. The very existence of firewalld which
> aims at democratizing this collaboration proves this point.
Firewalld was really required by the iptables predefined tables
model; as for nft, when we last talked about this during NFWS'15, the
future plans for firewalld were not clear yet.
Moreover, in nft, different users can indeed dump the ruleset, and it
would be possible to validate whether one policy is being shadowed by
another coming later on. A bpf bytecode dump cannot be taken back to
its original representation.
> So in that sense I would very much like for both models to be made
> available to users. nftables+cgroups for a cumulative approach as
> well as BPF+cgroups for the delegation approach. I don't see why the
> cgroups based filtering capability should not be made available to both.
This patchset also needs an extra egress hook, not yet known where it
should be placed, so two hooks in the network stack in the end, and this
only works for cgroups version 2.
Last time we talked about this, the main concern was that this was too
specific, but this approach seems even more specific to me.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-22 16:06 ` Pablo Neira Ayuso
@ 2016-08-22 16:22 ` Daniel Mack
2016-08-22 17:20 ` Sargun Dhillon
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Mack @ 2016-08-22 16:22 UTC (permalink / raw)
To: Pablo Neira Ayuso, Thomas Graf
Cc: htejun, daniel, ast, davem, kafai, fw, harald, netdev
On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:
>> You brought up multiple tables which reflect the cumulative approach.
>> This sometimes works but has its issues as well. Users must be aware
>> of each other and anticipate what rules other users might inject
>> before or after their own tables. The very existence of firewalld which
>> aims at democratizing this collaboration proves this point.
>
> Firewalld, was really required in the iptables predefined tables
> model, in nft last time we talked about this during NFWS'15, future
> plans for firewalld were not clear yet.
>
> Moreover, in nft, different users can indeed dump the ruleset and it
> would be possible to validate if one policy is being shadowed by
> another coming later on. The bpf bytecode dump cannot be taken to the
> original representation.
But as Thomas said - both things address different use-cases. For
container setups, there is no administrator involved who would use cli
tools, so I don't think that's really much of an argument.
>> So in that sense I would very much like for both models to be made
>> available to users. nftables+cgroups for a cumulative approach as
>> well as BPF+cgroups for the delegation approach. I don't see why the
>> cgroups based filtering capability should not be made available to both.
>
> This patchset also needs an extra egress hook, not yet known where it
> should be placed, so two hooks in the network stack in the end,
That should be solvable, I'm sure. I can as well leave egress out for
the next version so it can be added later on.
> and this only works for cgroups version 2.
I don't see a problem with that, as v1 and v2 hierarchies can peacefully
coexist.
> Last time we talked about this, the main concern was that this was too
> specific, but this approach seems even more specific to me.
Hmm, I disagree - bpf programs that are associated with cgroups are
rather something that can be extended a lot in the future, for instance
for handling port binding permissions etc. Unlike the proposed network
cgroup controller with all sorts of complicated knobs to control ranges
of ports etc., a bpf program can take care of that in a much more
versatile way.
I also strongly believe we can have both, a cgroup controller that has
bpf programs for socket filtering and other things, _and_ a "post socket
lookup netfilter" table type. Both will have their individual use-cases.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-22 16:22 ` Daniel Mack
@ 2016-08-22 17:20 ` Sargun Dhillon
2016-08-23 8:27 ` Daniel Mack
0 siblings, 1 reply; 39+ messages in thread
From: Sargun Dhillon @ 2016-08-22 17:20 UTC (permalink / raw)
To: Daniel Mack
Cc: Pablo Neira Ayuso, Thomas Graf, htejun, daniel, ast, davem, kafai,
fw, harald, netdev
On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
> > On Fri, Aug 19, 2016 at 07:07:39PM +0200, Thomas Graf wrote:
>
> >> You brought up multiple tables which reflect the cumulative approach.
> >> This sometimes works but has its issues as well. Users must be aware
> >> of each other and anticipate what rules other users might inject
> >> before or after their own tables. The very existence of firewalld which
> >> aims at democratizing this collaboration proves this point.
> >
> > Firewalld, was really required in the iptables predefined tables
> > model, in nft last time we talked about this during NFWS'15, future
> > plans for firewalld were not clear yet.
> >
> > Moreover, in nft, different users can indeed dump the ruleset and it
> > would be possible to validate if one policy is being shadowed by
> > another coming later on. The bpf bytecode dump cannot be taken to the
> > original representation.
>
> But as Thomas said - both things address different use-cases. For
> container setups, there is no administrator involved to use cli tools,
> so I don't think that's really much of an argument.
>
> >> So in that sense I would very much like for both models to be made
> >> available to users. nftables+cgroups for a cumulative approach as
> >> well as BPF+cgroups for the delegation approach. I don't see why the
> >> cgroups based filtering capability should not be made available to both.
> >
> > This patchset also needs an extra egress hook, not yet known where it
> > should be placed, so two hooks in the network stack in the end,
>
> That should be solvable, I'm sure. I can as well leave egress out for
> the next version so it can be added later on.
>
Any idea where you might put that yet? Does dev_xmit seem like a reasonable
place?
> > and this only works for cgroups version 2.
>
> I don't see a problem with that, as v1 and v2 hierarchies can peacefully
> coexist.
>
If someone uses the netprio or net_classid controllers, skcd matches
no longer work. Ideally, we should fix up these controllers to make them
more v2 friendly.
> > Last time we talked about this, the main concern was that this was too
> > specific, but this approach seems even more specific to me.
>
> Hmm, I disagree - bpf programs that are associated with cgroups are
> rather something that can be extended a lot in the future, for instance
> for handling port binding permissions etc. Unlike the proposed network
> cgroup controller with all sorts of complicated knobs to control ranges
> of ports etc, a bpf program that take care of that in a much more
> versatile way.
>
> I also strongly believe we can have both, a cgroup controller that has
> bpf programs for socket filtering and other things, _and_ a "post socket
> lookup netfilter" table type. Both will have their individual use-cases.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-22 17:20 ` Sargun Dhillon
@ 2016-08-23 8:27 ` Daniel Mack
2016-08-23 9:54 ` Sargun Dhillon
0 siblings, 1 reply; 39+ messages in thread
From: Daniel Mack @ 2016-08-23 8:27 UTC (permalink / raw)
To: Sargun Dhillon
Cc: Pablo Neira Ayuso, Thomas Graf, htejun, daniel, ast, davem, kafai,
fw, harald, netdev
On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
> On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
>> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
>>> This patchset also needs an extra egress hook, not yet known where it
>>> should be placed, so two hooks in the network stack in the end,
>>
>> That should be solvable, I'm sure. I can as well leave egress out for
>> the next version so it can be added later on.
>>
> Any idea where you might put that yet? Does dev_xmit seem like a reasonable
> place?
Ah, yes. Thanks for the pointer, that seems to work fine.
> If someone uses the netprio or net_classid controllers, skcd matches
> no longer work.
Yes, sock_cgroup_ptr() will fall back to the v2 root in this case.
> Ideally, we should fix up these controllers to make them
> more v2 friendly.
These controllers do not exist for v2, that's why sock_cgroup_ptr()
behaves that way. What's your idea to fix that up?
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-23 8:27 ` Daniel Mack
@ 2016-08-23 9:54 ` Sargun Dhillon
2016-08-23 10:03 ` Daniel Mack
0 siblings, 1 reply; 39+ messages in thread
From: Sargun Dhillon @ 2016-08-23 9:54 UTC (permalink / raw)
To: Daniel Mack
Cc: Pablo Neira Ayuso, Thomas Graf, htejun, daniel, ast, davem, kafai,
fw, harald, netdev
On Tue, Aug 23, 2016 at 10:27:28AM +0200, Daniel Mack wrote:
> On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
> > On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
> >> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
>
> >>> This patchset also needs an extra egress hook, not yet known where it
> >>> should be placed, so two hooks in the network stack in the end,
> >>
> >> That should be solvable, I'm sure. I can as well leave egress out for
> >> the next version so it can be added later on.
> >>
> > Any idea where you might put that yet? Does dev_xmit seem like a reasonable
> > place?
>
> Ah, yes. Thanks for the pointer, that seems to work fine.
>
Daniel pointed out to me that there's already a BPF program that's used there
for tc matches. So, it should work fine. I would just verify that you can
call programs from IRQs and that rcu_bh plays well with it.
Alternatively, if you want to filter only IP traffic, ip_output, and ip6_output
are fairly good places. I'm planning on putting some LSM hooks there soon. It's
a bit simpler.
I also suggest you use verdicts rather than trimming for simplicity sake.
> > If someone uses the netprio or net_classid controllers, skcd matches
> > no longer work.
>
> Yes, sock_cgroup_ptr() will fall back to the v2 root in this case.
>
> > Ideally, we should fix up these controllers to make them
> > more v2 friendly.
>
> These controllers do not exist for v2, that's why sock_cgroup_ptr()
> behaves that way. What's your idea to fix that up?
I think that we should just add another pointer to the end of sock_cgroup_data
while we're in this state of transition, and nudge people to disable
CONFIG_CGROUP_NET_PRIO and CONFIG_CGROUP_NET_CLASSID over time.
Alternatively, we add these controllers for v2, and we have some kind of marker
in the skcd for whether or not they're on v2. If they are, we can find the cgroup
and get the prioidx and classid from the css. Although the comment in
cgroup-defs.h suggests that v2 and classid should never be used concurrently, I
can't help but disagree, given there's legacy infrastructure that leverages
classid.
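(A purely hypothetical sketch of that first option - every name below
is made up for illustration and is not part of this series:)

struct sock_cgroup_data {
	u64 val;			/* existing packed prioidx/classid
					 * or v2 cgroup pointer */
#ifdef CONFIG_CGROUP_BPF
	struct cgroup *bpf_cgrp;	/* hypothetical: stays valid even
					 * after cgroup_sk_alloc_disable() */
#endif
};

static inline struct cgroup *sock_cgroup_bpf_ptr(struct sock_cgroup_data *skcd)
{
#ifdef CONFIG_CGROUP_BPF
	return skcd->bpf_cgrp;
#else
	return sock_cgroup_ptr(skcd);
#endif
}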
Looking forward to seeing these patches,
-Sargun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups
2016-08-23 9:54 ` Sargun Dhillon
@ 2016-08-23 10:03 ` Daniel Mack
0 siblings, 0 replies; 39+ messages in thread
From: Daniel Mack @ 2016-08-23 10:03 UTC (permalink / raw)
To: Sargun Dhillon
Cc: Pablo Neira Ayuso, Thomas Graf, htejun, daniel, ast, davem, kafai,
fw, harald, netdev
On 08/23/2016 11:54 AM, Sargun Dhillon wrote:
> On Tue, Aug 23, 2016 at 10:27:28AM +0200, Daniel Mack wrote:
>> On 08/22/2016 07:20 PM, Sargun Dhillon wrote:
>>> On Mon, Aug 22, 2016 at 06:22:20PM +0200, Daniel Mack wrote:
>>>> On 08/22/2016 06:06 PM, Pablo Neira Ayuso wrote:
>>
>>>>> This patchset also needs an extra egress hook, not yet known where to
>>>>> be placed, so two hooks in the network stacks in the end,
>>>>
>>>> That should be solvable, I'm sure. I can as well leave egress out for
>>>> the next version so it can be added later on.
>>>>
>>> Any idea where you might put that yet? Does dev_xmit seem like a reasonable
>>> place?
>>
>> Ah, yes. Thanks for the pointer, that seems to work fine.
>>
> Daniel pointed out to me that there's already a BPF program that's used there
> for tc matches. So, it should work fine. I would just verify you can call
> programs from IRQs, and rcu_bh plays well with it.
IRQs should not matter AFAICS, and for testing, I placed the hook even
outside of rcu_bh. All the program runner needs is rcu_read_lock() to
access the rcu protected pointers.
> Alternatively, if you want to filter only IP traffic, ip_output, and ip6_output
> are fairly good places. I'm planning on putting some LSM hooks there soon. It's
> a bit simpler.
If you do that, and that's simpler, we can as well move the hook over at
some point. For now, I think dev_xmit() is sufficient.
> I also suggest you use verdicts rather than trimming for simplicity sake.
That's how it works already. eBPF programs in that context are expected
to either return 0 (reject) or 1 (pass). They may, however, cause side
effects such as shared map updates etc., which is what the example
program does for accounting.
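(A minimal sketch of such a program in restricted C; the section name
and map layout are assumptions, and the actual sample in patch 5 may
build the instructions differently:)

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC() and map helpers from samples/bpf */

struct bpf_map_def SEC("maps") byte_count = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u64),
	.max_entries	= 1,
};

SEC("cgroup_ingress")	/* assumed section name */
int cgroup_filter(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *bytes = bpf_map_lookup_elem(&byte_count, &key);

	/* side effect: account every packet seen for this cgroup */
	if (bytes)
		__sync_fetch_and_add(bytes, skb->len);

	return 1;	/* pass; returning 0 drops the packet */
}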
> I think that we should just add another pointer to the end of sock_cgroup_data
> while we're in this state of transition, and nudge people to disable
> CONFIG_CGROUP_NET_PRIO and CONFIG_CGROUP_NET_CLASSID over time.
>
> Alternatively, we add these controllers for v2, and we have some kind of marker
> in the skcd for whether or not they're on v2. If they are, we can find the cgroup
> and get the prioidx and classid from the css. Although the comment in
> cgroup-defs.h suggests that v2 and classid should never be used concurrently, I
> can't help but disagree, given there's legacy infrastructure that leverages
> classid.
I'll leave that to Tejun to comment on :)
Thanks,
Daniel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs
2016-08-21 20:14 ` Sargun Dhillon
@ 2016-08-25 19:37 ` Tejun Heo
0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-08-25 19:37 UTC (permalink / raw)
To: Sargun Dhillon
Cc: Daniel Mack, daniel, ast, davem, kafai, fw, pablo, harald, netdev
Hello, Sargun.
On Sun, Aug 21, 2016 at 01:14:22PM -0700, Sargun Dhillon wrote:
> So, casually looking at this patch, it looks like you're relying on
> sock_cgroup_data, which only points to the default hierarchy. If someone uses
> net_prio or net_classid, cgroup_sk_alloc_disable is called, and this won't work
> anymore.
The requirement there comes from the network side. In short, davem
(rightfully) doesn't want further proliferation of cgroup association
fields in struct sock. It makes sense for network control too, as it's
schizophrenic to have different associations depending on the specific
controller. Also, the v2 requirement shouldn't really get in the way
as it can be mounted as just another hierarchy along with other v1
hierarchies.
> Any ideas on how to work around that? Does it make sense to add another pointer
> to sock_cgroup_data, or at least a warning when allocation is disabled?
cgroup already warns when the association gets disabled due to usage
of netcls and netprio. We probably want to update the warning message
to include bpf too.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
Thread overview: 39+ messages
2016-08-17 14:00 [RFC PATCH 0/5] Add eBPF hooks for cgroups Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 1/5] bpf: add new prog type for cgroup socket filtering Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers Daniel Mack
2016-08-17 14:10 ` Tejun Heo
2016-08-17 17:50 ` Alexei Starovoitov
2016-08-17 17:56 ` Tejun Heo
2016-08-17 14:00 ` [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands Daniel Mack
2016-08-17 14:20 ` Tejun Heo
2016-08-17 14:35 ` Daniel Mack
2016-08-17 15:06 ` Tejun Heo
2016-08-17 15:51 ` Daniel Mack
2016-08-17 17:48 ` Alexei Starovoitov
2016-08-17 15:08 ` Tejun Heo
2016-08-17 16:16 ` Eric Dumazet
2016-08-17 18:10 ` Alexei Starovoitov
2016-08-18 15:17 ` Daniel Mack
2016-08-17 14:00 ` [RFC PATCH 4/5] net: filter: run cgroup eBPF programs Daniel Mack
2016-08-17 14:23 ` Tejun Heo
2016-08-17 14:36 ` Daniel Mack
2016-08-17 14:58 ` Tejun Heo
2016-08-17 18:20 ` Alexei Starovoitov
2016-08-17 18:23 ` Alexei Starovoitov
2016-08-21 20:14 ` Sargun Dhillon
2016-08-25 19:37 ` Tejun Heo
2016-08-17 14:00 ` [RFC PATCH 5/5] samples: bpf: add userspace example for attaching eBPF programs to cgroups Daniel Mack
2016-08-19 9:19 ` [RFC PATCH 0/5] Add eBPF hooks for cgroups Pablo Neira Ayuso
2016-08-19 10:35 ` Daniel Mack
2016-08-19 11:20 ` Daniel Borkmann
2016-08-19 16:31 ` Pablo Neira Ayuso
2016-08-19 16:37 ` Thomas Graf
2016-08-19 16:21 ` Pablo Neira Ayuso
2016-08-19 17:07 ` Thomas Graf
2016-08-22 16:06 ` Pablo Neira Ayuso
2016-08-22 16:22 ` Daniel Mack
2016-08-22 17:20 ` Sargun Dhillon
2016-08-23 8:27 ` Daniel Mack
2016-08-23 9:54 ` Sargun Dhillon
2016-08-23 10:03 ` Daniel Mack
2016-08-19 16:01 ` Alexei Starovoitov