netdev.vger.kernel.org archive mirror
* [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
@ 2018-03-14  3:39 Alexei Starovoitov
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC
  To: davem; +Cc: daniel, netdev, kernel-team

For our container management we've been using a complicated and fragile setup
consisting of an LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications.
The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable.
Changing apps is not possible and LD_PRELOAD doesn't work
for apps that don't use glibc, like Java and Go programs.
BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
  to let a bpf prog inspect and modify the 'struct sockaddr' provided
  by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated
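
To make the entry hooks more concrete, here is a rough userspace sketch of
what a connect-time program could do with the address it is handed. It is
not code from the patches: struct sketch_sock_addr, sketch_connect4() and
the addresses and ports are made up for illustration, and the return values
assume the usual cgroup-bpf convention of 1 to allow and 0 to reject.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl(), htons() */

/* Hypothetical stand-in for the context a hook would see. The real layout
 * is whatever the patches define in include/uapi/linux/bpf.h. */
struct sketch_sock_addr {
	uint32_t user_ip4;	/* destination IPv4, network byte order */
	uint16_t user_port;	/* destination port, network byte order */
};

enum { SKETCH_DENY = 0, SKETCH_ALLOW = 1 };

/* Pre-connect hook: reject one port by policy and rewrite connects to a
 * made-up service address (10.0.0.1) to a local backend (127.0.0.1:8080). */
static int sketch_connect4(struct sketch_sock_addr *ctx)
{
	assert(ctx);

	if (ctx->user_port == htons(25))
		return SKETCH_DENY;	/* connect() would fail */

	if (ctx->user_ip4 == htonl(0x0a000001)) {
		ctx->user_ip4 = htonl(0x7f000001);
		ctx->user_port = htons(8080);
	}
	return SKETCH_ALLOW;
}
```

The application still calls connect() with the address it always used; only
the sockaddr seen by the rest of the stack changes.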

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

The main question for Daniel and Dave is what approach to take
with prog types...

In this patch set we introduce 6 new program types to make user
experience easier:
  BPF_PROG_TYPE_CGROUP_INET4_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_BIND,
  BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
  BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
  BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,

since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
and a different prog type for v4 and v6 helps the verifier reject such access
at load time.
Similarly, bind vs connect are two different prog types too,
since only sys_connect programs can call the new bpf_bind() helper.
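
As an illustration of why separate types make the load-time check simple,
the two rules above can be written as a pair of predicates keyed only on
the program type. This is a userspace sketch, not verifier code; all
sketch_* names are hypothetical, and whether a v6 program may touch
user_ip4 is not stated here, so the sketch leaves that access permitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of four of the six proposed program types. */
enum sketch_prog_type {
	SKETCH_INET4_BIND,
	SKETCH_INET6_BIND,
	SKETCH_INET4_CONNECT,
	SKETCH_INET6_CONNECT,
	SKETCH_PROG_TYPE_MAX,
};

enum sketch_field { SKETCH_USER_IP4, SKETCH_USER_IP6, SKETCH_USER_PORT };
enum sketch_helper { SKETCH_HELPER_BIND };	/* stands for bpf_bind() */

static bool sketch_is_v6(enum sketch_prog_type t)
{
	return t == SKETCH_INET6_BIND || t == SKETCH_INET6_CONNECT;
}

static bool sketch_is_connect(enum sketch_prog_type t)
{
	return t == SKETCH_INET4_CONNECT || t == SKETCH_INET6_CONNECT;
}

/* Load-time rule 1: v4 programs must not access user_ip6. */
static bool sketch_field_ok(enum sketch_prog_type t, enum sketch_field f)
{
	assert(t < SKETCH_PROG_TYPE_MAX);
	if (f == SKETCH_USER_IP6)
		return sketch_is_v6(t);
	return true;	/* assumption: other fields are open to every type */
}

/* Load-time rule 2: only connect programs may call bpf_bind(). */
static bool sketch_helper_ok(enum sketch_prog_type t, enum sketch_helper h)
{
	assert(t < SKETCH_PROG_TYPE_MAX);
	if (h == SKETCH_HELPER_BIND)
		return sketch_is_connect(t);
	return false;
}
```

Because the type is known when the program is loaded, both answers are
available before the program is ever attached to a hook.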

This approach is very different from tcp-bpf where a single
'struct bpf_sock_ops' and a single prog type are used for different hooks.
There the field checks are done at run-time instead of load time.

I think the approach taken by this patch set is justified,
but we may do better if we extend BPF_PROG_ATTACH cmd
with log_buf + log_size, then we should be able to combine
bind+connect+v4+v6 into a single program type.
The idea is that at load time the verifier will remember a bitmask
of fields in bpf_sock_addr used by the program and the helpers
that the program used, then at attach time we can check that the
hook is compatible with the features used by the program and
report a human readable error message back via log_buf.
We cannot do this right now with just EINVAL, since there are too many
combinations of errors like 'using user_ip6 field but attaching to v4 hook'
to express as errno.
This would be a bigger change. If you folks think it's worth it
we can go with this approach, or if you think 6 new prog types
is not too bad, we can leave the patch as-is.
Comments?
Other comments on patches are welcome.
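
For discussion, the attach-time alternative could look roughly like the
userspace sketch below: a bitmask recorded at load time is compared with
what the target hook offers, and a readable message is written to a
caller-supplied log buffer. Nothing here exists in the patches. The
SKETCH_* flags, the per-hook feature table (including the assumption that
v6 hooks do not offer user_ip4) and the message format are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Features a program was seen to use at load time (hypothetical bits). */
#define SKETCH_USES_USER_IP4	(1u << 0)
#define SKETCH_USES_USER_IP6	(1u << 1)
#define SKETCH_USES_BPF_BIND	(1u << 2)

enum sketch_hook {
	SKETCH_HOOK_BIND4,
	SKETCH_HOOK_BIND6,
	SKETCH_HOOK_CONNECT4,
	SKETCH_HOOK_CONNECT6,
};

/* What each hook supports; this table is an assumption of the sketch. */
static uint32_t sketch_hook_features(enum sketch_hook h)
{
	switch (h) {
	case SKETCH_HOOK_BIND4:    return SKETCH_USES_USER_IP4;
	case SKETCH_HOOK_BIND6:    return SKETCH_USES_USER_IP6;
	case SKETCH_HOOK_CONNECT4: return SKETCH_USES_USER_IP4 | SKETCH_USES_BPF_BIND;
	case SKETCH_HOOK_CONNECT6: return SKETCH_USES_USER_IP6 | SKETCH_USES_BPF_BIND;
	}
	return 0;
}

static const char *sketch_hook_name(enum sketch_hook h)
{
	switch (h) {
	case SKETCH_HOOK_BIND4:    return "bind4";
	case SKETCH_HOOK_BIND6:    return "bind6";
	case SKETCH_HOOK_CONNECT4: return "connect4";
	case SKETCH_HOOK_CONNECT6: return "connect6";
	}
	return "unknown";
}

/* Attach-time check: returns 0 if compatible, otherwise -EINVAL with an
 * explanation of the mismatch in log_buf. */
static int sketch_attach_check(uint32_t used, enum sketch_hook h,
			       char *log_buf, size_t log_size)
{
	uint32_t bad = used & ~sketch_hook_features(h);

	assert(log_buf && log_size > 0);
	log_buf[0] = '\0';
	if (!bad)
		return 0;

	snprintf(log_buf, log_size, "program uses%s%s%s, not supported by %s hook",
		 (bad & SKETCH_USES_USER_IP4) ? " user_ip4" : "",
		 (bad & SKETCH_USES_USER_IP6) ? " user_ip6" : "",
		 (bad & SKETCH_USES_BPF_BIND) ? " bpf_bind()" : "",
		 sketch_hook_name(h));
	return -EINVAL;
}
```

The errno stays a plain EINVAL in every failing case; the log text is what
tells the different mismatches apart.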

Andrey Ignatov (6):
  bpf: Hooks for sys_bind
  selftests/bpf: Selftest for sys_bind hooks
  net: Introduce __inet_bind() and __inet6_bind
  bpf: Hooks for sys_connect
  selftests/bpf: Selftest for sys_connect hooks
  bpf: Post-hooks for sys_bind

 include/linux/bpf-cgroup.h                    |  68 +++-
 include/linux/bpf_types.h                     |   6 +
 include/linux/filter.h                        |  10 +
 include/net/inet_common.h                     |   2 +
 include/net/ipv6.h                            |   2 +
 include/net/sock.h                            |   3 +
 include/net/udp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  52 ++-
 kernel/bpf/cgroup.c                           |  36 ++
 kernel/bpf/syscall.c                          |  42 ++
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             | 479 ++++++++++++++++++++++-
 net/ipv4/af_inet.c                            |  60 ++-
 net/ipv4/tcp_ipv4.c                           |  16 +
 net/ipv4/udp.c                                |  14 +
 net/ipv6/af_inet6.c                           |  47 ++-
 net/ipv6/tcp_ipv6.c                           |  16 +
 net/ipv6/udp.c                                |  20 +
 tools/include/uapi/linux/bpf.h                |  39 +-
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
 tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
 tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
 tools/testing/selftests/bpf/test_sock_addr.c  | 541 ++++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
 25 files changed, 1580 insertions(+), 53 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
 create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
 create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh

-- 
2.9.5

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
@ 2018-03-14 18:41 Alexei Starovoitov
  2018-03-15  0:17 ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2018-03-14 18:41 UTC
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team

On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortunately that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately all hierarchical namespace abstractions don't work.
> To run an application inside a cgroup container that was not written
> with containers in mind we have to create the illusion of running
> in a non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite is providing
> connectivity and load balancing features that you simply cannot do
> with hierarchical networking primitives.

I have to explain this a bit further...
We also considered hacking these 'connectivity holes' in
netns and/or vrf, but that would be a real layering violation,
since a clean l2, l3 abstraction would suddenly support
something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches
through the layers and connects l2 abstraction of netns
at l3 layer, this is not something kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.
Instead bpf programs at bind/connect time _help_
applications discover and connect to each other.
All containers are running in init_netns and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it always should. The whole system is
clean from a network point of view.
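
As a rough illustration of the discovery part, the post-bind hook can
record which container owns a port, and a connect hook in another container
can consult that record to decide whether the service is local. The sketch
below runs in userspace and uses a plain array where the real setup would
use a bpf map; the sketch_* names and the use of 0 as "no entry" are
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_PORTS 65536

/* Stand-in for a bpf map keyed by port (host byte order).
 * A value of 0 means no local container has bound this port. */
static uint32_t sketch_port_owner[SKETCH_MAX_PORTS];

/* Post-bind hook: the port has been allocated, remember its owner. */
static void sketch_post_bind(uint16_t port, uint32_t container_id)
{
	assert(container_id != 0);	/* 0 is reserved for "no entry" */
	sketch_port_owner[port] = container_id;
}

/* Connect hook helper: returns the local owner of the port, or 0 if the
 * service is not running locally and has to be reached remotely. */
static uint32_t sketch_connect_lookup(uint16_t port)
{
	return sketch_port_owner[port];
}
```

A connect program could use the result to pick the sockaddr to rewrite to,
after which the normal fib/neighbor logic takes over unchanged.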


end of thread, other threads:[~2018-03-15 16:22 UTC | newest]

Thread overview: 23+ messages
2018-03-14  3:39 [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
2018-03-14  6:21   ` Eric Dumazet
2018-03-14 18:00     ` Alexei Starovoitov
2018-03-14 14:37   ` Daniel Borkmann
2018-03-14 14:55     ` Daniel Borkmann
2018-03-14 18:11     ` Alexei Starovoitov
2018-03-14 23:27       ` Daniel Borkmann
2018-03-15  0:29         ` Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 2/6] selftests/bpf: Selftest for sys_bind hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks Alexei Starovoitov
2018-03-14  3:39 ` [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind Alexei Starovoitov
2018-03-14 17:13 ` [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks David Ahern
2018-03-14 18:00   ` Alexei Starovoitov
2018-03-14 17:22 ` Mahesh Bandewar (महेश बंडेवार)
2018-03-14 18:01   ` Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2018-03-14 18:41 [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind Alexei Starovoitov
2018-03-15  0:17 ` Eric Dumazet
2018-03-15  3:37   ` Alexei Starovoitov
2018-03-15 14:14     ` Jiri Benc
2018-03-15 16:22     ` Mahesh Bandewar (महेश बंडेवार)
