Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
To: Alexei Starovoitov
Cc: Alexei Starovoitov, "David S. Miller", Daniel Borkmann, Network Development, Kernel Team
From: Eric Dumazet
Message-ID: <97dc8c66-9701-7970-bb38-750d79f767c8@gmail.com>
Date: Wed, 14 Mar 2018 17:17:54 -0700
Sender: netdev-owner@vger.kernel.org

On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstractions don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
>
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_netns and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.

We apparently missed something when deploying ipvlan and one netns per container/job.

Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.
It also works with UDP just fine.

Are you considering adding a hook later for sendmsg() (unconnected socket or not),
or do you want to use the existing one in ip_finish_output(), adding per-packet overhead?

This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not credible.
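
For readers of the archive, the post-bind/connect mechanism described in the quoted text can be modelled in a few lines of user-space Python. This is only a sketch under stated assumptions, not the BPF code from the patch set: ServiceMap, post_bind, connect, the container id and every address are hypothetical names chosen for illustration, and a plain dict stands in for the bpf map.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Addr = Tuple[str, int]  # (ip, port)


@dataclass
class ServiceMap:
    """Hypothetical stand-in for the bpf map filled by the post-bind hook."""

    # port -> (container id, address the service is really bound to)
    entries: Dict[int, Tuple[str, Addr]] = field(default_factory=dict)

    def post_bind(self, container_id: str, bound: Addr) -> None:
        # Assumption: the hook remembers the port and the owning container
        # once bind() has succeeded.
        self.entries[bound[1]] = (container_id, bound)

    def connect(self, requested: Addr) -> Addr:
        # Assumption: the connect-time hook rewrites the destination sockaddr
        # when the port belongs to a known service, and otherwise leaves the
        # address untouched so normal routing applies.
        hit = self.entries.get(requested[1])
        return hit[1] if hit is not None else requested
```

In this model, a service in container "c1" bound to 10.0.0.5:8080 is recorded by post_bind, and a client in another container that calls connect(("127.0.0.1", 8080)) is redirected to ("10.0.0.5", 8080). A destination whose port is not in the map passes through unchanged. Load balancing and remote services, both mentioned in the quoted text, are left out of the sketch.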