netdev.vger.kernel.org archive mirror
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Network Development <netdev@vger.kernel.org>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
Date: Wed, 14 Mar 2018 17:17:54 -0700
Message-ID: <97dc8c66-9701-7970-bb38-750d79f767c8@gmail.com>
In-Reply-To: <CAADnVQL4PDiNq07ip-uh=n0+ba3nzTy+QJwew1x71UKeGgsiEw@mail.gmail.com>



On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstractions don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
> 
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_netns and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.


We apparently missed something when deploying ipvlan and one netns per
container/job.

Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.

It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead?
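
To make the unconnected-socket case concrete, here is a minimal userspace
sketch in Python. It is only an illustration, assuming a working loopback
interface; the function name udp_unconnected_roundtrip is hypothetical and
nothing here comes from the patch set. It shows a datagram delivered with
sendto() on a socket that never calls connect(), so the destination address
only ever appears in the send path:

```python
import socket


def udp_unconnected_roundtrip(payload: bytes = b"ping"):
    """Send one datagram with sendto() on a socket that never calls connect().

    Returns (received_payload, sender_is_connected).
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick a free port on loopback.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(2.0)
        dest = receiver.getsockname()

        # The destination is passed per call; connect() is never invoked.
        sender.sendto(payload, dest)

        # An unconnected socket has no peer, so getpeername() fails.
        try:
            sender.getpeername()
            connected = True
        except OSError:
            connected = False

        data, _ = receiver.recvfrom(1024)
        return data, connected
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(udp_unconnected_roundtrip())
```

Any rewrite of the destination for this kind of traffic would therefore have
to happen somewhere other than connect().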

This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.

eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.
