From: Eric Dumazet <eric.dumazet@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Daniel Borkmann <daniel@iogearbox.net>,
Network Development <netdev@vger.kernel.org>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
Date: Wed, 14 Mar 2018 17:17:54 -0700
Message-ID: <97dc8c66-9701-7970-bb38-750d79f767c8@gmail.com>
In-Reply-To: <CAADnVQL4PDiNq07ip-uh=n0+ba3nzTy+QJwew1x71UKeGgsiEw@mail.gmail.com>

On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortunately that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstractions don't work.
>> To run an application inside a cgroup container that was not written
>> with containers in mind we have to create an illusion of running
>> in a non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
>
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be a real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_netns and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.
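
To make the quoted mechanism concrete, here is a minimal user-space sketch in Python. It is not BPF and not code from the patch set: `service_map`, `post_bind_hook`, and `connect_hook` are hypothetical names, and the addresses are documentation examples. It only models the idea of recording the port and container at post-bind time and rewriting the destination sockaddr at connect time.

```python
# Toy user-space model of the bind/connect hooks described in the quote.
# Everything here is hypothetical: real hooks would be BPF programs
# attached to a cgroup, and service_map would be a BPF map.

service_map = {}  # port -> (container_id, ip), recorded at post-bind time


def post_bind_hook(container_id, ip, port):
    """Remember which container bound which port, and on which address."""
    service_map[port] = (container_id, ip)


def connect_hook(dst_ip, dst_port):
    """Rewrite the destination sockaddr if the port is a known service."""
    entry = service_map.get(dst_port)
    if entry is None:
        return dst_ip, dst_port      # unknown service: leave sockaddr alone
    _container_id, real_ip = entry
    return real_ip, dst_port         # known service: point at where it runs


post_bind_hook("container-a", "2001:db8::a", 8080)
rewritten = connect_hook("::1", 8080)   # port was registered, so it is rewritten
untouched = connect_hook("::1", 9090)   # port was never registered
```

After the rewrite, the connection would proceed through the normal routing path; the sketch does not model that part.
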

We apparently missed something when deploying ipvlan and one netns per
container/job.
Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.
It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead?
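
For context on the sendmsg() question, a minimal Python sketch (not from the thread; `udp_roundtrip` is a hypothetical helper) of an unconnected UDP socket. The destination is supplied on each send call and connect() is never issued, so a hook that only runs at connect() time would not run for this send.

```python
import socket


def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram over loopback without ever calling connect()."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # kernel picks a free port
        receiver.settimeout(2.0)          # avoid hanging if the datagram is lost
        dest = receiver.getsockname()
        # The destination is passed per call; the sender socket stays
        # unconnected for its whole lifetime.
        sender.sendto(payload, dest)
        data, _peer = receiver.recvfrom(2048)
        return data


received = udp_roundtrip(b"ping")
```
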
This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.
eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.