Re: Network virtualization/isolation

From: ebiederm@xmission.com (Eric W. Biederman)
To: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: hadi@cyberus.ca, Dmitry Mishin <dim@openvz.org>,
	Stephen Hemminger <shemminger@osdl.org>,
	netdev@vger.kernel.org,
	Linux Containers <containers@lists.osdl.org>,
	Herbert Poetzl <herbert@13thfloor.at>
Subject: Re: Network virtualization/isolation
Date: Tue, 28 Nov 2006 14:50:03 -0700
Message-ID: <m14psji6lw.fsf@ebiederm.dsl.xmission.com>
In-Reply-To: <456C9B8C.1010701@fr.ibm.com> (Daniel Lezcano's message of "Tue, 28 Nov 2006 21:26:52 +0100")

Daniel Lezcano <dlezcano@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>> I do not want to get into a big debate on the merits of various
>> techniques at this time.  We seem to be in basic agreement
>> about what we are talking about.
>>
>> There is one thing I think we can all agree upon.
>> - Everything except isolation at the network device/L2 layer, does not
>>   allow guests to have the full power of the linux networking stack.
> Agree.
>>
>> - There has been a demonstrated use for the full power of the linux
>>   networking stack in containers..
> Agree.
>>
>> - There are a set of techniques which look as though they will give
>>   us full speed when we do isolation of the network stack at the
>>   network device/L2 layer.
> Agree.

Herbert Poetzl <herbert@13thfloor.at> writes:
> correct, don't get me wrong, I'm absolutely not against
> layer 2 virtualization, but not at the expense of light-
> weight layer 3 isolation, which _is_ the traditional way
> 'containers' are built (see BSD, solaris ...)

Ok.  So on this point we agree.  Full isolation at the network device/L2 level
is desirable and no one is opposed to that.

There is, however, a strong feeling, especially for the case of
application containers, that something more focused on what a
non-privileged process can use and deal with would be nice: the ``L3''
case.

I agree that has potential, but I worry about three things.
- Premature optimization.
- A poor choice of semantics.
- Feature creep leading to insane semantics.

I feel there is something in the L3 arguments as well, and it sounds
like it would be a good idea to flesh out the semantics.

For full network isolation we have the case that every process,
every socket, and every network device belongs to a network namespace.
This is enough to derive the network namespace for all other user
visible data structures, and to a large extent to define their semantics.

We still need a definition of the non-privileged case that is
compatible with the former definition.

.....

What unprivileged user space gets to manipulate are sockets.  So perhaps
we can break our model into a network socket namespace and network device
namespace.  

I would define it so that for each socket there is exactly one network
socket namespace.  And for each network socket namespace there is exactly
one network device namespace.
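
As a rough sketch of that relationship (Python purely for illustration;
the class names NetDevNamespace, NetSockNamespace and Sock are made up
here and are not kernel structures), the device namespace of a socket
is always derived through its socket namespace:

```python
from dataclasses import dataclass


# Hypothetical names for illustration only; none of these exist in the kernel.
@dataclass(frozen=True)
class NetDevNamespace:
    name: str


@dataclass(frozen=True)
class NetSockNamespace:
    name: str
    dev_ns: NetDevNamespace  # exactly one device namespace per socket namespace


@dataclass
class Sock:
    port: int
    sock_ns: NetSockNamespace  # exactly one socket namespace per socket

    @property
    def dev_ns(self) -> NetDevNamespace:
        # Derived through the socket namespace, never stored on the socket.
        return self.sock_ns.dev_ns


# Two application containers sharing one set of network devices.
host = NetDevNamespace("host")
web = NetSockNamespace("web", host)
db = NetSockNamespace("db", host)
s1 = Sock(80, web)
s2 = Sock(5432, db)
```

Here s1 and s2 live in different socket namespaces but resolve to the
same device namespace.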

The network socket namespace would be concerned with the rules for deciding
which local addresses a socket can connect/accept/bind to.

The network device namespace would be concerned with everything else.

The problem I see is the wildcard binds.  In general, unmodified
server applications want to bind to *:port by default.  Running
two such applications on different IP addresses is a problem.  Even
if you can configure them not to do that, it is easy to do that
by default.
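
To make the collision concrete, here is a toy model of the usual
bind-conflict check.  It is a simplification that ignores SO_REUSEADDR
and friends, and binds_conflict is a made-up helper, not a kernel
function:

```python
WILDCARD = "*"


def binds_conflict(a, b):
    """Return True if two (address, port) binds collide.

    Simplified model: the wildcard address overlaps every address,
    and socket options such as SO_REUSEADDR are ignored.
    """
    (addr_a, port_a), (addr_b, port_b) = a, b
    if port_a != port_b:
        return False
    return WILDCARD in (addr_a, addr_b) or addr_a == addr_b


# Two servers meant for different IPs, both left at their *:80 default.
default_clash = binds_conflict(("*", 80), ("*", 80))
# The same two servers configured with explicit addresses.
explicit_clash = binds_conflict(("10.0.0.1", 80), ("10.0.0.2", 80))
```

With the defaults the two servers collide; with explicit addresses
they do not.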

There are some interesting flexible cases where we want one
application container to have one port on an IP address, and a
different application container to have a different port on the same
IP address.

So we need something flexible and not just based on IP addresses.
I think the right answer here is a netfilter table that defines
what we can accept/bind/connect the socket to.
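
A minimal sketch of what such a table could look like, assuming a
first-match rule list with default deny.  This is not netfilter syntax
or any existing API; Rule and allowed are names invented for the
example:

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    # Hypothetical rule format, not an existing netfilter structure.
    op: str               # "bind", "accept", or "connect"
    addr: Optional[str]   # None matches any local address
    port: Optional[int]   # None matches any port
    allow: bool


def allowed(rules, op, addr, port):
    """Return the verdict of the first matching rule."""
    for r in rules:
        if r.op == op and r.addr in (None, addr) and r.port in (None, port):
            return r.allow
    return False  # assumption: nothing matched means deny


# Two application containers sharing 10.0.0.1, split by port.
web_rules = [Rule("bind", "10.0.0.1", 80, True)]
mail_rules = [Rule("bind", "10.0.0.1", 25, True)]
```

The web container can bind 10.0.0.1:80 but not 10.0.0.1:25, and the
mail container the reverse, which is the per-port split described
above.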

The tricky part is deciding when to return -EADDRINUSE.

I think we can specify the rules such that if we conflict with
another socket in the same socket namespace the rules remain
as they are today, and the kernel returns -EADDRINUSE unconditionally.

I think for cases across network socket namespaces it should
be a matter for the rules to decide whether the connection should
happen and what error code to return if it does not.
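
Roughly, and only as a sketch of the semantics for bind (try_bind and
the two policy functions are hypothetical, and wildcard addresses are
left out to keep it short):

```python
import errno


def try_bind(bound, policy, ns, addr, port):
    """Return 0 on success or a negative errno, kernel style.

    bound  -- set of (namespace, addr, port) tuples already bound
    policy -- hypothetical per-system rule hook:
              policy(ns, addr, port, taken_elsewhere) -> 0 or -errno
    """
    # Same socket namespace: today's rule, unconditional.
    if (ns, addr, port) in bound:
        return -errno.EADDRINUSE
    # Different socket namespace: the rules decide, including the error.
    taken_elsewhere = any(
        a == addr and p == port and n != ns for (n, a, p) in bound
    )
    err = policy(ns, addr, port, taken_elsewhere)
    if err == 0:
        bound.add((ns, addr, port))
    return err


def share(ns, addr, port, taken_elsewhere):
    return 0  # example policy: namespaces may share an address


def strict(ns, addr, port, taken_elsewhere):
    return -errno.EADDRINUSE if taken_elsewhere else 0


bound = set()
r1 = try_bind(bound, share, "a", "10.0.0.1", 80)   # first bind in "a"
r2 = try_bind(bound, share, "a", "10.0.0.1", 80)   # repeat in "a"
r3 = try_bind(bound, share, "b", "10.0.0.1", 80)   # other namespace
r4 = try_bind(bound, strict, "c", "10.0.0.1", 80)  # other namespace, strict
```

Only the repeat inside namespace "a" is refused unconditionally; what
happens to "b" and "c" is up to whichever policy is installed.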

There is a potential in this for an ambiguous case where two
applications are listening for connections on the same address
and the same port, and both will allow the connection.  If that
is the case, I believe the proper definition is that the first socket
we find that will accept the connection gets the connection.
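
In sketch form, with the caveat that the search order is something I
have not defined here and the names are invented for the example:

```python
def pick_listener(listeners, dst_addr, dst_port, src_addr):
    """Return the namespace name of the first listener that takes the
    connection, or None if no listener does.

    listeners -- ordered list of (ns_name, addr, port, accepts), where
                 accepts(src_addr) stands in for that namespace's rules.
    """
    for ns_name, addr, port, accepts in listeners:
        if addr == dst_addr and port == dst_port and accepts(src_addr):
            return ns_name
    return None


# Two containers listening on the same address and port.
listeners = [
    ("a", "10.0.0.1", 80, lambda src: src.startswith("192.168.")),
    ("b", "10.0.0.1", 80, lambda src: True),
]
inside = pick_listener(listeners, "10.0.0.1", 80, "192.168.1.5")
outside = pick_listener(listeners, "10.0.0.1", 80, "172.16.0.1")
```

A client at 192.168.1.5 is acceptable to both listeners and goes to
"a" because it is found first; a client at 172.16.0.1 is refused by
"a" and falls through to "b".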

I believe this is a sufficiently general definition that we can
make it work with the network types in the kernel, including DECnet,
IP, and IPv6.

The only gain I see for having the socket namespace is socket
collision detection, and a convenient tag to distinguish containers.

I think this set of netfilter rules may be an interesting alternative
to IP connection tracking in the current firewall code.

... 

Assuming the above scheme works, does that sound like what people
actually want to use?

I think with the appropriate set of rules it provides what is needed
for application migration.  I.e. 127.0.0.1 can be filtered so that
you can only connect to sockets in your current container.
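
For instance, a loopback rule along these lines (may_connect is a
hypothetical helper; a real rule would sit in the table with
everything else):

```python
import ipaddress


def may_connect(client_ns, listener_ns, dst_addr):
    """Loopback traffic stays inside one container.

    Assumption: non-loopback destinations are left to other rules,
    so this sketch simply lets them through.
    """
    if ipaddress.ip_address(dst_addr).is_loopback:
        return client_ns == listener_ns
    return True


same = may_connect("web", "web", "127.0.0.1")
cross = may_connect("web", "db", "127.0.0.1")
```

A connect to 127.0.0.1 only ever reaches a listener in the caller's
own container, so the address means the same thing after migration.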

It does get a little odd, because it allows for the possibility
that you can have multiple connected sockets with the same source IP,
source port, destination IP, and destination port, if the rules are
set up appropriately.  I don't see that peculiarity being visible on
the outside network, so it shouldn't be a problem.

Eric
