Re: [RFC] network namespaces

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Herbert Poetzl <herbert@13thfloor.at>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>,
	netdev@vger.kernel.org, "Serge E. Hallyn" <serue@us.ibm.com>,
	Andrey Savochkin <saw@sw.ru>,
	haveblue@us.ibm.com, clg@fr.ibm.com, sam@vilain.net,
	Andrew Morton <akpm@osdl.org>,
	dev@sw.ru, devel@openvz.org, alexey@sw.ru,
	Linux Containers <containers@lists.osdl.org>
Subject: Re: [RFC] network namespaces
Date: Tue, 5 Sep 2006 18:53:28 +0200	[thread overview]
Message-ID: <20060905165328.GA17317@MAIL.13thfloor.at> (raw)
In-Reply-To: <m1ejuq2wz0.fsf@ebiederm.dsl.xmission.com>

On Tue, Sep 05, 2006 at 08:45:39AM -0600, Eric W. Biederman wrote:
> Daniel Lezcano <dlezcano@fr.ibm.com> writes:
> 
> >>>2. People expressed concerns that complete separation of namespaces
> >>>   may introduce an undesired overhead in certain usage scenarios.
> >>>   The overhead comes from packets traversing input path, then output path,
> >>>   then input path again in the destination namespace if root namespace
> >>>   acts as a router.
> >
> > Yes, performance is probably one issue.
> >
> > My concerns was for layer 2 / layer 3 virtualization. I agree
> > a layer 2 isolation/virtualization is the best for the "system
> > container". But there is another family of container called
> > "application container", it is not a system which is run inside a
> > container but only the application. If you want to run a oracle
> > database inside a container, you can run it inside an application
> > container without launching <init> and all the services.
> >
> > This family of containers are used too for HPC (high performance
> > computing) and for distributed checkpoint/restart. The cluster
> > runs hundred of jobs, spawning them on different hosts inside an
> > application container. Usually the jobs communicates with broadcast
> > and multicast. Application containers does not care of having
> > different MAC address and rely on a layer 3 approach.
> >
> > Are application containers comfortable with a layer 2 virtualization
> > ? I don't think so, because several jobs running inside the same
> > host communicate via broadcast/multicast between them and between
> > other jobs running on different hosts. The IP consumption is a
> > problem too: 1 container == 2 IP (one for the root namespace/
> > one for the container), multiplicated with the number of jobs.
> > Furthermore, lot of jobs == lot of virtual devices.
> >
> > However, after a discussion with Kirill at the OLS, it appears we
> > can merge the layer 2 and 3 approaches if the level of network
> > virtualization is tunable and we can choose layer 2 or layer 3 when
> > doing the "unshare". The determination of the namespace for the
> > incoming traffic can be done with an specific iptable module as
> > a first step. While looking at the network namespace patches, it
> > appears that the TCP/UDP part is **very** similar at what is needed
> > for a layer
> > 3 approach.
> >
> > Any thoughts ?
> 
> For HPC if you are interested in migration you need a separate IP
> per container. If you can take you IP address with you migration of
> networking state is simple. If you can't take your IP address with you
> a network container is nearly pointless from a migration perspective.
>
> Beyond that from everything I have seen layer 2 is just much cleaner
> than any layer 3 approach short of Serge's bind filtering.

well, the 'ip subset' approach Linux-VServer and
other Jail solutions use is very clean, it just does
not match your expectations of a virtual interface
(as there is none) and it does not cope well with
all kinds of per context 'requirements', which IMHO
do not really exist on the application layer (only
on the whole system layer)

> Beyond that I have yet to see a clean semantics for anything
> resembling your layer 2 layer 3 hybrid approach. If we can't have
> clear semantics it is by definition impossible to implement correctly
> because no one understands what it is supposed to do.

IMHO that would be quite simple, have a 'namespace'
for limiting port binds to a subset of the available
ips and another one which does complete network 
virtualization with all the whistles and bells, IMHO
most of them are orthogonal and can easily be combined

 - full network virtualization
 - lightweight ip subset 
 - both

> Note. A true layer 3 approach has no impact on TCP/UDP filtering
> because it filters at bind time not at packet reception time. Once you
> start inspecting packets I don't see what the gain is from not going
> all of the way to layer 2.

IMHO this requirement only arises from the full system
virtualization approach, just look at the other jail
solutions (solaris, bsd, ...) some of them do not even 
allow for more than a single ip but they work quite
well when used properly ...

best,
Herbert

> Eric

next prev parent reply	other threads:[~2006-09-05 16:53 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-15 14:20 [RFC] network namespaces Andrey Savochkin
2006-08-15 14:48 ` [PATCH 1/9] network namespaces: core and device list Andrey Savochkin
2006-08-16 14:46   ` Dave Hansen
2006-08-16 16:45     ` Stephen Hemminger
2006-08-15 14:48 ` [PATCH 2/9] network namespaces: IPv4 routing Andrey Savochkin
2006-08-15 14:48 ` [PATCH 3/9] network namespaces: playing and debugging Andrey Savochkin
2006-08-16 16:46   ` Stephen Hemminger
2006-08-16 17:22     ` Eric W. Biederman
2006-08-17  6:28       ` Andrey Savochkin
2006-08-17  8:30     ` Kirill Korotaev
2006-08-15 14:48 ` [PATCH 4/9] network namespaces: socket hashes Andrey Savochkin
2006-09-18 15:12   ` Daniel Lezcano
2006-09-20 16:32     ` Andrey Savochkin
2006-09-21 12:34       ` Daniel Lezcano
2006-08-15 14:48 ` [PATCH 5/9] network namespaces: async socket operations Andrey Savochkin
2006-09-22 15:33   ` Daniel Lezcano
2006-09-23 13:16     ` Andrey Savochkin
2006-08-15 14:48 ` [PATCH 6/9] allow proc_dir_entries to have destructor Andrey Savochkin
2006-08-15 14:48 ` [PATCH 7/9] net_device seq_file Andrey Savochkin
2006-08-15 14:48 ` [PATCH 8/9] network namespaces: device to pass packets between namespaces Andrey Savochkin
2006-08-15 14:48 ` [PATCH 9/9] network namespaces: playing with pass-through device Andrey Savochkin
2006-08-16 11:53 ` [RFC] network namespaces Serge E. Hallyn
2006-08-16 15:12   ` Alexey Kuznetsov
2006-08-16 17:35     ` Eric W. Biederman
2006-08-17  8:29       ` Kirill Korotaev
2006-09-05 13:34   ` Daniel Lezcano
2006-09-05 14:45     ` Eric W. Biederman
2006-09-05 15:32       ` Daniel Lezcano
2006-09-05 16:53       ` Herbert Poetzl [this message]
2006-09-05 18:27         ` Eric W. Biederman
2006-09-06 14:52           ` Kirill Korotaev
2006-09-06 15:09             ` [Devel] " Kir Kolyshkin
2006-09-06  9:10         ` Daniel Lezcano
2006-09-06 16:56           ` Herbert Poetzl
2006-09-06 17:37             ` [Devel] " Kir Kolyshkin
2006-09-06 18:34               ` Eric W. Biederman
2006-09-06 18:58                 ` Kir Kolyshkin
2006-09-06 20:53                   ` Cedric Le Goater
2006-09-06 23:06                 ` Caitlin Bestler
2006-09-06 23:25                   ` Eric W. Biederman
2006-09-07  0:53                     ` Stephen Hemminger
2006-09-07  5:11                       ` Eric W. Biederman
2006-09-07  8:25                   ` Daniel Lezcano
2006-09-07 18:29                     ` Eric W. Biederman
2006-09-08  6:02                       ` Herbert Poetzl
2006-09-07 16:23                 ` [Devel] " Kirill Korotaev
2006-09-07 17:27                   ` Herbert Poetzl
2006-09-07 19:50                     ` Eric W. Biederman
2006-09-08 13:10                     ` Dmitry Mishin
2006-09-08 18:11                       ` Herbert Poetzl
2006-09-09  7:57                         ` Dmitry Mishin
2006-09-10  2:47                           ` Herbert Poetzl
2006-09-10  3:41                             ` Eric W. Biederman
2006-09-10  8:11                               ` Dmitry Mishin
2006-09-10 11:48                                 ` Eric W. Biederman
2006-09-10 19:19                               ` [Devel] " Herbert Poetzl
2006-09-10  7:45                             ` Dmitry Mishin
2006-09-10 19:22                               ` Herbert Poetzl
2006-09-12  3:26                               ` Eric W. Biederman
2006-09-11 14:40                           ` [Devel] " Daniel Lezcano
2006-09-11 14:57                             ` Herbert Poetzl
2006-09-11 15:04                               ` Daniel Lezcano
2006-09-11 15:10                               ` Dmitry Mishin
2006-09-12  3:28                                 ` Eric W. Biederman
2006-09-12  7:38                                   ` Dmitry Mishin
2006-09-06 21:44               ` [Devel] " Daniel Lezcano
2006-09-06 17:58             ` Eric W. Biederman
2006-09-05 15:47     ` Kirill Korotaev
2006-09-05 17:09     ` Eric W. Biederman
2006-09-06 20:25       ` Cedric Le Goater
2006-09-06 20:40         ` Eric W. Biederman
2006-10-04  9:40 ` Daniel Lezcano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060905165328.GA17317@MAIL.13thfloor.at \
    --to=herbert@13thfloor.at \
    --cc=akpm@osdl.org \
    --cc=alexey@sw.ru \
    --cc=clg@fr.ibm.com \
    --cc=containers@lists.osdl.org \
    --cc=dev@sw.ru \
    --cc=devel@openvz.org \
    --cc=dlezcano@fr.ibm.com \
    --cc=ebiederm@xmission.com \
    --cc=haveblue@us.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=sam@vilain.net \
    --cc=saw@sw.ru \
    --cc=serue@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).