James Morris wrote:
> On Tue, 14 Nov 2006, Daniel Lezcano wrote:
> 
>> the attached document describes the network isolation at the layer 2 and at
>> the layer 3, it presents the pros and cons of the different approaches, their
>> common points and the impacted network code.
>> I hope it will be helpful :)
> 
> What about other network subsystems: xfrm, netfilter, iptables, netlink, 
> etc. ?

They are not addressed for the moment; Dmitry Mishin is looking at
netfilter isolation.

Netlink has 2 aspects:

  * Communication between processes. Because the message format uses a
destination pid, this will be handled by the pid virtualization.

  * IP management. At layer 2, there is nothing to do because data
access is relative to the namespace. At layer 3, each case has to be
handled explicitly to check the IPs. Some work has already been done
on ifaddr isolation.

As for iptables, Jong Choi did some work on its isolation (see below).
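
To make the layer 3 case concrete, here is a minimal user-space sketch of
the kind of check meant above: an address is only visible if it belongs to
the caller's namespace. The names (struct l3_namespace, l3_ns_owns_addr,
l3_ns_filter_addrs) are hypothetical and are not taken from any patch; the
flat address array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the IPv4 addresses assigned to one namespace. */
struct l3_namespace {
    const uint32_t *addrs;   /* addresses, host byte order */
    size_t          naddrs;
};

/* Return true if addr is assigned to the namespace ns. */
static bool l3_ns_owns_addr(const struct l3_namespace *ns, uint32_t addr)
{
    assert(ns != NULL);
    for (size_t i = 0; i < ns->naddrs; i++)
        if (ns->addrs[i] == addr)
            return true;
    return false;
}

/*
 * Copy to out only those addresses from in[0..n) that ns owns, the way an
 * address dump would be filtered before being returned to the caller.
 * out must have room for n entries. Returns the number of entries kept.
 */
static size_t l3_ns_filter_addrs(const struct l3_namespace *ns,
                                 const uint32_t *in, size_t n, uint32_t *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (l3_ns_owns_addr(ns, in[i]))
            out[kept++] = in[i];
    return kept;
}
```

At layer 2 no such filter is needed, since the lookup itself would already
be relative to the namespace.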

Cheers.

	-- Daniel

-------------------------------------------------------------------------------


Hi Rusty,

 > > I'm currently looking at a container-based lightweight virtualization
 > > technology in Linux which has recently been actively discussed on LKML by
 > > IBMers (Hubertus Franke, Dave Hansen, Serge Hallyn, and Cedric Le
 > > Goater) and by developers like Herbert Poetzl, Eric Biederman, and
 > > Kirill Korotaev. One of the main issues is on the netfilter
 > > virtualization (please refer to
 > > http://marc.theaimsgroup.com/?l=linux-kernel&m=114322107510852&w=2).
 >
 > Yes.  Virtualization of iptables is fairly silly: you can crash the
 > machine with careful insertion of bad rules.  We do simple sanity
 > checks, but they're not complete.
 >
 > Currently, that's OK, you have to be root to do these operations anyway.
 > But it illustrates one problem with iptables virtualization which the
 > OpenVZ people don't seem to have a grasp on 8(

As a first step, we started looking at a mechanism that can provide
a scalable way of implementing per-container tables and rules.
The safety issue you pointed out seems to be the next step to discuss,
should there be a need for more than the root-based scheme.

 > Do you have a pointer to the alternative implementation?

We are currently working on a patch against a 2.6.16 kernel and will be
able to point to it within this week. Instead of pointing to an incomplete
patch now, let me describe the proposed mechanism in further detail.

The OpenVZ patch can be considered intrusive for the following reasons:

1) It requires API changes to iptables modules such as filter, mangle, and
nat, as well as to existing out-of-tree and future iptables modules.
OpenVZ basically replaces the existing table data instance, such as
packet_filter in iptable_filter, so that each VPS has its own table data
instance in its "ve_struct" data structure. This might not be acceptable
because it changes the API / ABI.

2) It implements per-VPS ipt_tables, ipt_target, and ipt_match linked lists
in its "ve_struct" data structure. Rules and tables are purely local in the
sense that each VPS has its own set of chains, tables, matches, and targets,
and even the HW node cannot observe another VPS' iptables. It seems
undesirable that the HW node cannot be given system-wide visibility and
management functionality, both for manageability and, more importantly,
because of the safety issue you pointed out.

On the other hand, the Vserver approach of not providing any special
isolation would cause scalability issues when iptables rules need to be set
up for hundreds of vservers.

The following diagram illustrates the data structure of the proposed
virtualization scheme:

  **** Look at the attached file ****

This scheme can be seen as adding an extra indirection layer at the
private field of struct ipt_table instead of having a per-container
nf_hooks array, and it has the following potential advantages:

1) The "private" pointer seems a very natural place to implement this
indirection, and the changes needed for it are very small compared to the
OpenVZ approach.

2) The indirection is implemented in the "ip_tables" module, not in the
individual table, match, and target modules themselves. Because the changes
are confined to the ip_tables module, the API / ABI does not change and
the existing base of iptables modules keeps working.

3) The indirection at the private field of struct ipt_table provides
isolation for both paths, the one starting from nf_hooks[][] and the one
starting from xt, at the same time. The lengths of both of these paths are
O(1), and the length of each chain in a table is also O(1) with respect to
the number of containers.
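
To illustrate the idea in the list above, here is a minimal user-space
sketch of such an indirection. It is not the patch itself and does not
reproduce the attached diagram: struct table is a simplified stand-in for
struct ipt_table, and struct table_private, MAX_CONTAINERS, and the helper
functions are hypothetical names chosen for this example. The fixed-size
array indexed by container id is an assumption; it is simply the easiest
way to show a lookup whose cost does not depend on the number of
containers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONTAINERS 256   /* assumed upper bound, for illustration only */

/* Stand-in for the rule set that the private field normally points to. */
struct table_info {
    unsigned int num_entries;
};

/* Hypothetical indirection object: one rule set slot per container. */
struct table_private {
    struct table_info *per_container[MAX_CONTAINERS];
};

/* Simplified stand-in for struct ipt_table; only two fields are modelled. */
struct table {
    const char *name;
    void       *private;   /* points to a struct table_private here */
};

/* Install the rule set of container cid; returns 0, or -1 if cid is bad. */
static int table_set_info(struct table *t, unsigned int cid,
                          struct table_info *info)
{
    struct table_private *p = t->private;

    assert(p != NULL);
    if (cid >= MAX_CONTAINERS)
        return -1;
    p->per_container[cid] = info;
    return 0;
}

/* O(1) lookup of the rule set for container cid; NULL if none is set. */
static struct table_info *table_info_for(const struct table *t,
                                         unsigned int cid)
{
    const struct table_private *p = t->private;

    assert(p != NULL);
    if (cid >= MAX_CONTAINERS)
        return NULL;
    return p->per_container[cid];
}
```

Because the table object itself stays the same, code that only holds a
pointer to the table would not need to change in this model; only the code
that dereferences the private field has to pick the right slot.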

Assuming root-based operation, I wonder whether it would be workable
to add a command-line option that lets root specify which container to work
on.
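
As a rough sketch of what such an option could look like on the user-space
side, the function below parses a hypothetical "--container <id>" option
with getopt_long(). The option name, the numeric container id, and the
function name are assumptions for this example only; no such option exists
in iptables today, and all other options are simply ignored here.

```c
#include <errno.h>
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Look for the hypothetical "--container <id>" option.
 * Returns  0 and stores the id in *cid if the option was given,
 *          1 if it was absent (the caller would use its own context),
 *         -1 if the argument is missing or is not a non-negative number.
 */
static int parse_container_option(int argc, char **argv, long *cid)
{
    static const struct option opts[] = {
        { "container", required_argument, NULL, 1 },
        { NULL, 0, NULL, 0 }
    };
    int c;
    int result = 1;

    optind = 1;   /* restart scanning so the function can be called again */
    opterr = 0;   /* report errors through the return value only */

    /* A leading ':' makes getopt_long() return ':' for a missing argument. */
    while ((c = getopt_long(argc, argv, ":", opts, NULL)) != -1) {
        char *end;

        if (c == ':')
            return -1;      /* --container without an id */
        if (c != 1)
            continue;       /* unrelated option, ignored in this sketch */

        errno = 0;
        *cid = strtol(optarg, &end, 10);
        if (errno != 0 || end == optarg || *end != '\0' || *cid < 0)
            return -1;
        result = 0;
    }
    return result;
}
```

The parsed id would then have to be passed down to the kernel along with
the usual request, which is the part the patch needs to define.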

 > Thanks!
 > Rusty.

Thanks for your comments; I will send you the pointer to the patch as soon
as possible.
- JH