James Morris wrote:
> On Tue, 14 Nov 2006, Daniel Lezcano wrote:
>
>> the attached document describes the network isolation at the layer 2 and at
>> the layer 3, it presents the pros and cons of the different approaches, their
>> common points and the impacted network code.
>> I hope it will be helpful :)
>
> What about other network subsystems: xfrm, netfilter, iptables, netlink,
> etc. ?

They are not addressed for the moment. Dmitry Mishin is looking at
netfilter isolation.

Netlink has 2 aspects:

 * the communication between processes: because the message format uses
   a pid as destination, netlink will be addressed by the pid
   virtualization

 * the ip management. At the layer 2, there is nothing to do because the
   data accesses are relative to the namespace. At the layer 3, each case
   has to be handled to check the IPs. Some work is already done for the
   ifaddr isolation.

 * Jong Choi did some work on the iptables isolation (see below)

Cheers.

  -- Daniel

-------------------------------------------------------------------------------

Hi Rusty,

> > I'm currently looking at a container-based lightweight virtualization
> > technology in Linux which recently being actively discussed in LKML by
> > IBMers (Hubertus Franke, Dave Hansen, Serge Hallyn, and Cedric Le
> > Goater) and by developers like Herbert Poetzl, Eric Bierderman, and
> > Kirill Korotaev. One of the main issues is on the netfilter
> > virtualization (please refer to
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=114322107510852&w=2).
>
> Yes. Virtualization of iptables is fairly silly: you can crash the
> machine with careful insertion of bad rules. We do simple sanity
> checks, but they're not complete.
>
> Currently, that's OK, you have to be root to do these operations anyway.
> But it illustrates one problem with iptables virtualization which the
> OpenVZ people don't seem to have a grasp on 8(

As a first step, we started looking at a mechanism which can provide a
scalable way of implementing per-container tables and rules. The safety
issue you pointed out would be the next step to discuss, should there be
a need for more than the root-based scheme.

> Do you have a pointer to the alternative implementation?

We are currently working on a patch against a 2.6.16 kernel and will be
able to point to the patch later this week. Instead of pointing to an
incomplete patch now, let me describe the proposed mechanism in further
detail.

The OpenVZ patch can be considered intrusive for the following reasons:

1) It requires API changes to iptables modules such as filter, mangle
   and nat, and also to existing out-of-tree or future iptables modules.
   OpenVZ basically replaces the existing table data instance, such as
   packet_filter of iptable_filter, so that each VPS has its own table
   data instance in its "ve_struct" data structure. This might not be
   acceptable because it causes changes in API / ABI.

2) It implements per-VPS ipt_tables, ipt_target, and ipt_match linked
   lists in its "ve_struct" data structure. Rules and tables are purely
   local in the sense that each VPS has its own set of chains, tables,
   matches and targets, and even the HW node cannot observe another
   VPS's iptables.

It does not seem desirable that the HW node cannot get system-wide
visibility and management functionality, because of the manageability
issue and, more importantly, the safety issue you pointed out. On the
other hand, the Vserver approach of not providing any special isolation
would cause a scalability issue when iptables rules have to be set up
for hundreds of vservers.
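To make points 1) and 2) a bit more concrete, here is a minimal
user-space sketch of a per-VPS layout of the kind described above. It is
not taken from the OpenVZ patch: the names ve_sketch, vps_table,
ve_register_table and ve_find_table, and the fixed-size array used in
place of linked lists, are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TABLES 4   /* hypothetical limit, only for this sketch */

/* Hypothetical stand-in for one table instance (e.g. "filter"). */
struct vps_table {
    const char *name;
    int num_rules;
};

/* Hypothetical stand-in for a per-VPS structure holding its own tables. */
struct ve_sketch {
    int veid;
    struct vps_table tables[MAX_TABLES];
    int num_tables;
};

/* Register a table inside one VPS only; returns 0 on success, -1 if full. */
static int ve_register_table(struct ve_sketch *ve, const char *name)
{
    assert(ve != NULL && name != NULL);
    if (ve->num_tables >= MAX_TABLES)
        return -1;
    ve->tables[ve->num_tables].name = name;
    ve->tables[ve->num_tables].num_rules = 0;
    ve->num_tables++;
    return 0;
}

/* Lookup searches only the given VPS, so nothing outside it is visible. */
static struct vps_table *ve_find_table(struct ve_sketch *ve, const char *name)
{
    assert(ve != NULL && name != NULL);
    for (int i = 0; i < ve->num_tables; i++)
        if (strcmp(ve->tables[i].name, name) == 0)
            return &ve->tables[i];
    return NULL;
}
```

In this sketch a lookup performed in the host's context cannot find a
table registered in a VPS, which is the visibility problem mentioned
above.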
The following diagram illustrates the data structure of the proposed
virtualization scheme:

**** Look at the attached file ****

This scheme can be considered as adding an additional indirection layer
at the private field of struct ipt_table, instead of having a
per-container nf_hooks array, and has the following potential advantages:

1) The "private" pointer seems a very natural place to implement this
   indirection, and the changes needed for it are very small compared
   to the OpenVZ approach.

2) The indirection is implemented in the "ip_tables" module, not in the
   individual modules for tables, matches, and targets themselves.
   Because the changes are confined to the ip_tables module, it won't
   change the API / ABI and hence will keep the existing iptables module
   base.

3) The indirection at the private field of the "ipt_table" struct will
   provide isolation for both paths, starting from nf_hooks[][] and from
   xt, at the same time. The lengths of both of these paths are O(1) and
   the length of each chain in a table is also O(1) wrt the number of
   containers.

Assuming root-based operation, I wonder whether it would be a workable
approach to add a command line option that lets root specify which
container to work on.

> Thanks!
> Rusty.

Thanks for your comments; I will send you the pointer to the patch as
soon as possible.

- JH
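As a rough illustration of the indirection at the private field
described in the message above, here is a minimal user-space sketch. The
attached diagram is not reproduced here, so the layout is an assumption:
table_sketch, private_indirection, rules_sketch, table_rules,
total_rules and the fixed MAX_CONTAINERS array are hypothetical names
and simplifications, not the structures from the actual patch.

```c
#include <stddef.h>

#define MAX_CONTAINERS 8   /* hypothetical limit, only for this sketch */

/* Hypothetical per-container rule data for one table. */
struct rules_sketch {
    int num_rules;
};

/* Hypothetical indirection object reached through the private pointer. */
struct private_indirection {
    struct rules_sketch *per_container[MAX_CONTAINERS];
};

/* Simplified stand-in for struct ipt_table: one shared instance per table. */
struct table_sketch {
    const char *name;
    void *priv;   /* plays the role of the "private" field */
};

/* O(1) selection of a container's rules, independent of container count. */
static struct rules_sketch *table_rules(struct table_sketch *t, int cid)
{
    struct private_indirection *ind = t->priv;

    if (ind == NULL || cid < 0 || cid >= MAX_CONTAINERS)
        return NULL;
    return ind->per_container[cid];
}

/* Host-side view: the shared table can still walk every container. */
static int total_rules(struct table_sketch *t)
{
    struct private_indirection *ind = t->priv;
    int total = 0;

    if (ind == NULL)
        return 0;
    for (int i = 0; i < MAX_CONTAINERS; i++)
        if (ind->per_container[i] != NULL)
            total += ind->per_container[i]->num_rules;
    return total;
}
```

The table modules keep registering a single table object in this
sketch; only the code that dereferences the private pointer has to know
about containers, which is the reason the change stays confined to one
place.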