From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Lezcano
Subject: Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.
Date: Sat, 06 Mar 2010 22:26:30 +0100
Message-ID: <4B92C886.9020507@free.fr>
References: <4B88E431.6040609@parallels.com> <4B894564.7080104@parallels.com>
	<4B89727C.9040602@parallels.com> <4B8AE8C1.1030305@free.fr>
	<4B8D28CF.8060304@parallels.com> <20100302211942.GA17816@us.ibm.com>
	<20100303000743.GA13744@us.ibm.com> <4B8E9370.3050300@parallels.com>
	<4B9158F5.5040205@parallels.com> <4B926B1B.5070207@free.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: "Eric W. Biederman"
Cc: Pavel Emelyanov, Linux Netdev List,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Netfilter Development Mailinglist, Ben Greear, Sukadev Bhattiprolu
List-Id: containers.vger.kernel.org

Eric W. Biederman wrote:
> Daniel Lezcano writes:
>
>> Eric W. Biederman wrote:
>>
>> If the normal rules of parentage apply, that means pid 0 has to wait for its child.
>> If we are in the scenario of pid 0, its child pid 1234 and we kill the pid 1 of
>> the pid namespace, I suppose pid 1234 will be killed too.
>> The pid 0 will stay in the pid namespace and will be able to fork again a new pid
>> 1.
>>
>> I think Serge already reported that...
>>
>> That sounds good :)
>>
>
> I expect zap_pid_ns_processes should also arrange so we cannot allocate any
> more processes. We certainly need to do something explicit or pid 1 won't
> be allocated. It might make sense to resurrect a pid namespace after its
> death but it is definitely weird.
>

Mmh, yes. But that was just an idea, maybe a bit outside the scope you
are aiming for.

>>> In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
>>> don't think anything I am doing fundamentally undermines it. The use
>>> case of doing things in fork is that there is automatic inheritance of
>>> everything. All of the namespaces and all of the control groups, and
>>> possibly also the parent process.
>>>
>> And also the rootfs for executing the command inside the container
>> (eg. shutdown), the uid/gid (if there is a user namespace), the mount points,
>> ...
>> But I suppose we can do the same with setns for all the namespaces and chrooting
>> within the container rootfs.
>>
>> What I see is a problem with the tty. For example, we cloneat the init process
>> of the container which is usually /sbin/init but this one has its tty mapped to
>> /dev/console, so the output of the exec'ed command will go to the console.
>>
>
> My original thinking was that the fds would come from the caller of sys_cloneat....

Oh, ok :s

>>> Overall it sounds like the semantics I have proposed with
>>> unshare(CLONE_NEWPID) are workable, and simple to implement. The
>>> extra fork is a bit surprising but it certainly does not
>>> look like a show stopper for implementing a pid namespace join.
>>>
>> I agree, it's some kind of "ghost" process.
>> IMO, with a bit of userspace code it would be possible to enter or exec a
>> command inside a container with nsfd, setns.
>>
>> +1 to test your patchset Eric :)
>>
>
> I will see about reposting sometime soon.
>

Great! Thanks.

>> Just a mindless suggestion, the "nsopen" / "nsattach" syscall names would be
>> clearer, no?
>>
>
> Not bad suggestions.
>
> I am going to explore a bit more. Given that nsfd is using the same
> permission checks as a proc file, I think I can just make it a proc
> file. Something like "/proc//ns/net". With a little luck that
> won't suck too badly.
>

Ah, yes! Good idea.
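
To make the "bit of userspace code" above concrete, here is a rough sketch of
entering a container through per-process namespace files and setns, then
chrooting and exec'ing a command. It is only an illustration under assumptions,
not the patchset's interface: it assumes the proc files end up laid out as
/proc/<pid>/ns/<name> and that setns takes (fd, nstype) as in the glibc wrapper
of later kernels. The helper names ns_path, enter_ns and enter_and_exec and the
rootfs parameter are made up for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build "/proc/<pid>/ns/<name>" (assumed layout).
 * Returns 0 on success, -1 if the buffer is too small. */
int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: join one namespace of process `pid`.
 * Returns 0 on success, -1 on error. Needs CAP_SYS_ADMIN in practice. */
int enter_ns(pid_t pid, const char *name, int nstype)
{
    char path[64];
    int fd, ret;

    if (ns_path(path, sizeof(path), pid, name) < 0)
        return -1;
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ret = setns(fd, nstype);   /* nstype checks the fd refers to that kind of ns */
    close(fd);
    return ret;
}

/* Hypothetical helper: join the net and mount namespaces of `pid`,
 * chroot into `rootfs` (an assumed container root path), then exec argv.
 * Returns -1 on any failure; does not return on success. */
int enter_and_exec(pid_t pid, const char *rootfs, char *const argv[])
{
    if (enter_ns(pid, "net", CLONE_NEWNET) < 0 ||
        enter_ns(pid, "mnt", CLONE_NEWNS) < 0)
        return -1;
    if (chroot(rootfs) < 0 || chdir("/") < 0)
        return -1;
    execvp(argv[0], argv);
    return -1;                 /* only reached if exec failed */
}
```

Since the command is exec'ed in the calling process, stdin/stdout/stderr stay
those of the caller rather than the container init's /dev/console, which is the
tty behaviour discussed above. Joining the pid namespace is left out of the
sketch because, as noted in the thread, that needs the extra fork.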