From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751883Ab1JSSWi (ORCPT ); Wed, 19 Oct 2011 14:22:38 -0400 Received: from mail-iy0-f174.google.com ([209.85.210.174]:59089 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751028Ab1JSSWg (ORCPT ); Wed, 19 Oct 2011 14:22:36 -0400 Date: Wed, 19 Oct 2011 11:22:28 -0700 From: Tejun Heo To: Pavel Emelyanov Cc: Cyrill Gorcunov , "linux-kernel@vger.kernel.org" , Andrey Vagin , James Bottomley , Glauber Costa , "H. Peter Anvin" , Ingo Molnar , Dave Hansen , "Eric W. Biederman" , Daniel Lezcano , Alexey Dobriyan , Linus Torvalds , Oleg Nesterov Subject: Re: [patch 5/5] elf: Add support for loading ET_CKPT files Message-ID: <20111019182228.GJ25124@google.com> References: <20111014110416.552685686@openvz.org> <20111014110511.670174429@openvz.org> <20111014171033.GC4294@google.com> <20111014173304.GD4294@google.com> <4E9E9255.7090601@parallels.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4E9E9255.7090601@parallels.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Pavel. On Wed, Oct 19, 2011 at 01:03:17PM +0400, Pavel Emelyanov wrote: > > I don't think this is a good idea. We already have most of interface > > necessary for restoring task state and there's no need to put it into > > the kernel as one piece. If there's something which can't be done > > from userland using existing interfaces, let's please discuss what > > they are and whether they can be resolved differently first. > > The rest API we will require will look too special-purpose (like Cyrill > mentioned the ability to set the mm's fields such as start_code, etc). I find that quite difficult to agree with. We're talking about some minor additions to prctl against updating exec path to do something it was never designed to do + new binary format. > Besides, putting a parasite code to restore the memory state will drive > us into certain limitations. E.g. - what address range in the target task > address space should the parasite occupy? What if we fail in finding one, > how should we act next? How is that different from snapshotting time? Unless the address is completely packed, I can't see how that can be a problem. In the worst case, you can use the parasite to map what's not overlapping and then let the ptracer restore the overlapping areas after parasite is done. > > The exec path is very intricate as it is and it would be very easy to > > introduce security or other bugs by altering its basic behavior. > > Can you elaborate a bit more on this? "Other" bugs can be introduced by > any piece of code injected into the kernel, but what about security? exec is a pretty important security boundary. A lot of resources (fd, VMAs, credentails) have security-sensitive policies forced across exec boundary and there are enough places which depends on the all resetting property of exec (e.g. no other user of thread-group wide resources). > > exec presumes destruction of (most of) the current process and all its > > other threads and replacing them w/ fresh states from an executable. > > This is *exactly* what is expected from the restoration process. Umm.... so why does the patch skip zapping? And if you are gonna be zapping, how are you gonna do multiple threads? > > I see that you removed zapping to allow restoring multi-threaded > > process, which seems quite scary to me. It might be okay, I don't > > know, but IMHO it just isn't a very good idea to introduce such > > variation to basic exec behavior for this rather narrow use case. > > Why? Can you describe your thought on this more, maybe I'm really missing > smth important. Because it breaks one of the very basic assumptions - there are no other tasks in the thread group and thus resources usually shared among threadgroup can be assumed to be exclusive. > > In addition, leaving it alone doesn't really solve multi-threaded > > restoring, does it? > > So your main concern is about multy-threaded tasks, right? I think we can > prepare the exec handler capable to show how this can look like. I'm strongly against that direction regardless of implementation details. > > The current code doesn't seem to be doing anything but if you're gonna add > > it later, how are you gonna synchronize other threads? > > Synchronize what? If we're providing to a userspace a way to put things right > it's up to userspace to use it correctly. Like we saw this wit pid namespaces - > there's are plenty of ways how to kill the app with CLONE_NEWPID clone flag, but > the intention of one is not the be 100% fool-proof, but to provide an ability > to do what userspace need. To shared kernel data structures. > One of the abilities to restore multy-threaded task can be - clone threads from > the userspace, then wait for them to prepare themselves the way they need it, > then the threads go to sleep and the leader one calls execve. Thus the userspace > state consistency is OK. If you're gonna let userspace do it, why bother with in-kernel thing at all? > > * It's still calling exec_mmap(), so the exec'ing thread will be on > > the new mm while other threads would still be on the old one. > > Why do you think it's a problem? Ummm.... they aren't on the same address space? How can that be okay? It's not only wrong from CR POV. The kernel disallows CLONE_THREAD/SIGHAND w/o CLONE_VM and depend on that assumption, you're breaking that. > > * There are operations which assume that the calling task is > > de-threaded and thus has exclusive access to data structures which > > can be shared among the thread group (e.g. signal struct). How are > > accesses to them synchronized? > > If we're talking about keeping the userspace state solid, then stopping the > other threads solves this. Again, there are kernel exclusivity assumptions. > That's a pity. As I stated above, we'll try to demonstrate the exec handler > capable to restore the multy-threaded APP and send the 2nd RFC. Can you answer > my questions above so we keep your concerns in mind while doing this, please. IMHO, this is a fundamentally broken approach which isn't even necessary. So, FWIW, NACK. Thank you. -- tejun