From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Lezcano
Subject: Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
Date: Mon, 22 Mar 2010 09:40:32 +0100
Message-ID: <4BA72D00.7040406@free.fr>
References: <1268960401-16680-1-git-send-email-orenl@cs.columbia.edu>
 <1268960401-16680-4-git-send-email-orenl@cs.columbia.edu>
 <20100320044310.GC2887@count0.beaverton.ibm.com>
 <20100321172703.GC4174@shareable.org>
 <20100321194019.GA11714@hallyn.com>
 <4BA68884.3080003@free.fr>
 <4BA6914D.8040007@cs.columbia.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Serge E. Hallyn" , linux-fsdevel@vger.kernel.org, containers@lists.linux-foundation.org, Jamie Lokier , Andreas Dilger
To: Oren Laadan
Return-path:
Received: from mtagate3.uk.ibm.com ([194.196.100.163]:41429 "EHLO mtagate3.uk.ibm.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1753582Ab0CVIkq (ORCPT );
 Mon, 22 Mar 2010 04:40:46 -0400
Received: from d06nrmr1407.portsmouth.uk.ibm.com (d06nrmr1407.portsmouth.uk.ibm.com [9.149.38.185])
 by mtagate3.uk.ibm.com (8.13.1/8.13.1) with ESMTP id o2M8eboK013532
 for ; Mon, 22 Mar 2010 08:40:37 GMT
Received: from d06av03.portsmouth.uk.ibm.com (d06av03.portsmouth.uk.ibm.com [9.149.37.213])
 by d06nrmr1407.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o2M8ebBV970954
 for ; Mon, 22 Mar 2010 08:40:37 GMT
Received: from d06av03.portsmouth.uk.ibm.com (loopback [127.0.0.1])
 by d06av03.portsmouth.uk.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id o2M8eaOQ020358
 for ; Mon, 22 Mar 2010 08:40:37 GMT
In-Reply-To: <4BA6914D.8040007@cs.columbia.edu>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Oren Laadan wrote:
>
>
> Daniel Lezcano wrote:
>> Serge E. Hallyn wrote:
>>> Quoting Jamie Lokier (jamie@shareable.org):
>>>
>>>> Matt Helsley wrote:
>>>>
>>>>>> That said, if the intent is to allow the restore to be done on
>>>>>> another node with a "similar" filesystem (e.g.
>>>>>> created by rsync/node
>>>>>> image), instead of having a coherent distributed filesystem on all
>>>>>> of the nodes then the filename makes sense.
>>>>>>
>>>>> Yes, this is the intent.
>>>>>
>>>> I would worry about programs which are using files which have been
>>>> deleted, renamed, or (very common) renamed-over by another process
>>>> after being opened, as there's a good chance they will successfully
>>>> open the wrong file after c/r, and corrupt state from then on.
>>>>
>>> Userspace is expected to back up and restore the filesystem, for
>>> instance using a btrfs snapshot or a simple rsync or tar.
>>>
>>>
>> That does not solve the problem Jamie is talking about.
>> A rsync or a tar will not see a deleted file and using a btrfs to
>> have the CR to work with the deleted files is a bit overkill, no ?
>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use a capable file system, like btrfs, or (modified) nilfs.
>
> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>
> d) Expect userspace to adapt to changes if they occur, e.g. by having
> the application be aware of the possibility, or by providing a wrapper
> that will do some magic prior to restart (by looking at the checkpoint
> image).
>
> Options a,b,c are all transparent to the application, while option
> d requires that applications become aware of c/r. That's ok, but our
> primary goal is to be generic enough to support unmodified applications.
>
> 2) Deleted files:
> -----------------
> The requirement is that at restart we'll be able to restore the file
> pointer in the kernel to a deleted file with the same properties and
> contents as it had at the time of the checkpoint.
>
> The alternatives we considered are:
>
> e) For each deleted file, save the contents of that file as part of
> the checkpoint image;
> At restart - create a new file, populate with the contents, open it
> (to get an active file pointer), and finally unlink it, so it is -
> again - deleted.
>
> f) At checkpoint time, create a file (from scratch) in a dedicated
> area of the file system (userspace configurable?), and copy the
> contents of the deleted file to this file. Only save the file system
> state after this is done.
> At restart, open the alternative file instead, and then immediately
> delete it.
>
> g) At checkpoint time, re-link the file to a dedicated area of the
> file system. This requires support from the underlying file system,
> of course. For instance, it's trivial for ext2,3 but IIRC will need
> help for ext4. Re-linking is essentially attaching a new filename
> to an existing inode that is still referenced but is otherwise not
> reachable - and make it reachable again.
> At restart, open the re-linked file and then immediately delete it.
>
>> I have another question about the deleted files. How is the case
>> handled when a process has a deleted mapped file but without an
>> associated file descriptor ?
>>
>
> It works the same as with non-deleted files (assuming that we know
> how to handle deleted files in general, e.g. options e,f,g above):
>
> To checkpoint a task's mm we loop through the vma's and checkpoint
> them. For a vma that corresponds to a mapped file, we first save
> the vma->vm_file. In turn, for a file pointer we save the filename,
> properties, credentials. A file pointer is saved as an independent
> object - and is assigned a unique id - objref. The state of the vma
> will indicate this objref.
>
> At restart, we will first see the file pointer object, and will
> open the file to create a corresponding file pointer. Later when
> we restore the vma, we'll locate the (new) file pointer using the
> objref and use it in mmap.
>
> Oren.
>

Thanks Oren for the detailed answer.
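
For illustration only, here is a minimal userspace sketch of option (e)
combined with the objref lookup described above. It is not the kernel code
from the patch set. The image layout (a dict with "scratch_dir", "files" and
"vmas" keys) and the names restore_deleted_file() and restart() are
hypothetical, made up for this example. It assumes a Unix-like system where
an unlinked file stays usable through its open descriptor.

```python
import mmap
import os
import tempfile


def restore_deleted_file(contents: bytes, scratch_dir: str) -> int:
    """Option (e): create a new file, populate it, keep it open, unlink it.

    Returns an open descriptor to a file that no longer has a name.
    """
    fd, path = tempfile.mkstemp(dir=scratch_dir)  # create a new file
    try:
        view = memoryview(contents)
        while view:  # populate it; os.write may write fewer bytes than asked
            view = view[os.write(fd, view):]
        os.lseek(fd, 0, os.SEEK_SET)
    except BaseException:
        os.close(fd)
        raise
    finally:
        os.unlink(path)  # so it is, again, deleted
    return fd


def restart(image: dict):
    """Restore file objects first, then the vmas that refer to them.

    `image` is a hypothetical stand-in for a checkpoint image:
      image["files"]: objref -> saved contents of a deleted file
      image["vmas"]:  list of {"objref": ..., "length": ...}
    """
    objs = {}  # objref -> restored descriptor
    for objref, contents in image["files"].items():
        objs[objref] = restore_deleted_file(contents, image["scratch_dir"])

    maps = []
    for vma in image["vmas"]:
        fd = objs[vma["objref"]]  # locate the (new) file via its objref
        maps.append(mmap.mmap(fd, vma["length"], access=mmap.ACCESS_READ))
    return objs, maps
```

Because the mapping is created from the restored file object, the task does
not need to hold a descriptor of its own: in this sketch the descriptor in
the objref table can be closed once the mmap exists, and the mapping stays
valid.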