From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Lezcano <daniel.lezcano@free.fr>
Subject: Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
Date: Mon, 22 Mar 2010 09:40:32 +0100
Message-ID: <4BA72D00.7040406@free.fr>
References: <1268960401-16680-1-git-send-email-orenl@cs.columbia.edu>	<1268960401-16680-4-git-send-email-orenl@cs.columbia.edu>	<F18D161D-850B-4C82-83D5-1F19D573E84F@sun.com>	<20100320044310.GC2887@count0.beaverton.ibm.com>	<20100321172703.GC4174@shareable.org>	<20100321194019.GA11714@hallyn.com> <4BA68884.3080003@free.fr> <4BA6914D.8040007@cs.columbia.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Serge E. Hallyn" <serge@hallyn.com>,
	linux-fsdevel@vger.kernel.org,
	containers@lists.linux-foundation.org,
	Jamie Lokier <jamie@shareable.org>,
	Andreas Dilger <adilger@sun.com>
To: Oren Laadan <orenl@cs.columbia.edu>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mtagate3.uk.ibm.com ([194.196.100.163]:41429 "EHLO
	mtagate3.uk.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753582Ab0CVIkq (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Mon, 22 Mar 2010 04:40:46 -0400
Received: from d06nrmr1407.portsmouth.uk.ibm.com (d06nrmr1407.portsmouth.uk.ibm.com [9.149.38.185])
	by mtagate3.uk.ibm.com (8.13.1/8.13.1) with ESMTP id o2M8eboK013532
	for <linux-fsdevel@vger.kernel.org>; Mon, 22 Mar 2010 08:40:37 GMT
Received: from d06av03.portsmouth.uk.ibm.com (d06av03.portsmouth.uk.ibm.com [9.149.37.213])
	by d06nrmr1407.portsmouth.uk.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o2M8ebBV970954
	for <linux-fsdevel@vger.kernel.org>; Mon, 22 Mar 2010 08:40:37 GMT
Received: from d06av03.portsmouth.uk.ibm.com (loopback [127.0.0.1])
	by d06av03.portsmouth.uk.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id o2M8eaOQ020358
	for <linux-fsdevel@vger.kernel.org>; Mon, 22 Mar 2010 08:40:37 GMT
In-Reply-To: <4BA6914D.8040007@cs.columbia.edu>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Oren Laadan wrote:
>
>
> Daniel Lezcano wrote:
>> Serge E. Hallyn wrote:
>>> Quoting Jamie Lokier (jamie@shareable.org):
>>>  
>>>> Matt Helsley wrote:
>>>>    
>>>>>> That said, if the intent is to allow the restore to be done on
>>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>>> image), instead of having a coherent distributed filesystem on all
>>>>>> of the nodes then the filename makes sense.
>>>>>>         
>>>>> Yes, this is the intent.
>>>>>       
>>>> I would worry about programs which are using files which have been
>>>> deleted, renamed, or (very common) renamed-over by another process
>>>> after being opened, as there's a good chance they will successfully
>>>> open the wrong file after c/r, and corrupt state from then on.
>>>>     
>>> Userspace is expected to back up and restore the filesystem, for
>>> instance using a btrfs snapshot or a simple rsync or tar.
>>>
>>>   
>> That does not solve the problem Jamie is talking about.
>> An rsync or a tar will not see a deleted file, and requiring btrfs
>> just to make C/R work with deleted files is a bit overkill, no?
>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use a capable file system, like btrfs or a (modified) nilfs.
>
> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>
> d) Expect userspace to adapt to changes if they occur, e.g. by having
> the application be aware of the possibility, or by providing a wrapper
> that will do some magic prior to restart (by looking at the checkpoint
> image).
>
> Options a,b,c are all transparent to the application, while option
> d requires that applications become aware of c/r. That's ok, but our
> primary goal is to be generic enough to support unmodified applications.
>
> 2) Deleted files:
> -----------------
> The requirement is that at restart we'll be able to restore the file
> pointer in the kernel to a deleted file with the same properties and
> contents as it had at the time of the checkpoint.
>
> The alternatives we considered are:
>
> e) For each deleted file, save the contents of that file as part of
> the checkpoint image;
> At restart - create a new file, populate with the contents, open it
> (to get an active file pointer), and finally unlink it, so it is -
> again - deleted.
>
> f) At checkpoint time, create a file (from scratch) in a dedicated
> area of the file system (userspace configurable?), and copy the
> contents of the deleted file to this file. Only save the file system
> state after this is done.
> At restart, open the alternative file instead, and then immediately
> delete it.
>
> g) At checkpoint time, re-link the file to a dedicated area of the
> file system. This requires support from the underlying file system,
> of course. For instance, it's trivial for ext2,3 but IIRC will need
> help for ext4. Re-linking is essentially attaching a new filename
> to an existing inode that is still referenced but is otherwise not
> reachable, making it reachable again.
> At restart, open the re-linked file and then immediately delete it.
>
>> I have another question about the deleted files. How is the case
>> handled when a process has a deleted mapped file but no associated
>> file descriptor?
>>
>
> It works the same as with non-deleted files (assuming that we know
> how to handle deleted files in general, e.g. options e,f,g above):
>
> To checkpoint a task's mm we loop through the vma's and checkpoint
> them. For a vma that corresponds to a mapped file, we first save
> the vma->vm_file. In turn, for a file pointer we save the filename,
> properties, credentials. A file pointer is saved as an independent
> object - and is assigned a unique id - objref. The state of the vma
> will record this objref.
>
> At restart, we will first see the file pointer object, and will
> open the file to create a corresponding file pointer. Later when
> we restore the vma, we'll locate the (new) file pointer using the
> objref and use it in mmap.
>
> Oren.
>

Thanks Oren for the detailed answer.