From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Aneesh Kumar K. V" <aneesh.kumar@linux.vnet.ibm.com>
Subject: Re: [PATCH -V3] Generic name to handle and open by handle syscalls
Date: Sun, 25 Apr 2010 23:51:56 +0530
Message-ID: <87k4rv1jgb.fsf@linux.vnet.ibm.com>
References: <1271960133-16414-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com> <DABA9B7F-548A-489D-ABED-689383E7E2DF@mit.edu> <F4F339E7-3C44-4DAB-9C89-5E665D3CDE24@sun.com> <20100424110812.40989988@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Theodore Tso <tytso@mit.edu>, hch@infradead.org,
	viro@zeniv.linux.org.uk, corbet@lwn.net,
	linux-fsdevel@vger.kernel.org, sfrench@us.ibm.com
To: Neil Brown <neilb@suse.de>, Andreas Dilger <adilger@sun.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e28smtp06.in.ibm.com ([122.248.162.6]:42158 "EHLO
	e28smtp06.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753727Ab0DYSWD (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Sun, 25 Apr 2010 14:22:03 -0400
Received: from d28relay05.in.ibm.com (d28relay05.in.ibm.com [9.184.220.62])
	by e28smtp06.in.ibm.com (8.14.3/8.13.1) with ESMTP id o3PIM1SS012288
	for <linux-fsdevel@vger.kernel.org>; Sun, 25 Apr 2010 23:52:01 +0530
Received: from d28av05.in.ibm.com (d28av05.in.ibm.com [9.184.220.67])
	by d28relay05.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o3PIM16W3498080
	for <linux-fsdevel@vger.kernel.org>; Sun, 25 Apr 2010 23:52:01 +0530
Received: from d28av05.in.ibm.com (loopback [127.0.0.1])
	by d28av05.in.ibm.com (8.14.3/8.13.1/NCO v10.0 AVout) with ESMTP id o3PIM0rO004119
	for <linux-fsdevel@vger.kernel.org>; Mon, 26 Apr 2010 04:22:01 +1000
In-Reply-To: <20100424110812.40989988@notabene.brown>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Sat, 24 Apr 2010 11:08:12 +1000, Neil Brown <neilb@suse.de> wrote:
> On Fri, 23 Apr 2010 18:19:59 -0600
> Andreas Dilger <adilger@sun.com> wrote:
> 
> > On 2010-04-23, at 07:23, Theodore Tso wrote:
> > > 
> > > Something to consider is whether anything bad happens if there are multiple filesystems mounted with the same UUID.  I can think of two ways this could happen.   One is when we make a read-only LVM snapshot of a filesystem, and then mount it to back up a stable snapshot.  This might happen if the sysadmin is trying to back up a SQL database, for example; the database gets frozen, we take a snapshot, and then we unfreeze the database and mount the snapshot.   Now suppose we try to open-by-handle the mysql database --- should the system return a file from the r/o frozen snapshot, or from the r/w file system?
> > 
> > I'd say from the r/w LV in virtually all cases.  We are safe from totally egregious errors, because the inode+generation will prevent totally incorrect files from being returned, but newer/older versions of the same file/directory may be found.
> > 
> > > Something we might do is to add a check and refuse mounting file systems with duplicate UUIDs, and change the LVM snapshot code to run some kind of hook after a snapshot which runs a "tune2fs -U random" on the snapshot.   For r/o LVM snapshots, we could also put in a hack that if there are two file systems mounted, one r/o and one r/w, we return the r/w file system.
> > 
> > I think this may break things if we change the UUID when a snapshot is created, because we don't know what userspace might be using the UUID for.  That said, I totally agree that returning the r/w LV makes sense.  The LVM code itself understands which LV is the primary and which is the snapshot, so it likely means that the "lookup the UUID" code might need to be smarter.
> > 
> > Probably the simplest thing is that if a second filesystem with the same UUID is mounted, it is skipped.  If we keep the UUID list in FIFO order, that should be sufficient to ensure that the "primary" version is returned first.
> > 
> 
> I really think this sounds too much like 'policy'.  It is not a trivially
> obvious algorithm for selecting the 'right' filesystem.  It depends on the
> order things have happened, which might be right for the case that you are
> thinking of, but might be wrong for some other case.
> 
> I haven't been following the conversation closely so I might have missed
> something, but why don't we leave the mapping from handle->filesystem up to
> userspace and just do the "filesystem+handle -> file" part in the kernel?
> (i.e. just what nfsd does).
> 
> From the kernel's perspective, the only unique identifier for a file system
> is a (sometimes fictitious or arbitrary) device number.  Using anything else
> (except maybe a mount point) in a kernel interface just seems wrong.
> 
> Maybe map the filesystem part of the handle from UUID (or whatever) to devno
> in userspace, then pass the devno+file-part-of-handle to the kernel to
> perform the final mapping.
> 

My earlier patchset[1] more or less did that. It took a mountdir fd
and a file-system-unique handle to identify the inode within that file
system. The idea was to let the userspace NFS server track the mount
points in the exported path and pass the right mountdir fd when it
makes an open-by-handle syscall. The NFS server can add extra
information to the handle returned from the syscall to indicate which
mountdir fd should be used with the open-by-handle request.
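
To make the userspace side of that scheme concrete, here is a rough
sketch of the bookkeeping such a server might do. Everything in it is
made up for illustration (struct export_mount, struct wire_handle,
register_export(), wrap_handle(), mount_fd_for(), the size limits); it
is not code from the patchset, and the kernel handle is treated as
opaque bytes.

```c
#include <stdint.h>
#include <string.h>

#define MAX_EXPORTS 8     /* illustrative limit, not from the patchset */
#define MAX_KHANDLE 128   /* illustrative upper bound on kernel handle size */

/* Hypothetical record the server keeps per exported mount point. */
struct export_mount {
    uint32_t export_id;   /* server-chosen id, embedded in the wire handle */
    int mount_fd;         /* fd the server holds open on the mount directory */
};

/* Hypothetical handle sent to clients: server prefix + opaque kernel handle. */
struct wire_handle {
    uint32_t export_id;
    uint32_t klen;
    unsigned char kbytes[MAX_KHANDLE];
};

static struct export_mount exports[MAX_EXPORTS];
static int nexports;

/* Remember which mountdir fd belongs to which export id. */
int register_export(uint32_t export_id, int mount_fd)
{
    if (nexports == MAX_EXPORTS)
        return -1;
    exports[nexports].export_id = export_id;
    exports[nexports].mount_fd = mount_fd;
    nexports++;
    return 0;
}

/* Prepend the export id to the handle the kernel returned. */
int wrap_handle(struct wire_handle *wh, uint32_t export_id,
                const void *khandle, uint32_t klen)
{
    if (klen > MAX_KHANDLE)
        return -1;
    wh->export_id = export_id;
    wh->klen = klen;
    memcpy(wh->kbytes, khandle, klen);
    return 0;
}

/*
 * Pick the mountdir fd for an incoming handle. The caller would pass
 * this fd, together with wh->kbytes, to the open-by-handle syscall.
 * Returns -1 if the export id is unknown.
 */
int mount_fd_for(const struct wire_handle *wh)
{
    for (int i = 0; i < nexports; i++)
        if (exports[i].export_id == wh->export_id)
            return exports[i].mount_fd;
    return -1;
}
```

With this split the kernel never has to decide which of two file
systems with the same UUID is meant; that choice is made by whoever
fills in the export table.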


[1] http://article.gmane.org/gmane.linux.file-systems/38909
    Message-id:1268932144-14105-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com

-aneesh