From mboxrd@z Thu Jan  1 00:00:00 1970
From: Matthew Wilcox <matthew@wil.cx>
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Wed, 6 Dec 2006 08:44:26 -0700
Message-ID: <20061206154426.GU3013@parisc-linux.org>
References: <1164984094.5761.86.camel@lade.trondhjem.org> <20061203015203.GA5656@schatzie.adilger.int> <Pine.LNX.4.62.0612021830030.31986@wtf.di.newdream.net> <20061204073200.GB5637@schatzie.adilger.int> <1165245336.711.176.camel@lade.trondhjem.org> <4574C48A.8030007@mcs.anl.gov> <1165298200.5776.26.camel@lade.trondhjem.org> <20061205100748.GC5871@infradead.org> <20061205142002.GN3013@parisc-linux.org> <4576DBE0.9090305@mcs.anl.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Christoph Hellwig <hch@infradead.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Andreas Dilger <adilger@clusterfs.com>,
	Sage Weil <sage@newdream.net>, Brad Boyer <flar@allandria.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	Gary Grider <ggrider@lanl.gov>, linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from palinux.external.hp.com ([192.25.206.14]:37429 "EHLO
	mail.parisc-linux.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S935790AbWLFPo2 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 6 Dec 2006 10:44:28 -0500
To: Rob Ross <rross@mcs.anl.gov>
Content-Disposition: inline
In-Reply-To: <4576DBE0.9090305@mcs.anl.gov>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
> The openg() solution has the following advantages to what you propose. 
> First, it places the burden of the communication of the file handle on 
> the application process, not the file system. That means less work for 
> the file system. Second, it does not require that clients respond to 
> unexpected network traffic. Third, the network traffic is deterministic 
> -- one client interacts with the file system and then explicitly 
> performs the broadcast. Fourth, it does not require that the file system 
> store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.
:
: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?
:
: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)
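
To make the magic-number idea concrete, here is a minimal sketch of how a
lookup keyed on the leading magic number might work.  Everything in it is an
assumption for illustration: the layout of struct fh, the handler table and
the stub open functions are hypothetical and come from no proposal text.
Only the two magic values (0xEF53 for ext2, 0x6969 for NFS) are taken from
include/linux/magic.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle layout: a 32-bit filesystem magic number followed
 * by opaque bytes that only the owning filesystem can interpret. */
#define FH_OPAQUE_LEN 28

struct fh {
    uint32_t magic;                       /* identifies the owning filesystem */
    unsigned char opaque[FH_OPAQUE_LEN];  /* filesystem-private payload */
};

/* Hypothetical per-filesystem hook: translate a handle into an fd-like int. */
typedef int (*fh_open_fn)(const struct fh *fh);

struct fh_handler {
    uint32_t magic;
    fh_open_fn open;
};

/* Stub translators standing in for real filesystem code. */
static int ext2_like_open(const struct fh *fh) { (void)fh; return 100; }
static int nfs_like_open(const struct fh *fh)  { (void)fh; return 200; }

static const struct fh_handler handlers[] = {
    { 0xEF53, ext2_like_open },  /* EXT2_SUPER_MAGIC */
    { 0x6969, nfs_like_open },   /* NFS_SUPER_MAGIC */
};

/* Route the handle to the filesystem whose magic number matches.
 * Returns the handler's result, or -1 if no filesystem claims it. */
int fh_dispatch(const struct fh *fh)
{
    size_t i;

    for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
        if (handlers[i].magic == fh->magic)
            return handlers[i].open(fh);
    }
    return -1;
}
```

Note that this only settles which filesystem is asked to translate the
handle.  It does nothing for the lifetime and credential questions above.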

Christoph has also touched on some of these points, and added some I
missed.

> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
> the flag) would likely cause a storm of network traffic if clients were 
> closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There are several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.
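
As a rough sketch of the last option (the server staying silent if it has
recently broadcast), something like the following could work.  The struct,
the function name and the millisecond clock are all hypothetical, and a real
cluster filesystem would need per-file state and locking that this leaves
out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file throttle state kept by the server. */
struct bcast_throttle {
    uint64_t last_sent_ms;  /* time of the most recent broadcast */
    uint64_t min_gap_ms;    /* minimum spacing between broadcasts */
    bool     sent_once;     /* false until the first broadcast goes out */
};

/* Decide whether to broadcast in response to a request arriving at now_ms.
 * Returns false when a broadcast went out less than min_gap_ms ago; the
 * server then says nothing and relies on the client retransmitting. */
bool bcast_should_send(struct bcast_throttle *t, uint64_t now_ms)
{
    if (t->sent_once && now_ms - t->last_sent_ms < t->min_gap_ms)
        return false;

    t->last_sent_ms = now_ms;
    t->sent_once = true;
    return true;
}
```

With a 100ms gap, a request at t=1000 triggers a broadcast, one at t=1050 is
ignored, and one at t=1100 triggers the next broadcast.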

> However, the application change issue is actually moot; we will make 
> whatever changes inside our MPI-IO implementation, and many users will 
> get the benefits for free.

That's good.

> The readdirplus(), readx()/writex(), and openg()/openfh() were all 
> designed to allow our applications to explain exactly what they wanted 
> and to allow for explicit communication. I understand that there is a 
> tendency toward solutions where the FS guesses what the app is going to 
> do or is passed a hint (e.g. fadvise) about what is going to happen, 
> because these things don't require interface changes. But these 
> solutions just aren't as effective as actually spelling out what the 
> application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() make me think you haven't really looked
at it from the "what could go wrong" perspective.  I'd be interested in
reviewing the readx()/writex() interfaces, but still don't see a document
for them anywhere.