From mboxrd@z Thu Jan  1 00:00:00 1970
From: Rob Ross <rross@mcs.anl.gov>
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Wed, 06 Dec 2006 09:04:00 -0600
Message-ID: <4576DBE0.9090305@mcs.anl.gov>
References: <1164950795.5761.25.camel@lade.trondhjem.org> <Pine.LNX.4.62.0611302157580.10257@wtf.di.newdream.net> <1164984094.5761.86.camel@lade.trondhjem.org> <20061203015203.GA5656@schatzie.adilger.int> <Pine.LNX.4.62.0612021830030.31986@wtf.di.newdream.net> <20061204073200.GB5637@schatzie.adilger.int> <1165245336.711.176.camel@lade.trondhjem.org> <4574C48A.8030007@mcs.anl.gov> <1165298200.5776.26.camel@lade.trondhjem.org> <20061205100748.GC5871@infradead.org> <20061205142002.GN3013@parisc-linux.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Christoph Hellwig <hch@infradead.org>,
	Trond Myklebust <trond.myklebust@fys.uio.no>,
	Andreas Dilger <adilger@clusterfs.com>,
	Sage Weil <sage@newdream.net>, Brad Boyer <flar@allandria.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	Gary Grider <ggrider@lanl.gov>, linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mailgw.mcs.anl.gov ([140.221.9.4]:34113 "EHLO
	mailgw.mcs.anl.gov" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S933073AbWLFPED (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 6 Dec 2006 10:04:03 -0500
To: Matthew Wilcox <matthew@wil.cx>
In-Reply-To: <20061205142002.GN3013@parisc-linux.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Matthew Wilcox wrote:
> On Tue, Dec 05, 2006 at 10:07:48AM +0000, Christoph Hellwig wrote:
>> The filehandle idiocy on the other hand is way off into crackpipe land.
> 
> Right, and it needs to be discarded.  Of course, there was a real
> problem that it addressed, so we need to come up with an acceptable
> alternative.
>
> The scenario is a cluster-wide application doing simultaneous opens of
> the same file.  So thousands of nodes all hitting the same DLM locks
> (for read) all at once.  The openg() non-solution implies that all
> nodes in the cluster share the same filehandle space, so I think a
> reasonable solution can be implemented entirely within the clusterfs,
> with an extra flag to open(), say O_CLUSTER_WIDE.  When the clusterfs
> sees this flag set (in ->lookup), it can treat it as a hint that this
> pathname component is likely to be opened again on other nodes and
> broadcast that fact to the other nodes within the cluster.  Other nodes
> on seeing that hint (which could be structured as "The child "bin"
> of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle
> ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact.
> When they see their own open, they can populate the path to that file
> without asking the server for extra metadata.
> 
> There are obviously security issues there (why I say 'hint' rather than
> 'command'), but there's also security problems with open-by-filehandle.
> Note that this solution requires no syscall changes, no application
> changes, and also helps a scenario where each node opens a different
> file in the same directory.
> 
> I've never worked on a clusterfs, so there may be some gotchas (eg, how
> do you invalidate the caches of nodes when you do a rename).  But this
> has to be preferable to open-by-fh.

The openg() solution has the following advantages over what you propose.
First, it places the burden of the communication of the file handle on 
the application process, not the file system. That means less work for 
the file system. Second, it does not require that clients respond to 
unexpected network traffic. Third, the network traffic is deterministic 
-- one client interacts with the file system and then explicitly 
performs the broadcast. Fourth, it does not require that the file system 
store additional state on clients.
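
To make the flow above concrete, here is a minimal single-process sketch in C. It is only a model: openg() and openfh() are proposed calls, not existing ones, so sim_openg(), sim_openfh(), sim_open(), and sim_bcast() are hypothetical stand-ins, the handle layout and NRANKS are arbitrary, and sim_bcast() merely plays the role that something like MPI_Bcast would play in a real application. The only thing it measures is how many path lookups the file system sees.

```c
#include <assert.h>
#include <string.h>

#define NRANKS 8        /* simulated number of client processes (assumption) */
#define HANDLE_LEN 20   /* size of the opaque handle (arbitrary choice) */

/* Opaque, shareable file handle; the layout is hypothetical. */
typedef struct { unsigned char bytes[HANDLE_LEN]; } sim_handle_t;

/* Path lookups seen by the simulated file system. */
int server_lookups = 0;

/* Stand-in for a plain open(): every caller costs one path lookup. */
int sim_open(const char *path, int rank)
{
    server_lookups++;
    return path[0] != '\0' ? 100 + rank : -1;
}

/* Stand-in for the proposed openg(): one lookup, returns a handle. */
int sim_openg(const char *path, sim_handle_t *out)
{
    size_t n = strlen(path);
    if (n > HANDLE_LEN - 1)
        n = HANDLE_LEN - 1;
    server_lookups++;
    memset(out->bytes, 0, HANDLE_LEN);
    memcpy(out->bytes, path, n);
    return n > 0 ? 0 : -1;
}

/* Stand-in for the proposed openfh(): handle to descriptor, no lookup. */
int sim_openfh(const sim_handle_t *h, int rank)
{
    return h->bytes[0] != 0 ? 100 + rank : -1;
}

/* Stand-in for an application-level broadcast of the handle. */
void sim_bcast(const sim_handle_t *src, sim_handle_t dst[], int n)
{
    assert(n <= NRANKS);
    for (int i = 0; i < n; i++)
        dst[i] = *src;
}

/* Baseline: every rank opens the path itself. Returns ranks opened. */
int independent_open(const char *path)
{
    int opened = 0;
    for (int r = 0; r < NRANKS; r++)
        if (sim_open(path, r) >= 0)
            opened++;
    return opened;
}

/* One rank calls openg, the application broadcasts, all ranks openfh. */
int collective_open(const char *path)
{
    sim_handle_t root, per_rank[NRANKS];
    int opened = 0;

    if (sim_openg(path, &root) != 0)
        return 0;
    sim_bcast(&root, per_rank, NRANKS);
    for (int r = 0; r < NRANKS; r++)
        if (sim_openfh(&per_rank[r], r) >= 0)
            opened++;
    return opened;
}
```

In this model, independent_open() leaves server_lookups at NRANKS, while collective_open() leaves it at 1. The handle travels through the application's own broadcast, and no rank holds file system state beyond the handle it was explicitly given.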

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
the flag) would likely cause a storm of network traffic if clients were 
closely synchronized (which they are likely to be). We could work around 
this by having one application open early, then barrier, then have 
everyone else open, but then we might as well have just sent the handle 
as the barrier operation, and we've made the use of the O_CLUSTER_WIDE 
open() significantly more complicated for the application.

However, the application change issue is actually moot; we will make
whatever changes are needed inside our MPI-IO implementation, and many
users will get the benefits for free.

The readdirplus(), readx()/writex(), and openg()/openfh() calls were all
designed to allow our applications to explain exactly what they wanted 
and to allow for explicit communication. I understand that there is a 
tendency toward solutions where the FS guesses what the app is going to 
do or is passed a hint (e.g. fadvise) about what is going to happen, 
because these things don't require interface changes. But these 
solutions just aren't as effective as actually spelling out what the 
application wants.

Regards,

Rob