From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matthew Wilcox
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Wed, 6 Dec 2006 08:44:26 -0700
Message-ID: <20061206154426.GU3013@parisc-linux.org>
References: <1164984094.5761.86.camel@lade.trondhjem.org>
	<20061203015203.GA5656@schatzie.adilger.int>
	<20061204073200.GB5637@schatzie.adilger.int>
	<1165245336.711.176.camel@lade.trondhjem.org>
	<4574C48A.8030007@mcs.anl.gov>
	<1165298200.5776.26.camel@lade.trondhjem.org>
	<20061205100748.GC5871@infradead.org>
	<20061205142002.GN3013@parisc-linux.org>
	<4576DBE0.9090305@mcs.anl.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Christoph Hellwig, Trond Myklebust, Andreas Dilger, Sage Weil,
	Brad Boyer, Anton Altaparmakov, Gary Grider,
	linux-fsdevel@vger.kernel.org
Return-path:
Received: from palinux.external.hp.com ([192.25.206.14]:37429 "EHLO
	mail.parisc-linux.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S935790AbWLFPo2 (ORCPT );
	Wed, 6 Dec 2006 10:44:28 -0500
To: Rob Ross
Content-Disposition: inline
In-Reply-To: <4576DBE0.9090305@mcs.anl.gov>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
> The openg() solution has the following advantages to what you propose.
> First, it places the burden of the communication of the file handle on
> the application process, not the file system. That means less work for
> the file system. Second, it does not require that clients respond to
> unexpected network traffic. Third, the network traffic is deterministic
> -- one client interacts with the file system and then explicitly
> performs the broadcast. Fourth, it does not require that the file system
> store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number. The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.
:
: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed). For example, how long is the
: fh_t intended to be valid for? Forever? Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file? What happens if we revoke() the original fd?
:
: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it? It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing
> the flag) would likely cause a storm of network traffic if clients were
> closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There are several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it has recently sent
out a broadcast.
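
To make the magic-number idea from the quoted mail above a bit more
concrete, here is a minimal sketch in C.  Everything in it is an
assumption for illustration only: the fh_t layout, the FH_PAYLOAD_MAX
size, the SKETCH_* names, the fs_table and the fh_owner() helper are all
hypothetical, and none of them comes from the openg() proposal.  Only the
two numeric values are real; they match the ext2 and NFS entries in
include/linux/magic.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the proposal leaves fh_t opaque and unspecified. */
#define FH_PAYLOAD_MAX 60

typedef struct {
    uint32_t magic;                        /* names the owning filesystem */
    unsigned char payload[FH_PAYLOAD_MAX]; /* filesystem-private bytes */
} fh_t;

/* Values as in include/linux/magic.h; the names are local to this sketch. */
#define SKETCH_EXT2_MAGIC 0xEF53u
#define SKETCH_NFS_MAGIC  0x6969u

struct fs_entry {
    uint32_t magic;
    const char *name;
};

/* Hypothetical registry of filesystems able to translate a handle. */
static const struct fs_entry fs_table[] = {
    { SKETCH_EXT2_MAGIC, "ext2" },
    { SKETCH_NFS_MAGIC,  "nfs"  },
};

/*
 * Return the name of the filesystem that should be asked to translate
 * this handle, or NULL if no registered filesystem claims the magic.
 */
static const char *fh_owner(const fh_t *fh)
{
    size_t i;

    for (i = 0; i < sizeof(fs_table) / sizeof(fs_table[0]); i++)
        if (fs_table[i].magic == fh->magic)
            return fs_table[i].name;
    return NULL;
}
```

A handle whose magic field is 0x6969 would be routed to "nfs", and one
with an unregistered value would be rejected up front.  Note that this
only answers the "which filesystem do we ask" question; it does nothing
for the lifetime, credential and revoke() questions raised above.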

> However, the application change issue is actually moot; we will make
> whatever changes inside our MPI-IO implementation, and many users will
> get the benefits for free.

That's good.

> The readdirplus(), readx()/writex(), and openg()/openfh() were all
> designed to allow our applications to explain exactly what they wanted
> and to allow for explicit communication. I understand that there is a
> tendency toward solutions where the FS guesses what the app is going to
> do or is passed a hint (e.g. fadvise) about what is going to happen,
> because these things don't require interface changes. But these
> solutions just aren't as effective as actually spelling out what the
> application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() make me think you haven't really looked
at it from the "what could go wrong" perspective.  I'd be interested in
reviewing the readx()/writex() interfaces, but I still don't see a
document for them anywhere.