From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sage Weil
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Thu, 30 Nov 2006 09:49:08 -0800 (PST)
Message-ID:
References: <6.2.3.4.2.20061127213243.04f786c0@cic-mail.lanl.gov>
 <20061128055428.GA29891@infradead.org>
 <20061129090450.GA16296@infradead.org>
 <20061129094815.GE6429@schatzie.adilger.int>
 <1164795522.7557.45.camel@imp.csi.cam.ac.uk>
 <20061129082622.GA20285@cynthia.pants.nu>
 <20061130092548.GA1534@infradead.org>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Brad Boyer, Anton Altaparmakov, Andreas Dilger, Gary Grider,
 linux-fsdevel@vger.kernel.org
Return-path:
Received: from pochacco.sd.dreamhost.com ([66.33.201.150]:34477 "EHLO
 pochacco.sd.dreamhost.com") by vger.kernel.org with ESMTP
 id S1030902AbWK3RtU (ORCPT ); Thu, 30 Nov 2006 12:49:20 -0500
To: Christoph Hellwig
In-Reply-To: <20061130092548.GA1534@infradead.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, 30 Nov 2006, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 12:26:22AM -0800, Brad Boyer wrote:
>> For a more extreme case, hfs and hfsplus don't even have a separation
>> between directory entries and inode information. The code creates this
>> separation synthetically to match the expectations of the kernel. During
>> a readdir(), the full catalog record is loaded from disk, but all that
>> is used is the information passed back to the filldir callback. The only
>> thing that would be needed to return extra information would be code to
>> copy information from the internal structure to whatever the system call
>> used to return data to the program.
>
> In this case you can in fact already instantiate inodes from readdir.
> Take a look at the NFS code.

Sure.  And having readdirplus over the wire is a great performance win
for NFS, but it works only because NFS metadata consistency is already
weak.

Giving applications an atomic readdirplus makes things considerably
simpler for distributed filesystems that want to provide strong
consistency (and a reasonable interpretation of what POSIX semantics mean
for a distributed filesystem).  In particular, it allows the application
(e.g. ls --color or -al) to communicate to the kernel and filesystem that
it doesn't care about the relative ordering of each subsequent stat() with
respect to other writers (possibly on different hosts, with whom
synchronization can incur a heavy performance penalty), but rather only
wants a snapshot of dentry+inode state.

As Andreas already mentioned, detecting this (exceedingly common) case may
be possible with heuristics (e.g. watching the ordering of stat() calls vs
the filldir results), but that's hardly ideal when a cleaner interface can
explicitly capture the application's requirements.

sage
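
The access pattern discussed above can be sketched in userspace roughly as
follows.  This is only an illustration, written in Python for brevity and
under loud assumptions: ls_with_stat() and readdirplus() are hypothetical
names made up for this sketch, not an existing or proposed syscall
signature, and the emulation built on os.scandir() is NOT atomic.  It only
shows the shape of the two interfaces: a readdir pass followed by N
independent stat() calls, versus one call that hands back name+attribute
pairs.

```python
import os
import stat
import tempfile


def ls_with_stat(path):
    """Roughly what ls -al does today: list the names, then lstat() each one.

    Every lstat() is a separate request, so a strongly consistent
    filesystem has to order each of them against concurrent writers.
    """
    names = sorted(os.listdir(path))
    return [(name, os.lstat(os.path.join(path, name))) for name in names]


def readdirplus(path):
    """HYPOTHETICAL interface: return (name, stat_result) pairs in one call.

    Assumption: a real implementation would let the filesystem answer from
    a single snapshot of dentry+inode state.  This userspace emulation
    cannot do that and may still issue one stat per entry underneath.
    """
    with os.scandir(path) as it:
        entries = sorted(it, key=lambda e: e.name)
        return [(e.name, e.stat(follow_symlinks=False)) for e in entries]


if __name__ == "__main__":
    # Build a small scratch directory and list it through the sketch.
    with tempfile.TemporaryDirectory() as d:
        for name, size in (("a", 1), ("b", 20)):
            with open(os.path.join(d, name), "wb") as f:
                f.write(b"x" * size)
        for name, st in readdirplus(d):
            print(stat.filemode(st.st_mode), st.st_size, name)
```

With the first form the filesystem sees N unrelated lstat() requests and
has to guess that they belong to one listing; with the second the intent
(a snapshot of dentry+inode state) is stated by the call itself.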