From: Sage Weil
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Fri, 1 Dec 2006 10:42:27 -0800 (PST)
To: Trond Myklebust
Cc: Christoph Hellwig, Brad Boyer, Anton Altaparmakov, Andreas Dilger,
    Gary Grider, linux-fsdevel@vger.kernel.org
In-Reply-To: <1164996475.5761.150.camel@lade.trondhjem.org>
References: <6.2.3.4.2.20061127213243.04f786c0@cic-mail.lanl.gov>
    <20061128055428.GA29891@infradead.org>
    <20061129090450.GA16296@infradead.org>
    <20061129094815.GE6429@schatzie.adilger.int>
    <1164795522.7557.45.camel@imp.csi.cam.ac.uk>
    <20061129082622.GA20285@cynthia.pants.nu>
    <20061130092548.GA1534@infradead.org>
    <1164950795.5761.25.camel@lade.trondhjem.org>
    <1164984094.5761.86.camel@lade.trondhjem.org>
    <1164996475.5761.150.camel@lade.trondhjem.org>
List-Id: linux-fsdevel.vger.kernel.org

On Fri, 1 Dec 2006, Trond Myklebust wrote:
> I'm quite happy with a proposal for a statlite(). I'm objecting to
> readdirplus() because I can't see that it offers you anything useful.
> You haven't provided an example of an application which would clearly
> benefit from a readdirplus() interface instead of readdir()+statlite()
> and possibly some tools for managing cache consistency.

Okay, now I think I understand where you're coming from.

The difference between readdirplus() and readdir()+statlite() is that
(depending on the mask you specify) statlite() either provides the
"right" answer (a la stat()), or anything that is vaguely "recent."
readdirplus() would provide size/mtime from sometime _after_ the
initial opendir() call, establishing a useful ordering.
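
For concreteness, the readdir()+stat() pattern discussed in this thread
can be sketched in C roughly as follows. This is only an illustration,
not code from the thread: list_with_stat() is a hypothetical helper
name, and it uses the standard fstatat() call, since statlite() and
readdirplus() are proposals rather than existing interfaces. Each
directory entry costs one separate metadata lookup, which is the
per-file round trip that can become expensive on a large directory in a
distributed filesystem.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the thread): walk a directory with
 * readdir() and issue one fstatat() per entry, much as 'ls -al' does.
 * If 'out' is non-NULL, print size, mtime (seconds) and name per entry.
 * Returns the number of entries stat'ed, or -1 if the directory
 * cannot be opened.
 */
long list_with_stat(const char *path, FILE *out)
{
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *de;

    while ((de = readdir(dir)) != NULL) {
        struct stat st;

        /* One metadata lookup per name; on a network filesystem
         * each of these may turn into its own round trip. */
        if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue; /* entry vanished between readdir() and the stat */

        if (out != NULL)
            fprintf(out, "%12lld %lld %s\n",
                    (long long)st.st_size,
                    (long long)st.st_mtime,
                    de->d_name);
        count++;
    }

    closedir(dir);
    return count;
}
```

Calling list_with_stat(".", stdout) prints one line per entry in the
current directory. A readdirplus()-style call would instead return the
names and attributes together, with the attributes no older than the
opendir().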

So without readdirplus(), you either get readdir()+stat() and the
performance problems I mentioned before, or readdir()+statlite() where
"recent" may not be good enough.

Instead of my previous example of process #1 waiting for process #2 to
finish and then checking the results with stat(), imagine instead that
#1 is waiting for 100,000 other processes to finish, and then wants to
check the results (size/mtime) of all of them. readdir()+statlite()
won't work, and readdir()+stat() may be pathologically slow.

Also, it's a tired and trivial example, but even the 'ls -al' scenario
isn't ideally addressed by readdir()+statlite(), since statlite() might
return size/mtime from before 'ls -al' was executed by the user. One
can easily imagine modifying a file on one host, then doing 'ls -al' on
another host and not seeing the effects. If 'ls -al' can use
readdirplus(), its overall application semantics can be preserved
without hammering large directories in a distributed filesystem.

> I agree that an interface which allows a userland process offer hints to
> the kernel as to what kind of cache consistency it requires for file
> metadata would be useful. We already have stuff like posix_fadvise() etc
> for file data, and perhaps it might be worth looking into how you could
> devise something similar for metadata.

> If what you really want is for applications to be able to manage network
> filesystem cache consistency, then why not provide those tools instead?

True, something to manage the attribute cache consistency for
statlite() results would also address the issue by letting an
application declare how weak its results are allowed to be. That seems
a bit more awkward, though, and would only affect statlite()--the only
call that allows weak consistency in the first place. In contrast,
readdirplus() maps nicely onto what filesystems like NFS are already
doing over the wire.

sage