From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rob Ross
Subject: Re: NFSv4/pNFS possible POSIX I/O API standards
Date: Mon, 04 Dec 2006 18:59:54 -0600
Message-ID: <4574C48A.8030007@mcs.anl.gov>
References: <20061129094815.GE6429@schatzie.adilger.int>
 <1164795522.7557.45.camel@imp.csi.cam.ac.uk>
 <20061129082622.GA20285@cynthia.pants.nu>
 <20061130092548.GA1534@infradead.org>
 <1164950795.5761.25.camel@lade.trondhjem.org>
 <1164984094.5761.86.camel@lade.trondhjem.org>
 <20061203015203.GA5656@schatzie.adilger.int>
 <20061204073200.GB5637@schatzie.adilger.int>
 <1165245336.711.176.camel@lade.trondhjem.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andreas Dilger, Sage Weil, Christoph Hellwig, Brad Boyer,
 Anton Altaparmakov, Gary Grider, linux-fsdevel@vger.kernel.org
Return-path:
Received: from mailgw.mcs.anl.gov ([140.221.9.4]:55019 "EHLO mailgw.mcs.anl.gov" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758394AbWLEA76 (ORCPT ); Mon, 4 Dec 2006 19:59:58 -0500
To: Trond Myklebust
In-Reply-To: <1165245336.711.176.camel@lade.trondhjem.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Hi all,

I don't think that the group intended that there be an opendirplus(); rather readdirplus() would simply be called instead of the usual readdir(). We should clarify that.

Regarding Peter Staubach's comments about no one ever using the readdirplus() call; well, if people weren't performing this workload in the first place, we wouldn't *need* this sort of call! This call is specifically targeted at improving "ls -l" performance on large directories, and Sage has pointed out quite nicely how that might work.

In our case (PVFS), we would essentially perform three phases of communication with the file system for a readdirplus that was obtaining full statistics: first grabbing the directory entries, then obtaining metadata from servers on all objects in bulk, then gathering file sizes in bulk. The reduction in control message traffic is enormous, and the concurrency is much greater than in a readdir()+stat()s workload.

We'd never perform this sort of optimization optimistically, as the cost of guessing wrong is just too high. We would want to see the call as a proper VFS operation that we could act upon.

The entire readdirplus() operation wasn't intended to be atomic, and in fact the returned structure has space for an error associated with the stat() on a particular entry, to allow for implementations that stat() subsequently and get an error because the object was removed between when the entry was read out of the directory and when the stat was performed. I think this fits well with what Andreas and others are thinking. We should clarify the description appropriately.

I don't think that we have a readdirpluslite() variation documented yet? Gary? It would make a lot of sense. Except that it should probably have a better name...

Regarding Andreas's note that he would prefer the statlite() flags to mean "valid", that makes good sense to me (and would obviously apply to the so-far even more hypothetical readdirpluslite()). I don't think there's a lot of value in returning possibly-inaccurate values?

Thanks everyone,

Rob

Trond Myklebust wrote:
> On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote:
>>> I'm wondering if a corresponding opendirplus() (or similar) would also be
>>> appropriate to inform the kernel/filesystem that readdirplus() will
>>> follow, and stat information should be gathered/buffered. Or do most
>>> implementations wait for the first readdir() before doing any actual work
>>> anyway?
>> I'm not sure what some filesystems might do here.
>> I suppose NFS has weak
>> enough cache semantics that it _might_ return stale cached data from the
>> client in order to fill the readdirplus() data, but it is just as likely
>> that it ships the whole thing to the server and returns everything in
>> one shot. That would imply everything would be at least as up-to-date
>> as the opendir().
>
> Whether or not the posix committee decides on readdirplus, I propose
> that we implement this sort of thing in the kernel via a readdir
> equivalent to posix_fadvise(). That can give exactly the barrier
> semantics that they are asking for, and only costs 1 extra syscall as
> opposed to 2 (opendirplus() and readdirplus()).
>
> Cheers
> Trond
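
For illustration only: the non-atomic, per-entry-error behaviour described in the reply above can be emulated in user space. This is a rough Python sketch under stated assumptions, not the proposed interface. The names DirentPlus, read_entries, stat_entries and readdirplus_emulated are hypothetical, and the sketch issues one lstat() per entry instead of the bulk metadata requests a file system such as PVFS would send.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirentPlus:
    """One readdirplus()-style result (hypothetical layout, not the proposed struct)."""
    name: str
    st: Optional[os.stat_result]  # None when the stat step failed
    err: int                      # 0 on success, else the errno from the stat step


def read_entries(path: str) -> List[str]:
    # Phase 1: read the directory entries only.
    return sorted(os.listdir(path))


def stat_entries(path: str, names: List[str]) -> List[DirentPlus]:
    # Later phase: gather attributes for names that were read earlier.
    # An entry may have been removed in between, so a failure is recorded
    # per entry instead of failing the whole call.
    results = []
    for name in names:
        try:
            results.append(DirentPlus(name, os.lstat(os.path.join(path, name)), 0))
        except OSError as e:
            results.append(DirentPlus(name, None, e.errno))
    return results


def readdirplus_emulated(path: str) -> List[DirentPlus]:
    # The two steps are not atomic with respect to concurrent unlink().
    return stat_entries(path, read_entries(path))
```

A caller would check err on each entry before trusting st, which mirrors the idea of fields being marked "valid" instead of being returned with possibly-inaccurate values.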