From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sage Weil
Subject: Re: readdirplus() as possible POSIX I/O API
Date: Tue, 5 Dec 2006 15:20:27 -0800 (PST)
Message-ID:
References: <6.2.3.4.2.20061127213243.04f786c0@cic-mail.lanl.gov>
 <20061128055428.GA29891@infradead.org>
 <20061129090450.GA16296@infradead.org>
 <20061129094815.GE6429@schatzie.adilger.int>
 <1164795522.7557.45.camel@imp.csi.cam.ac.uk>
 <20061129082622.GA20285@cynthia.pants.nu>
 <20061130092548.GA1534@infradead.org>
 <1164950795.5761.25.camel@lade.trondhjem.org>
 <1164984094.5761.86.camel@lade.trondhjem.org>
 <1164996475.5761.150.camel@lade.trondhjem.org>
 <457462AF.5080601@redhat.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Trond Myklebust , Christoph Hellwig , Brad Boyer ,
 Anton Altaparmakov , Andreas Dilger , Gary Grider ,
 linux-fsdevel@vger.kernel.org
Return-path:
Received: from pochacco.sd.dreamhost.com ([66.33.201.150]:34095 "EHLO
 pochacco.sd.dreamhost.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S967467AbWLEXUj (ORCPT );
 Tue, 5 Dec 2006 18:20:39 -0500
To: Peter Staubach
In-Reply-To: <457462AF.5080601@redhat.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon, 4 Dec 2006, Peter Staubach wrote:
> I think that there are several points which are missing here.
>
> First, readdirplus(), without any sort of caching, is going to be _very_
> expensive, performance-wise, for _any_ size directory.  You can see this
> by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS
> semantics.

Are you referring to the work the server must do to gather stat
information for each inode?

> Second, the NFS client side readdirplus() implementation is going to be
> _very_ expensive as well.  The NFS client does write-behind and all this
> data _must_ be flushed to the server _before_ the over the wire READDIRPLUS
> can be issued.  This means that the client will have to step through every
> inode which is associated with the directory inode being readdirplus()'d
> and ensure that all modified data has been successfully written out.  This
> part of the operation, for a sufficiently large directory and a sufficiently
> large page cache, could take significant time in itself.

Why can't the client send the over the wire READDIRPLUS without flushing
inode data, and then simply ignore the stat portion of the server's
response in instances where its locally cached (and dirty) inode data is
newer than the server's?

> These overheads may make this new operation expensive enough that no
> applications will end up using it.

If the application calls readdirplus() only when it would otherwise do
readdir()+stat(), the flushing you mention would happen anyway (from the
stat()).  Wouldn't this at least allow that to happen in parallel for the
whole directory?

sage
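
To make the "ignore the stat portion" idea above concrete, here is a minimal
sketch in Python of how a client might merge a READDIRPLUS-style reply with
its own cache. It is an illustration only, not how any NFS client actually
works: the names Attrs, CachedInode and merge_readdirplus are hypothetical,
and a single per-inode dirty flag stands in for whatever the client really
tracks about unflushed write-behind data.

```python
from dataclasses import dataclass


@dataclass
class Attrs:
    """Hypothetical subset of stat information for one directory entry."""
    size: int
    mtime: float


@dataclass
class CachedInode:
    """Hypothetical client-side cache entry."""
    attrs: Attrs
    dirty: bool  # True if the client holds unflushed writes for this inode


def merge_readdirplus(server_entries, local_cache):
    """Combine a READDIRPLUS-style reply with the client's cache.

    server_entries: dict mapping name -> Attrs as reported by the server
    local_cache:    dict mapping name -> CachedInode held by the client

    Assumption: a dirty cached inode is always newer than the server's copy,
    so its attributes win and nothing is flushed before the request.
    """
    result = {}
    for name, server_attrs in server_entries.items():
        cached = local_cache.get(name)
        if cached is not None and cached.dirty:
            # Local write-behind data is newer: ignore the server's stat.
            result[name] = cached.attrs
        else:
            # Clean or uncached inode: trust the server's reply.
            result[name] = server_attrs
    return result
```

Under these assumptions the request can go out without waiting on any flush;
only the per-entry choice of attributes depends on the dirty state.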