From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Staubach
Subject: Re: readdirplus() as possible POSIX I/O API
Date: Wed, 06 Dec 2006 10:48:12 -0500
Message-ID: <4576E63C.2040409@redhat.com>
References: <6.2.3.4.2.20061127213243.04f786c0@cic-mail.lanl.gov>
 <20061128055428.GA29891@infradead.org>
 <20061129090450.GA16296@infradead.org>
 <20061129094815.GE6429@schatzie.adilger.int>
 <1164795522.7557.45.camel@imp.csi.cam.ac.uk>
 <20061129082622.GA20285@cynthia.pants.nu>
 <20061130092548.GA1534@infradead.org>
 <1164950795.5761.25.camel@lade.trondhjem.org>
 <1164984094.5761.86.camel@lade.trondhjem.org>
 <1164996475.5761.150.camel@lade.trondhjem.org>
 <457462AF.5080601@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Trond Myklebust, Christoph Hellwig, Brad Boyer, Anton Altaparmakov,
 Andreas Dilger, Gary Grider, linux-fsdevel@vger.kernel.org
Return-path:
Received: from mx1.redhat.com ([66.187.233.31]:37452 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S936043AbWLFPsf (ORCPT ); Wed, 6 Dec 2006 10:48:35 -0500
To: Sage Weil
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Sage Weil wrote:
> On Mon, 4 Dec 2006, Peter Staubach wrote:
>> I think that there are several points which are missing here.
>>
>> First, readdirplus(), without any sort of caching, is going to be
>> _very_ expensive, performance-wise, for _any_ size directory.  You
>> can see this by instrumenting any NFS server which already supports
>> the NFSv3 READDIRPLUS semantics.
>
> Are you referring to the work the server must do to gather stat
> information for each inode?
>

Yes and the fact that the client will be forced to go over the wire for
each readdirplus() call, whereas it can use cached information today.
An application actually waiting on the response to a READDIRPLUS will
not be pleased at the resulting performance.

>> Second, the NFS client side readdirplus() implementation is going to
>> be _very_ expensive as well.  The NFS client does write-behind and
>> all this data _must_ be flushed to the server _before_ the over the
>> wire READDIRPLUS can be issued.  This means that the client will
>> have to step through every inode which is associated with the
>> directory inode being readdirplus()'d and ensure that all modified
>> data has been successfully written out.  This part of the operation,
>> for a sufficiently large directory and a sufficiently large page
>> cache, could take significant time in itself.
>
> Why can't the client send the over the wire READDIRPLUS without
> flushing inode data, and then simply ignore the stat portion of the
> server's response in instances where its locally cached (and dirty)
> inode data is newer than the server's?
>

This would seem to minimize the value as far as I understand the
requirements here.

>> These overheads may make this new operation expensive enough that no
>> applications will end up using it.
>
> If the application calls readdirplus() only when it would otherwise do
> readdir()+stat(), the flushing you mention would happen anyway (from
> the stat()).  Wouldn't this at least allow that to happen in parallel
> for the whole directory?

I don't see where the parallelism comes from.  Before issuing the
READDIRPLUS over the wire, the client would have to ensure that each
and every one of those flushes was completed.  I suppose that a
sufficiently clever and complex implementation could figure out how to
schedule all those flushes asynchronously and then wait for all of them
to complete, but there will be a performance cost.
Walking the caches for all of those inodes, perhaps using several or
all of the CPUs in the system, and smacking the server with all of
those WRITE operations simultaneously, with all of the associated
network bandwidth usage, adds up to other applications on the client,
and potentially on the network, not getting much done at the same time.

All of this cost to the system and to the network for the benefit of a
single application?  That seems like a tough sell to me.

This is an easy problem to look at from the application viewpoint.
The solution seems obvious: give the application the fastest possible
way to read the directory and retrieve stat information about every
entry in it.  However, when viewed from a systemic level, this becomes
a very different problem with many more aspects.  Perhaps flow
controlling this one application in favor of many other applications,
running network-wide, may be the better thing to continue to do.  I
dunno.

    ps