From mboxrd@z Thu Jan  1 00:00:00 1970
From: Peter Staubach <staubach@redhat.com>
Subject: Re: readdirplus() as possible POSIX I/O API
Date: Wed, 06 Dec 2006 10:48:12 -0500
Message-ID: <4576E63C.2040409@redhat.com>
References: <6.2.3.4.2.20061127213243.04f786c0@cic-mail.lanl.gov>  <20061128055428.GA29891@infradead.org>  <20061129090450.GA16296@infradead.org>  <20061129094815.GE6429@schatzie.adilger.int>  <1164795522.7557.45.camel@imp.csi.cam.ac.uk>  <20061129082622.GA20285@cynthia.pants.nu>  <20061130092548.GA1534@infradead.org>  <Pine.LNX.4.62.0611300916260.8918@wtf.di.newdream.net>  <1164950795.5761.25.camel@lade.trondhjem.org>  <Pine.LNX.4.62.0611302157580.10257@wtf.di.newdream.net>  <1164984094.5761.86.camel@lade.trondhjem.org>  <Pine.LNX.4.62.0612010846400.10257@wtf.di.newdream.net> <1164996475.5761.150.camel@lade.trondhjem.org> <Pine.LNX.4.62.0612011018100.15475@wtf.di.newdream.net> <457462AF.5080601@redhat.com> <Pine.LNX.4.62.0612051439350.15475@wtf.di.newdream.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>,
	Christoph Hellwig <hch@infradead.org>,
	Brad Boyer <flar@allandria.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	Andreas Dilger <adilger@clusterfs.com>,
	Gary Grider <ggrider@lanl.gov>, linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx1.redhat.com ([66.187.233.31]:37452 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S936043AbWLFPsf (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 6 Dec 2006 10:48:35 -0500
To: Sage Weil <sage@newdream.net>
In-Reply-To: <Pine.LNX.4.62.0612051439350.15475@wtf.di.newdream.net>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Sage Weil wrote:
> On Mon, 4 Dec 2006, Peter Staubach wrote:
>> I think that there are several points which are missing here.
>>
>> First, readdirplus(), without any sort of caching, is going to be _very_
>> expensive, performance-wise, for _any_ size directory.  You can see this
>> by instrumenting any NFS server which already supports the NFSv3
>> READDIRPLUS semantics.
>
> Are you referring to the work the server must do to gather stat 
> information for each inode?
>

Yes, and the fact that the client will be forced to go over the wire for
each readdirplus() call, whereas it can use cached information today.
An application actually waiting on the response to a READDIRPLUS will
not be pleased at the resulting performance.

>> Second, the NFS client side readdirplus() implementation is going to be
>> _very_ expensive as well.  The NFS client does write-behind and all this
>> data _must_ be flushed to the server _before_ the over the wire
>> READDIRPLUS can be issued.  This means that the client will have to step
>> through every inode which is associated with the directory inode being
>> readdirplus()'d and ensure that all modified data has been successfully
>> written out.  This part of the operation, for a sufficiently large
>> directory and a sufficiently large page cache, could take significant
>> time in itself.
>
> Why can't the client send the over the wire READDIRPLUS without 
> flushing inode data, and then simply ignore the stat portion of the 
> server's response in instances where its locally cached (and dirty) 
> inode data is newer than the server's?
>

As far as I understand the requirements here, that would seem to
minimize the value of readdirplus().

>> These overheads may make this new operation expensive enough that no
>> applications will end up using it.
>
> If the application calls readdirplus() only when it would otherwise do 
> readdir()+stat(), the flushing you mention would happen anyway (from 
> the stat()).  Wouldn't this at least allow that to happen in parallel 
> for the whole directory?

I don't see where the parallelism comes from.  Before issuing the
READDIRPLUS over the wire, the client would have to ensure that each
and every one of those flushes was completed.  I suppose that a
sufficiently clever and complex implementation could figure out how
to schedule all those flushes asynchronously and then wait for all
of them to complete, but there will be a performance cost.  Walking
the caches for all of those inodes, perhaps using several or all of
the CPUs in the system, smacking the server with all of those WRITE
operations simultaneously with all of the associated network
bandwidth usage, all adds up to other applications on the client
and potentially the network not doing much at the same time.

All of this cost to the system and to the network for the benefit of
a single application?  That seems like a tough sell to me.

This is an easy problem to look at from the application viewpoint.
The solution seems obvious.  Give it the fastest possible way to
read the directory and retrieve stat information about every entry
in the directory.  However, when viewed at a systemic level, this
becomes a very different problem with many more aspects.  Perhaps
flow-controlling this one application in favor of many other
applications, running network wide, may be the better thing to
continue to do.
I dunno.

       ps