From mboxrd@z Thu Jan  1 00:00:00 1970
From: Theodore Ts'o <tytso@mit.edu>
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 14:19:47 -0400
Message-ID: <20050915181946.GH22503@thunk.org>
References: <e483447805091506573daebc21@mail.gmail.com> <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <E1EFuzO-00037w-00@dorka.pomaz.szeredi.hu> <1126797460.1676.23.camel@imp.csi.cam.ac.uk> <20050915164110.GA25573@hasse.suse.de> <20050915174658.GA9974@wohnheim.fh-wedel.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Blunck <jblunck@suse.de>, Anton Altaparmakov <aia21@cam.ac.uk>,
	miklos@szeredi.hu, aaranya@cs.sunysb.edu,
	linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from thunk.org ([69.25.196.29]:27025 "EHLO thunker.thunk.org")
	by vger.kernel.org with ESMTP id S1030561AbVIOSUF (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 15 Sep 2005 14:20:05 -0400
To: J?rn Engel <joern@wohnheim.fh-wedel.de>
Content-Disposition: inline
In-Reply-To: <20050915174658.GA9974@wohnheim.fh-wedel.de>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
> Correct.  But you're missing half of the problem.
> 
> The other half is getting O(1)  or O(ln(n)) performance for lookup on
> a large directory.  Solution to this half alone would be to use a
> balanced tree for directory entries.  Now, using a self-balancing tree
> would not solve your half of the problem, so it is not a full
> solution.
> 
> Ext2, as far as I understand it, solved both halves by having both a
> hashtree and the sequential list.  Getdents is walking the list and
> ignoring entries with the "deleted" flag.  Lookup is using the
> hashtree, which points to the list for actual data.

Actually, no.  What we do is return the directory entries in hash sort
order, by walking the btree.  And we use the hash as the offset
returned in d_off and via telldir().  The reason for this?  So that if
we add files which causes a node to split, that we still only return
files in the directory once and only once.  If we traversed the tree
in sequential order only, then some files could be returned multiple
times, and some files returned not at all after a node split
operation.

This is part of the "hair" that is needed in order to have a
POSIX-compliant (and non-buggy) readdir/telldir implementation.  The
downside of returning files in hash sort order is that an application
which tries to stat or open every single file in a directory in
readdir() order will no longer do so in rough creation order, and this
can cause the disk heads to seek all over the inode table to read the
inode.  This performance loss can fixed by sorting the files by inode
number before opening them.  This can be done in the application, or
via a LD_PRELOAD library, which reads the entire directory at
opendir() time, and then sorts the directory and returns it in sorted
inode order.  (Why can't we do this in the kernel? because for a very
large directory, we would have to allocate a huge amount of
non-swappable memory to store all of the filenames in the directory.)

As I said, implementing POSIX-compliant
opendir/readdir/telldir/seekdir is tricky for some filesystems, and
filesystem engineers often curse POSIX telldir/seekdir.  But they are
a fact of life, and we have to deal with them.

						- Ted