From mboxrd@z Thu Jan 1 00:00:00 1970 From: Theodore Ts'o Subject: Re: Expected getdents behaviour Date: Thu, 15 Sep 2005 14:19:47 -0400 Message-ID: <20050915181946.GH22503@thunk.org> References: <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <1126797460.1676.23.camel@imp.csi.cam.ac.uk> <20050915164110.GA25573@hasse.suse.de> <20050915174658.GA9974@wohnheim.fh-wedel.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Blunck , Anton Altaparmakov , miklos@szeredi.hu, aaranya@cs.sunysb.edu, linux-fsdevel@vger.kernel.org Return-path: Received: from thunk.org ([69.25.196.29]:27025 "EHLO thunker.thunk.org") by vger.kernel.org with ESMTP id S1030561AbVIOSUF (ORCPT ); Thu, 15 Sep 2005 14:20:05 -0400 To: J?rn Engel Content-Disposition: inline In-Reply-To: <20050915174658.GA9974@wohnheim.fh-wedel.de> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > Correct. But you're missing half of the problem. > > The other half is getting O(1) or O(ln(n)) performance for lookup on > a large directory. Solution to this half alone would be to use a > balanced tree for directory entries. Now, using a self-balancing tree > would not solve your half of the problem, so it is not a full > solution. > > Ext2, as far as I understand it, solved both halves by having both a > hashtree and the sequential list. Getdents is walking the list and > ignoring entries with the "deleted" flag. Lookup is using the > hashtree, which points to the list for actual data. Actually, no. What we do is return the directory entries in hash sort order, by walking the btree. And we use the hash as the offset returned in d_off and via telldir(). The reason for this? So that if we add files which causes a node to split, that we still only return files in the directory once and only once. If we traversed the tree in sequential order only, then some files could be returned multiple times, and some files returned not at all after a node split operation. This is part of the "hair" that is needed in order to have a POSIX-compliant (and non-buggy) readdir/telldir implementation. The downside of returning files in hash sort order is that an application which tries to stat or open every single file in a directory in readdir() order will no longer do so in rough creation order, and this can cause the disk heads to seek all over the inode table to read the inode. This performance loss can fixed by sorting the files by inode number before opening them. This can be done in the application, or via a LD_PRELOAD library, which reads the entire directory at opendir() time, and then sorts the directory and returns it in sorted inode order. (Why can't we do this in the kernel? because for a very large directory, we would have to allocate a huge amount of non-swappable memory to store all of the filenames in the directory.) As I said, implementing POSIX-compliant opendir/readdir/telldir/seekdir is tricky for some filesystems, and filesystem engineers often curse POSIX telldir/seekdir. But they are a fact of life, and we have to deal with them. - Ted