From mboxrd@z Thu Jan 1 00:00:00 1970 From: Anton Altaparmakov Subject: Re: Expected getdents behaviour Date: Thu, 15 Sep 2005 22:04:13 +0100 (BST) Message-ID: References: <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <1126797460.1676.23.camel@imp.csi.cam.ac.uk> <20050915164110.GA25573@hasse.suse.de> <20050915174658.GA9974@wohnheim.fh-wedel.de> <20050915181946.GH22503@thunk.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: J?rn Engel , Jan Blunck , miklos@szeredi.hu, aaranya@cs.sunysb.edu, linux-fsdevel@vger.kernel.org Return-path: Received: from ppsw-9.csi.cam.ac.uk ([131.111.8.139]:37554 "EHLO ppsw-9.csi.cam.ac.uk") by vger.kernel.org with ESMTP id S1161028AbVIOVEW (ORCPT ); Thu, 15 Sep 2005 17:04:22 -0400 To: Theodore Ts'o In-Reply-To: <20050915181946.GH22503@thunk.org> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On Thu, 15 Sep 2005, Theodore Ts'o wrote: > On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > > Correct. But you're missing half of the problem. > > > > The other half is getting O(1) or O(ln(n)) performance for lookup on > > a large directory. Solution to this half alone would be to use a > > balanced tree for directory entries. Now, using a self-balancing tree > > would not solve your half of the problem, so it is not a full > > solution. > > > > Ext2, as far as I understand it, solved both halves by having both a > > hashtree and the sequential list. Getdents is walking the list and > > ignoring entries with the "deleted" flag. Lookup is using the > > hashtree, which points to the list for actual data. > > Actually, no. What we do is return the directory entries in hash sort > order, by walking the btree. And we use the hash as the offset > returned in d_off and via telldir(). The reason for this? So that if > we add files which causes a node to split, that we still only return > files in the directory once and only once. If we traversed the tree > in sequential order only, then some files could be returned multiple > times, and some files returned not at all after a node split > operation. I don't know how your hash works but on ntfs this does not work like that. There is no parameter which would be able to tell you where you are in the b tree, unless you take a reference on an entry so it cannot be removed (rename==remove btw) but how does the reference get released later? > This is part of the "hair" that is needed in order to have a > POSIX-compliant (and non-buggy) readdir/telldir implementation. The > downside of returning files in hash sort order is that an application > which tries to stat or open every single file in a directory in > readdir() order will no longer do so in rough creation order, and this > can cause the disk heads to seek all over the inode table to read the > inode. This performance loss can fixed by sorting the files by inode > number before opening them. This can be done in the application, or > via a LD_PRELOAD library, which reads the entire directory at > opendir() time, and then sorts the directory and returns it in sorted > inode order. (Why can't we do this in the kernel? because for a very > large directory, we would have to allocate a huge amount of > non-swappable memory to store all of the filenames in the directory.) > > As I said, implementing POSIX-compliant > opendir/readdir/telldir/seekdir is tricky for some filesystems, and > filesystem engineers often curse POSIX telldir/seekdir. But they are > a fact of life, and we have to deal with them. Sure but given a large enough directory you will OOM the system rather than return the correct result which is why I keep saying the interface is broken... Best regards, Anton -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/