linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Theodore Ts'o <tytso@mit.edu>
To: J?rn Engel <joern@wohnheim.fh-wedel.de>
Cc: Jan Blunck <jblunck@suse.de>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	miklos@szeredi.hu, aaranya@cs.sunysb.edu,
	linux-fsdevel@vger.kernel.org
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 14:19:47 -0400	[thread overview]
Message-ID: <20050915181946.GH22503@thunk.org> (raw)
In-Reply-To: <20050915174658.GA9974@wohnheim.fh-wedel.de>

On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
> Correct.  But you're missing half of the problem.
> 
> The other half is getting O(1)  or O(ln(n)) performance for lookup on
> a large directory.  Solution to this half alone would be to use a
> balanced tree for directory entries.  Now, using a self-balancing tree
> would not solve your half of the problem, so it is not a full
> solution.
> 
> Ext2, as far as I understand it, solved both halves by having both a
> hashtree and the sequential list.  Getdents is walking the list and
> ignoring entries with the "deleted" flag.  Lookup is using the
> hashtree, which points to the list for actual data.

Actually, no.  What we do is return the directory entries in hash sort
order, by walking the btree.  And we use the hash as the offset
returned in d_off and via telldir().  The reason for this?  So that if
we add files which causes a node to split, that we still only return
files in the directory once and only once.  If we traversed the tree
in sequential order only, then some files could be returned multiple
times, and some files returned not at all after a node split
operation.

This is part of the "hair" that is needed in order to have a
POSIX-compliant (and non-buggy) readdir/telldir implementation.  The
downside of returning files in hash sort order is that an application
which tries to stat or open every single file in a directory in
readdir() order will no longer do so in rough creation order, and this
can cause the disk heads to seek all over the inode table to read the
inode.  This performance loss can fixed by sorting the files by inode
number before opening them.  This can be done in the application, or
via a LD_PRELOAD library, which reads the entire directory at
opendir() time, and then sorts the directory and returns it in sorted
inode order.  (Why can't we do this in the kernel? because for a very
large directory, we would have to allocate a huge amount of
non-swappable memory to store all of the filenames in the directory.)

As I said, implementing POSIX-compliant
opendir/readdir/telldir/seekdir is tricky for some filesystems, and
filesystem engineers often curse POSIX telldir/seekdir.  But they are
a fact of life, and we have to deal with them.

						- Ted

  reply	other threads:[~2005-09-15 18:20 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-09-15 13:57 Expected getdents behaviour Akshat Aranya
2005-09-15 14:03 ` Peter Staubach
2005-09-15 14:07 ` Anton Altaparmakov
2005-09-15 14:12   ` Anton Altaparmakov
2005-09-15 14:45     ` Miklos Szeredi
2005-09-15 15:17       ` Anton Altaparmakov
2005-09-15 16:41         ` Jan Blunck
2005-09-15 17:46           ` Jörn Engel
2005-09-15 18:19             ` Theodore Ts'o [this message]
2005-09-15 21:04               ` Anton Altaparmakov
2005-09-16  7:50                 ` Nikita Danilov
2005-09-15 21:47               ` Jörn Engel
2005-09-16  7:29               ` Nikita Danilov
2005-09-16 11:58                 ` Theodore Ts'o
2005-09-15 21:00             ` Anton Altaparmakov
2005-09-15 21:15               ` Charles P. Wright
2005-09-15 21:19                 ` Anton Altaparmakov
2005-09-15 20:28           ` Anton Altaparmakov
2005-09-15 16:51         ` Miklos Szeredi
2005-09-15 21:17           ` Anton Altaparmakov
2005-09-15 15:51     ` Theodore Ts'o
2005-09-15 16:52       ` Bryan Henderson
2005-09-15 16:57         ` Jeremy Allison
2005-09-15 20:51           ` Anton Altaparmakov
2005-09-15 20:50         ` Anton Altaparmakov
2005-09-15 23:41           ` Bryan Henderson
2005-09-15 20:25       ` Anton Altaparmakov
2005-09-16  3:39         ` Theodore Ts'o
2005-09-16 11:57           ` Dave Kleikamp
2005-09-15 18:08     ` Nikita Danilov
2005-09-16 11:23       ` Miklos Szeredi
2005-09-16  1:28   ` tridge

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050915181946.GH22503@thunk.org \
    --to=tytso@mit.edu \
    --cc=aaranya@cs.sunysb.edu \
    --cc=aia21@cam.ac.uk \
    --cc=jblunck@suse.de \
    --cc=joern@wohnheim.fh-wedel.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).