From: Theodore Tso <tytso@mit.edu>
To: Neil Brown <neilb@suse.de>
Cc: "Jörn Engel" <joern@lazybastard.org>,
"H. Peter Anvin" <hpa@zytor.com>,
"Christoph Hellwig" <hch@infradead.org>,
"Ulrich Drepper" <drepper@gmail.com>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Subject: Re: If not readdir() then what?
Date: Thu, 12 Apr 2007 08:21:16 -0400 [thread overview]
Message-ID: <20070412122116.GD28148@thunk.org> (raw)
In-Reply-To: <17949.51797.386833.917451@notabene.brown>
On Thu, Apr 12, 2007 at 03:57:41PM +1000, Neil Brown wrote:
> But my perspective is that a solution in nfsd at-best a work-around.
> Caching the whole 'struct file' when there is just a small bit that we
> might want seems like a heavy hammer. The filesystem is in the best
> place to know what needs to be cached, and it should be the one doing
> the caching.
Sure, but ext3 needs to cache the information in the file handle,
because it's dynamic, per-directory stream information that needs to
be cached. The fundamental problem is the broken nature of the 64-bit
cookie; it simply isn't big enough. So what's being disputed is who
gets to pay the cost of that particular design mistake, in POSIX and
in NFS?
In the POSIX case, right now only applications that use
telldir/seekdir pay the cost, which is they might see some repeated
directory entries in the case of hash collisions.
Unfortunately, in the NFS case if there are hash collisions, under the
wrong circumstances the NFS client could loop forever trying to
readdir() a directory stream.
> This is a simple consequence of the design decision to use hashes as
> the search key. They aren't dense and they will collide. So the
> solution will be a bit fuzzy around the edges. And maybe that is an
> acceptable tradeoff. But the filesystem should take full
> responsibility for it, whether in performance or correctness :-)
Well, we could also say that it is NFS's fault that they used a
limited size cookie as a fundamental part of their protocol....
> But there are alternatives. e.g. internal chaining.
> Insist on a unique 64bit hash for every file. If the hash is in use,
> increment and try again. On lookup, if the hash leads you to a file
> with the wrong name, increment and try again until you find a hole
> (hash value that is not stored). When you delete an entry, leave a
> place holder if the next hash is in use. Conversely if the next hash
> is not in use, delete the entry and delete the previous one if it is a
> place holder.
This solution requires an incompatible file format change. In
addition, it means trying to garbage collect directory entries when at
the beginning or end of a directory block without dragging in the
previous or next directory block. This is a huge amount of hair, will
screw over performance, and is not compatible with how we do things
today.
(It also means that you have to store the hash in the directory entry,
which we don't do today, since we can always calculate the hash from
the file name. But if in some cases the hash is going to be some
small integer plus the calculated hash, you have to bloat the
directory entry by an extra 8 bytes per dirent.)
Again, compared to a directory fd cache, what you're proposing a huge
hit to the filesystem, and at the moment, given that telldir/seekdir
is rarely used by everyone else, it's mainly NFS which is the main bad
actor here by insisting on the use of a small 31/63-bit cookie as a
condition of protocol correctness.
- Ted
next prev parent reply other threads:[~2007-04-12 12:22 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-04-07 16:57 If not readdir() then what? Ulrich Drepper
2007-04-07 20:36 ` Theodore Tso
2007-04-07 23:30 ` Christoph Hellwig
2007-04-08 18:11 ` H. Peter Anvin
2007-04-08 18:41 ` Jörn Engel
2007-04-08 19:19 ` Theodore Tso
2007-04-08 19:26 ` Ulrich Drepper
2007-04-08 19:28 ` H. Peter Anvin
2007-04-08 19:40 ` Ulrich Drepper
2007-04-09 1:44 ` Theodore Tso
2007-04-09 11:09 ` Jörn Engel
2007-04-09 12:29 ` Trond Myklebust
2007-04-09 12:31 ` Trond Myklebust
2007-04-09 13:19 ` Theodore Tso
2007-04-09 14:03 ` Trond Myklebust
2007-04-09 16:34 ` Jan Engelhardt
2007-04-09 17:00 ` Trond Myklebust
2007-04-10 13:56 ` Theodore Tso
2007-04-10 14:10 ` Ulrich Drepper
2007-04-10 15:48 ` H. Peter Anvin
2007-04-10 16:42 ` Ulrich Drepper
2007-04-10 14:37 ` Trond Myklebust
2007-04-10 15:54 ` Jan Engelhardt
2007-04-10 16:18 ` H. Peter Anvin
2007-04-10 16:25 ` Valdis.Kletnieks
2007-04-10 21:12 ` Neil Brown
2007-04-10 21:16 ` H. Peter Anvin
2007-04-10 21:43 ` Neil Brown
2007-04-10 21:18 ` Trond Myklebust
2007-04-10 21:37 ` Neil Brown
2007-04-10 21:57 ` Bob Copeland
2007-04-10 21:59 ` Trond Myklebust
2007-04-10 22:33 ` Neil Brown
2007-04-11 0:22 ` Trond Myklebust
2007-04-11 1:45 ` Bernd Eckenfels
2007-04-10 21:46 ` Alan Cox
2007-04-10 21:26 ` Neil Brown
2007-04-09 12:46 ` Andreas Schwab
2007-04-10 21:15 ` Neil Brown
2007-04-11 13:57 ` Jan Engelhardt
2007-04-11 14:42 ` Theodore Tso
2007-04-11 22:32 ` Neil Brown
2007-04-11 22:06 ` David Lang
2007-04-11 23:23 ` H. Peter Anvin
2007-04-11 23:33 ` Jörn Engel
2007-04-12 0:00 ` Neil Brown
2007-04-11 23:22 ` Theodore Tso
2007-04-12 1:46 ` Neil Brown
2007-04-12 2:37 ` Jörn Engel
2007-04-12 5:57 ` Neil Brown
2007-04-12 9:33 ` Jörn Engel
2007-04-12 12:21 ` Theodore Tso [this message]
2007-04-12 17:18 ` J. Bruce Fields
2007-04-12 17:35 ` H. Peter Anvin
2007-04-16 3:05 ` Theodore Tso
2007-04-16 5:47 ` Neil Brown
2007-04-16 10:39 ` Theodore Tso
2007-04-16 6:18 ` Neil Brown
2007-04-16 11:07 ` Theodore Tso
2007-04-16 23:24 ` Neil Brown
2007-04-08 18:47 ` Theodore Tso
2007-04-08 19:13 ` H. Peter Anvin
2007-04-08 18:50 ` Ulrich Drepper
2007-04-07 23:44 ` Jan Engelhardt
2007-04-08 20:36 ` J. Bruce Fields
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070412122116.GD28148@thunk.org \
--to=tytso@mit.edu \
--cc=drepper@gmail.com \
--cc=hch@infradead.org \
--cc=hpa@zytor.com \
--cc=joern@lazybastard.org \
--cc=linux-kernel@vger.kernel.org \
--cc=neilb@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox