public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Theodore Tso <tytso@mit.edu>
To: Neil Brown <neilb@suse.de>
Cc: "Jörn Engel" <joern@lazybastard.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Ulrich Drepper" <drepper@gmail.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Subject: Re: If not readdir() then what?
Date: Wed, 11 Apr 2007 19:22:24 -0400	[thread overview]
Message-ID: <20070411232224.GF17778@thunk.org> (raw)
In-Reply-To: <17949.25061.739035.688232@notabene.brown>

On Thu, Apr 12, 2007 at 08:32:05AM +1000, Neil Brown wrote:
> For the first:
>   You are storing an internal tree representation of part of the
>   directory attached to the 'struct file'.
>   Would it be possible to store this attached to the page via the
>   ->private pointer?  What would avoid the allocate/create/free
>   overhead on every request.

The reason why we are storing it associated with the file pointer
instead of the page/block is because the a filename insertion might
cause a node split, in which case half of the 4k block gets copied to
another block.  We need a stable pointer to where we are in the tree
that can cope with hash collisions, and that's the reason for creating
red/black tree in the first place, since it *doesn't* get split and
reorganized when the directory's hash tree gets reorg'ed.  So
attaching the tree to the page breaks the reason why we have the
separate data structure in the first place.

>   You suggest caching the open files in nfsd.  While that is probably
>   possible (I've thought of it a number of times) is would also be
>   quite complex, e.g. requiring some sort of call-back to close all
>   those files when the filesystem is unexported.  And it is very easy
>   to get caching heuristics wrong.  Leveraging the page-cache which is
>   a very mature cache seems to make a lot of sense.

Is it really that complex?  The simplest way of handling it is simply
keeping a open directory fd cache in a per-filesystem rbtree index
which is indexed by file handle and contains the file pointer.  When
you unexport the filesystem, you simply walk the rbtree and close all
of the file descriptors; no callback is required.

The caching hueristics are an issue; but fixed-size cache with a
simple LFU replacement strategy isn't all that complex to implement.
If 95% of the time, the readdir's come in quick succession, even a
small cache will probably provide huge performance gains, and
increaing the cache size past some critical point will probably only
provide marginal improvements.  

> For the second.
>   You say that you " would need at least 96 bits in order to make that
>   guarantee; 64 bits of hash, plus a 32-bit count value in the hash
>   collision chain".  I think 96 is a bit greedy.  Surely 48 bits of 
>   hash and 16 bits of collision-chain-position would plenty.  You would
>   need 65537 entries before a collision was even possible, and
>   billions before it was at all likely. (How big does a set of 48bit
>   numbers have to get before the probability that "No subset of 65536
>   numbers are all the same" drops below 0.95?)
> 
>   This would really require that the collision-chain-index was stable
>   across create/delete.  Doing that while you have the tree in the
>   page cache is probably easy enough.  Doing it across reboots is
>   probably not possible without on-disk changes.

Actually, no, we can't keep the collision chain count stable across a
create/delete even while the tree is cached.  At least, not without
storing a huge amount of state associated with each page.  (It would
be a lot more work than simply having nfsd keep a fd cache for
directory streams ;-).

If we need create/delete stability, probably our only sane
implementation choice is to just stick with a 63-bit hash, and cross
our fingers and hope for the best.  

If nfsd caches the last N used directory caches, where N is roughly
proportional to the number of active clients, and the clients all only
use the last cookie returned in the readdir entry (since it would be
stupid to use one of the earlier ones and request the server to
re-send something which the client already has), at least in the
absense of telldir/seekdir calls, then that might be quite sufficient,
even if we return multiple direntory entries which contain hash
collisions to the client.  As long as the directory fd is cached, and
the client just uses the last cookie to fetch the next batch of
dirents, we'll be fine.

						- Ted

  parent reply	other threads:[~2007-04-11 23:23 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-07 16:57 If not readdir() then what? Ulrich Drepper
2007-04-07 20:36 ` Theodore Tso
2007-04-07 23:30   ` Christoph Hellwig
2007-04-08 18:11     ` H. Peter Anvin
2007-04-08 18:41       ` Jörn Engel
2007-04-08 19:19         ` Theodore Tso
2007-04-08 19:26           ` Ulrich Drepper
2007-04-08 19:28           ` H. Peter Anvin
2007-04-08 19:40             ` Ulrich Drepper
2007-04-09  1:44             ` Theodore Tso
2007-04-09 11:09               ` Jörn Engel
2007-04-09 12:29                 ` Trond Myklebust
2007-04-09 12:31                 ` Trond Myklebust
2007-04-09 13:19                   ` Theodore Tso
2007-04-09 14:03                     ` Trond Myklebust
2007-04-09 16:34                       ` Jan Engelhardt
2007-04-09 17:00                         ` Trond Myklebust
2007-04-10 13:56                       ` Theodore Tso
2007-04-10 14:10                         ` Ulrich Drepper
2007-04-10 15:48                           ` H. Peter Anvin
2007-04-10 16:42                             ` Ulrich Drepper
2007-04-10 14:37                         ` Trond Myklebust
2007-04-10 15:54                           ` Jan Engelhardt
2007-04-10 16:18                             ` H. Peter Anvin
2007-04-10 16:25                             ` Valdis.Kletnieks
2007-04-10 21:12                           ` Neil Brown
2007-04-10 21:16                             ` H. Peter Anvin
2007-04-10 21:43                               ` Neil Brown
2007-04-10 21:18                             ` Trond Myklebust
2007-04-10 21:37                               ` Neil Brown
2007-04-10 21:57                                 ` Bob Copeland
2007-04-10 21:59                                 ` Trond Myklebust
2007-04-10 22:33                                   ` Neil Brown
2007-04-11  0:22                                     ` Trond Myklebust
2007-04-11  1:45                                       ` Bernd Eckenfels
2007-04-10 21:46                             ` Alan Cox
2007-04-10 21:26                     ` Neil Brown
2007-04-09 12:46                 ` Andreas Schwab
2007-04-10 21:15         ` Neil Brown
2007-04-11 13:57           ` Jan Engelhardt
2007-04-11 14:42           ` Theodore Tso
2007-04-11 22:32             ` Neil Brown
2007-04-11 22:06               ` David Lang
2007-04-11 23:23                 ` H. Peter Anvin
2007-04-11 23:33                   ` Jörn Engel
2007-04-12  0:00                 ` Neil Brown
2007-04-11 23:22               ` Theodore Tso [this message]
2007-04-12  1:46                 ` Neil Brown
2007-04-12  2:37                   ` Jörn Engel
2007-04-12  5:57                     ` Neil Brown
2007-04-12  9:33                       ` Jörn Engel
2007-04-12 12:21                       ` Theodore Tso
2007-04-12 17:18                         ` J. Bruce Fields
2007-04-12 17:35                           ` H. Peter Anvin
2007-04-16  3:05                             ` Theodore Tso
2007-04-16  5:47                               ` Neil Brown
2007-04-16 10:39                                 ` Theodore Tso
2007-04-16  6:18                         ` Neil Brown
2007-04-16 11:07                           ` Theodore Tso
2007-04-16 23:24                             ` Neil Brown
2007-04-08 18:47       ` Theodore Tso
2007-04-08 19:13         ` H. Peter Anvin
2007-04-08 18:50     ` Ulrich Drepper
2007-04-07 23:44   ` Jan Engelhardt
2007-04-08 20:36   ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070411232224.GF17778@thunk.org \
    --to=tytso@mit.edu \
    --cc=drepper@gmail.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=joern@lazybastard.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox