public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Theodore Tso <tytso@mit.edu>
To: Neil Brown <neilb@suse.de>
Cc: "Jörn Engel" <joern@lazybastard.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Ulrich Drepper" <drepper@gmail.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Subject: Re: If not readdir() then what?
Date: Mon, 16 Apr 2007 07:07:02 -0400	[thread overview]
Message-ID: <20070416110702.GF27533@thunk.org> (raw)
In-Reply-To: <17955.5453.164637.970405@notabene.brown>

On Mon, Apr 16, 2007 at 04:18:53PM +1000, Neil Brown wrote:
> On Thursday April 12, tytso@mit.edu wrote:
> > 
> > Unfortunately, in the NFS case if there are hash collisions, under the
> > wrong circumstances the NFS client could loop forever trying to
> > readdir() a directory stream.
> 
> I think that can be easily avoided by the filesystem by simply
> ensuring that the cookies returned in a request are all *after* the
> 'llseek' cookie.  There still maybe some risk of missing or extra
> entries in the directory if the filesystem cannot perform a precise
> cookie->position mapping, but an infinite loop should not be a problem.

It is true that you can replace the infinite loop problem by
guaranteeing that certain filenames are simply always missing.
Neither seemed desirable to me, but it is true that if we have to err
in one direction or another, simply skipping files in the case of a
hash collision is probably a better result.

> You still haven't convinced me that the limited size is a fundamental
> problem.
> In the case of ext3,  I have suggested (various things including)
> using 56bits of the hash and an 8 bit sequence number through any
> hash chain that may eventuate.
> The seq number could not be stable across reboots with the current
> on-disk data, but could be made stable for almost all real usage with
> a little bit of caching (I imagine using upper 4 bits to give a
> sequence number while reading from disk, and lower 4 bits to give a
> sequence number when adding new entries).

The challenge is making it be stable across inserts/deletes, never
mind reboots.  And it's not a "little bit of cacheing"; in order to be
correct we would have to cache *forever*, since at least in theory an
NFS client could hold on to a cookie for an arbitrarily long period of
time (weeks, months, years, decades), yes?

> Supposing I were to try my hand at making the modifications I
> suggested to ext3 so that per page/hashvalue rbtrees were cached in
> the pagecache and were used to provide a stable-as-possible telldir
> cookie.

Well, the first problem you'll run into is that ext3 doesn't store
directories in the page cache.  That's because the journaling layer
fs/jbd operates on a buffer cache basis (i.e., on a per filesystem
block level).  

Secondly, storing the rbtree in the page cache doesn't help, because
the whole point of the rbtree is to have a stable pointer that doesn't
change across a b-tree node split.  Whether you use a page cache or
the buffer cache, it doesn't change the fact that when you do a b-tree
node split, half the directory entries *go* *someplace* *else*.  Put
another way, Linus's standard reasons for wanting to store the
directory in the page cache so we can have better readahead don't
apply when you're using htree, because with htree we're never reading
entries in page cache order.

So you wouldn't be able to store the rbtree in the page cache, since
the doing so would defeat the whole purpose of the having the rbtree
in the first place.  The rbtree needs to be per directory stream,
*not* per directory block.   

You could store a mapping between each directory entry and a sequence
number.  That could work, modulo the need to store these entries
*forever*.  But now you take a performance hit since every time you
create or delete a directory entry, you'll need to keep the cache up
to date.  And since we don't store the hash in the on-disk format
(since you can always calculate it from the filename), you'll need to
store the hash as well in the cache information.  So that means
storing a 64-bit hash plus the sequence counter information --- and if
we assume 32-bit alignment requirements for data structures, it means
the simplest way to do this would be to hang a data structure off of
each directory block which consumes 12 bytes of data for every
directory entry, plus either (a) extra space for pointers so you can
easily insert and delete entries as directory entries are created or
deleted, or (b) resigning yourself to use memmove() to insert and
delete entries in the linear array.  

You're welcome to try, but I suspect it won't take long before you'll
see why I'm asserting that a directory fd cache in nfsd is *way* less
work.  :-)

						- Ted

  reply	other threads:[~2007-04-16 11:08 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-07 16:57 If not readdir() then what? Ulrich Drepper
2007-04-07 20:36 ` Theodore Tso
2007-04-07 23:30   ` Christoph Hellwig
2007-04-08 18:11     ` H. Peter Anvin
2007-04-08 18:41       ` Jörn Engel
2007-04-08 19:19         ` Theodore Tso
2007-04-08 19:26           ` Ulrich Drepper
2007-04-08 19:28           ` H. Peter Anvin
2007-04-08 19:40             ` Ulrich Drepper
2007-04-09  1:44             ` Theodore Tso
2007-04-09 11:09               ` Jörn Engel
2007-04-09 12:29                 ` Trond Myklebust
2007-04-09 12:31                 ` Trond Myklebust
2007-04-09 13:19                   ` Theodore Tso
2007-04-09 14:03                     ` Trond Myklebust
2007-04-09 16:34                       ` Jan Engelhardt
2007-04-09 17:00                         ` Trond Myklebust
2007-04-10 13:56                       ` Theodore Tso
2007-04-10 14:10                         ` Ulrich Drepper
2007-04-10 15:48                           ` H. Peter Anvin
2007-04-10 16:42                             ` Ulrich Drepper
2007-04-10 14:37                         ` Trond Myklebust
2007-04-10 15:54                           ` Jan Engelhardt
2007-04-10 16:18                             ` H. Peter Anvin
2007-04-10 16:25                             ` Valdis.Kletnieks
2007-04-10 21:12                           ` Neil Brown
2007-04-10 21:16                             ` H. Peter Anvin
2007-04-10 21:43                               ` Neil Brown
2007-04-10 21:18                             ` Trond Myklebust
2007-04-10 21:37                               ` Neil Brown
2007-04-10 21:57                                 ` Bob Copeland
2007-04-10 21:59                                 ` Trond Myklebust
2007-04-10 22:33                                   ` Neil Brown
2007-04-11  0:22                                     ` Trond Myklebust
2007-04-11  1:45                                       ` Bernd Eckenfels
2007-04-10 21:46                             ` Alan Cox
2007-04-10 21:26                     ` Neil Brown
2007-04-09 12:46                 ` Andreas Schwab
2007-04-10 21:15         ` Neil Brown
2007-04-11 13:57           ` Jan Engelhardt
2007-04-11 14:42           ` Theodore Tso
2007-04-11 22:32             ` Neil Brown
2007-04-11 22:06               ` David Lang
2007-04-11 23:23                 ` H. Peter Anvin
2007-04-11 23:33                   ` Jörn Engel
2007-04-12  0:00                 ` Neil Brown
2007-04-11 23:22               ` Theodore Tso
2007-04-12  1:46                 ` Neil Brown
2007-04-12  2:37                   ` Jörn Engel
2007-04-12  5:57                     ` Neil Brown
2007-04-12  9:33                       ` Jörn Engel
2007-04-12 12:21                       ` Theodore Tso
2007-04-12 17:18                         ` J. Bruce Fields
2007-04-12 17:35                           ` H. Peter Anvin
2007-04-16  3:05                             ` Theodore Tso
2007-04-16  5:47                               ` Neil Brown
2007-04-16 10:39                                 ` Theodore Tso
2007-04-16  6:18                         ` Neil Brown
2007-04-16 11:07                           ` Theodore Tso [this message]
2007-04-16 23:24                             ` Neil Brown
2007-04-08 18:47       ` Theodore Tso
2007-04-08 19:13         ` H. Peter Anvin
2007-04-08 18:50     ` Ulrich Drepper
2007-04-07 23:44   ` Jan Engelhardt
2007-04-08 20:36   ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070416110702.GF27533@thunk.org \
    --to=tytso@mit.edu \
    --cc=drepper@gmail.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=joern@lazybastard.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox