linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Dilger <adilger@sun.com>
To: Zhang Huan <zhhuan@gmail.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Question on readdir implementation
Date: Tue, 15 Sep 2009 08:41:41 -0600	[thread overview]
Message-ID: <20090915144141.GS2537@webber.adilger.int> (raw)
In-Reply-To: <20090915095724.GA8440@zhanghuan.nrchpc.ac.cn>

On Sep 15, 2009  17:57 +0800, Zhang Huan wrote:
> I'm reading EXT4 codes and has some questions about readdir
> implementation.
> 
> Why traverse the directory in hash order? This brings lots of code to
> build and traverse a red-black tree. Why not just plainly traverse the
> directory's blocks?

Because htree does not maintain constant ordering of directory entries.
Consider a readdir running at the same time as files are being added:

readdir proc		create proc
read block 0
read block 1
			create new file
			    hash of filename puts entry into block 0
			    block 0 is full, split it
				add new block 3
				copy 1/2 of block 0 entries to block 3
				add new filename to block 0
read block 2
read block 3

When the readdir returns block 3, 1/2 of the entries in that block
are the same as were returned with the original block 0 read.  This
is a violation of POSIX readdir() semantics that require each existing
entry only be returned a single time (it doesn't matter if the new
filename is returned or not).
			
> Since the red-black tree is built every time a NFS readdir request comes
> in, in case of hash collision, the nfs client may receive duplicate dir
> entries if the buffer is not large enough to return all entries with the
> same hash value in once.

This is because NFSv2 only has a 32-bit cookie value, and if there is a
whole buffer full of values with the same 32-bit hash there isn't anything
that can be done about it except drop those extra duplicates (a very rare
case).  It would have a much more serious problem if there was a directory
larger than 2^32 bytes in size, which would result in all entries beyond
2GB being lost.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


  reply	other threads:[~2009-09-15 14:41 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-15  9:57 Question on readdir implementation Zhang Huan
2009-09-15 14:41 ` Andreas Dilger [this message]
2009-09-15 14:53 ` Theodore Tso
2009-09-15 17:56   ` Florian Weimer
2009-09-15 18:38     ` Theodore Tso
2009-09-16  5:47   ` Zhang Huan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090915144141.GS2537@webber.adilger.int \
    --to=adilger@sun.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=zhhuan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).