From: Andreas Dilger <adilger@sun.com>
To: Zhang Huan <zhhuan@gmail.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: Question on readdir implementation
Date: Tue, 15 Sep 2009 08:41:41 -0600 [thread overview]
Message-ID: <20090915144141.GS2537@webber.adilger.int> (raw)
In-Reply-To: <20090915095724.GA8440@zhanghuan.nrchpc.ac.cn>
On Sep 15, 2009 17:57 +0800, Zhang Huan wrote:
> I'm reading EXT4 codes and has some questions about readdir
> implementation.
>
> Why traverse the directory in hash order? This brings lots of code to
> build and traverse a red-black tree. Why not just plainly traverse the
> directory's blocks?
Because htree does not maintain constant ordering of directory entries.
Consider a readdir running at the same time as files are being added:
readdir proc create proc
read block 0
read block 1
create new file
hash of filename puts entry into block 0
block 0 is full, split it
add new block 3
copy 1/2 of block 0 entries to block 3
add new filename to block 0
read block 2
read block 3
When the readdir returns block 3, 1/2 of the entries in that block
are the same as were returned with the original block 0 read. This
is a violation of POSIX readdir() semantics that require each existing
entry only be returned a single time (it doesn't matter if the new
filename is returned or not).
> Since the red-black tree is built every time a NFS readdir request comes
> in, in case of hash collision, the nfs client may receive duplicate dir
> entries if the buffer is not large enough to return all entries with the
> same hash value in once.
This is because NFSv2 only has a 32-bit cookie value, and if there is a
whole buffer full of values with the same 32-bit hash there isn't anything
that can be done about it except drop those extra duplicates (a very rare
case). It would have a much more serious problem if there was a directory
larger than 2^32 bytes in size, which would result in all entries beyond
2GB being lost.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
next prev parent reply other threads:[~2009-09-15 14:41 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-09-15 9:57 Question on readdir implementation Zhang Huan
2009-09-15 14:41 ` Andreas Dilger [this message]
2009-09-15 14:53 ` Theodore Tso
2009-09-15 17:56 ` Florian Weimer
2009-09-15 18:38 ` Theodore Tso
2009-09-16 5:47 ` Zhang Huan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090915144141.GS2537@webber.adilger.int \
--to=adilger@sun.com \
--cc=linux-ext4@vger.kernel.org \
--cc=zhhuan@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).