From: "Theodore Ts'o" <tytso@mit.edu>
To: Zeno Endemann <zeno.endemann@mailbox.org>
Cc: linux-ext4@vger.kernel.org
Subject: Re: ext4: Question about directory entry minor hash usage (documentation error?)
Date: Wed, 24 Sep 2025 15:37:12 -0400 [thread overview]
Message-ID: <20250924193712.GD19231@mit.edu> (raw)
In-Reply-To: <51a89ced-228d-4fd0-9613-3b4d027d9162@mailbox.org>
On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote:
>
> The documentation of hash tree directories claims that interior nodes are
> "indexed by a minor hash".
Yes, that's incorrect.
> However from my current understanding of the code, it seems to me the node
> splitting works somewhat like regular B-trees, and there is no re-sorting
> with a minor hash going on. The minor hash doesn't influence at all the
> on-disk data structure, and is only used for sorting in a kernel internal
> rb-tree. Is this correct? If so, I could offer to write up a patch for the
> documentation.
The hash is used as the directory cookie for the puroses of telldir(2)
and seekdir(2) on 32-bit systems. On 64-bit systems, we use the hash
plus the minor_hash to form a 64-bit cookie, or "directory offset".
This is also used for nfsv2 (which suports a 32-bit directory cookie),
ad nfsv3 (which supports a 64-bit directory cookie).
> As a side question, I was wondering a bit why the kernel differentiates
> between htree-indexed dirs and others when simply iterating over it (as in
> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
> would want to iterate over the entries in hash tree order.
The reason for this is POSIX requirements for how readdir(), telldir()
and seekdir works. When your use readdir() it must return each
directory entry once and only once. And if while you are calling
readdir() a directory entry gets deleted or added, the added or
removed directory entry must be returned zero or one times, and all
other directory entries exactly once. If you use telldir() to save a
position in the readdirt() stream, and then go back to it using
seekdir(), the rules of "exactly once unless entry is added/removed in
which case it is returned zero or one times" apply.
If you are using a linear directory structure, like the old V7 Unix,
this works fine. But if you are usig a b-tree, where an interior node
might get split, with half of the directory entries getting moved to a
newly allocated block, the POSIX readdir() semantics are extremely
difficult to implement *unless* you interate over the directory in
hash tree order.
(There are other solutions, but they are much more complicated.)
- Ted
next prev parent reply other threads:[~2025-09-24 19:37 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-24 16:04 ext4: Question about directory entry minor hash usage (documentation error?) Zeno Endemann
2025-09-24 19:37 ` Theodore Ts'o [this message]
2025-09-25 15:58 ` Zeno Endemann
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250924193712.GD19231@mit.edu \
--to=tytso@mit.edu \
--cc=linux-ext4@vger.kernel.org \
--cc=zeno.endemann@mailbox.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.