* ext4: Question about directory entry minor hash usage (documentation error?) @ 2025-09-24 16:04 Zeno Endemann 2025-09-24 19:37 ` Theodore Ts'o 0 siblings, 1 reply; 3+ messages in thread From: Zeno Endemann @ 2025-09-24 16:04 UTC (permalink / raw) To: linux-ext4 Hello, The documentation of hash tree directories claims that interior nodes are "indexed by a minor hash". However from my current understanding of the code, it seems to me the node splitting works somewhat like regular B-trees, and there is no re-sorting with a minor hash going on. The minor hash doesn't influence at all the on-disk data structure, and is only used for sorting in a kernel internal rb-tree. Is this correct? If so, I could offer to write up a patch for the documentation. As a side question, I was wondering a bit why the kernel differentiates between htree-indexed dirs and others when simply iterating over it (as in e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one would want to iterate over the entries in hash tree order. Thanks and cheers, Zeno Endemann ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: ext4: Question about directory entry minor hash usage (documentation error?) 2025-09-24 16:04 ext4: Question about directory entry minor hash usage (documentation error?) Zeno Endemann @ 2025-09-24 19:37 ` Theodore Ts'o 2025-09-25 15:58 ` Zeno Endemann 0 siblings, 1 reply; 3+ messages in thread From: Theodore Ts'o @ 2025-09-24 19:37 UTC (permalink / raw) To: Zeno Endemann; +Cc: linux-ext4 On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote: > > The documentation of hash tree directories claims that interior nodes are > "indexed by a minor hash". Yes, that's incorrect. > However from my current understanding of the code, it seems to me the node > splitting works somewhat like regular B-trees, and there is no re-sorting > with a minor hash going on. The minor hash doesn't influence at all the > on-disk data structure, and is only used for sorting in a kernel internal > rb-tree. Is this correct? If so, I could offer to write up a patch for the > documentation. The hash is used as the directory cookie for the puroses of telldir(2) and seekdir(2) on 32-bit systems. On 64-bit systems, we use the hash plus the minor_hash to form a 64-bit cookie, or "directory offset". This is also used for nfsv2 (which suports a 32-bit directory cookie), ad nfsv3 (which supports a 64-bit directory cookie). > As a side question, I was wondering a bit why the kernel differentiates > between htree-indexed dirs and others when simply iterating over it (as in > e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one > would want to iterate over the entries in hash tree order. The reason for this is POSIX requirements for how readdir(), telldir() and seekdir works. When your use readdir() it must return each directory entry once and only once. And if while you are calling readdir() a directory entry gets deleted or added, the added or removed directory entry must be returned zero or one times, and all other directory entries exactly once. If you use telldir() to save a position in the readdirt() stream, and then go back to it using seekdir(), the rules of "exactly once unless entry is added/removed in which case it is returned zero or one times" apply. If you are using a linear directory structure, like the old V7 Unix, this works fine. But if you are usig a b-tree, where an interior node might get split, with half of the directory entries getting moved to a newly allocated block, the POSIX readdir() semantics are extremely difficult to implement *unless* you interate over the directory in hash tree order. (There are other solutions, but they are much more complicated.) - Ted ^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: ext4: Question about directory entry minor hash usage (documentation error?) 2025-09-24 19:37 ` Theodore Ts'o @ 2025-09-25 15:58 ` Zeno Endemann 0 siblings, 0 replies; 3+ messages in thread From: Zeno Endemann @ 2025-09-25 15:58 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Theodore Ts'o wrote on 24.09.25 21:37: > On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote: >> >> The documentation of hash tree directories claims that interior nodes are >> "indexed by a minor hash". > > Yes, that's incorrect. > >> However from my current understanding of the code, it seems to me the node >> splitting works somewhat like regular B-trees, and there is no re-sorting >> with a minor hash going on. The minor hash doesn't influence at all the >> on-disk data structure, and is only used for sorting in a kernel internal >> rb-tree. Is this correct? If so, I could offer to write up a patch for the >> documentation. > > The hash is used as the directory cookie for the puroses of telldir(2) > and seekdir(2) on 32-bit systems. On 64-bit systems, we use the hash > plus the minor_hash to form a 64-bit cookie, or "directory offset". > This is also used for nfsv2 (which suports a 32-bit directory cookie), > ad nfsv3 (which supports a 64-bit directory cookie). > >> As a side question, I was wondering a bit why the kernel differentiates >> between htree-indexed dirs and others when simply iterating over it (as in >> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one >> would want to iterate over the entries in hash tree order. > > The reason for this is POSIX requirements for how readdir(), telldir() > and seekdir works. When your use readdir() it must return each > directory entry once and only once. And if while you are calling > readdir() a directory entry gets deleted or added, the added or > removed directory entry must be returned zero or one times, and all > other directory entries exactly once. If you use telldir() to save a > position in the readdirt() stream, and then go back to it using > seekdir(), the rules of "exactly once unless entry is added/removed in > which case it is returned zero or one times" apply. > > If you are using a linear directory structure, like the old V7 Unix, > this works fine. But if you are usig a b-tree, where an interior node > might get split, with half of the directory entries getting moved to a > newly allocated block, the POSIX readdir() semantics are extremely > difficult to implement *unless* you interate over the directory in > hash tree order. > > (There are other solutions, but they are much more complicated.) I see, that makes sense. Thanks for the explanation. ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-09-25 15:59 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-09-24 16:04 ext4: Question about directory entry minor hash usage (documentation error?) Zeno Endemann 2025-09-24 19:37 ` Theodore Ts'o 2025-09-25 15:58 ` Zeno Endemann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).