linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ext4: Question about directory entry minor hash usage (documentation error?)
@ 2025-09-24 16:04 Zeno Endemann
  2025-09-24 19:37 ` Theodore Ts'o
  0 siblings, 1 reply; 3+ messages in thread
From: Zeno Endemann @ 2025-09-24 16:04 UTC (permalink / raw)
  To: linux-ext4

Hello,

The documentation of hash tree directories claims that interior nodes are
"indexed by a minor hash".

However from my current understanding of the code, it seems to me the node
splitting works somewhat like regular B-trees, and there is no re-sorting
with a minor hash going on. The minor hash doesn't influence at all the
on-disk data structure, and is only used for sorting in a kernel internal
rb-tree. Is this correct? If so, I could offer to write up a patch for the
documentation.

As a side question, I was wondering a bit why the kernel differentiates
between htree-indexed dirs and others when simply iterating over it (as in
e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
would want to iterate over the entries in hash tree order.


Thanks and cheers,
Zeno Endemann

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: ext4: Question about directory entry minor hash usage (documentation error?)
  2025-09-24 16:04 ext4: Question about directory entry minor hash usage (documentation error?) Zeno Endemann
@ 2025-09-24 19:37 ` Theodore Ts'o
  2025-09-25 15:58   ` Zeno Endemann
  0 siblings, 1 reply; 3+ messages in thread
From: Theodore Ts'o @ 2025-09-24 19:37 UTC (permalink / raw)
  To: Zeno Endemann; +Cc: linux-ext4

On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote:
> 
> The documentation of hash tree directories claims that interior nodes are
> "indexed by a minor hash".

Yes, that's incorrect.

> However from my current understanding of the code, it seems to me the node
> splitting works somewhat like regular B-trees, and there is no re-sorting
> with a minor hash going on. The minor hash doesn't influence at all the
> on-disk data structure, and is only used for sorting in a kernel internal
> rb-tree. Is this correct? If so, I could offer to write up a patch for the
> documentation.

The hash is used as the directory cookie for the puroses of telldir(2)
and seekdir(2) on 32-bit systems.  On 64-bit systems, we use the hash
plus the minor_hash to form a 64-bit cookie, or "directory offset".
This is also used for nfsv2 (which suports a 32-bit directory cookie),
ad nfsv3 (which supports a 64-bit directory cookie).

> As a side question, I was wondering a bit why the kernel differentiates
> between htree-indexed dirs and others when simply iterating over it (as in
> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
> would want to iterate over the entries in hash tree order.

The reason for this is POSIX requirements for how readdir(), telldir()
and seekdir works.  When your use readdir() it must return each
directory entry once and only once.  And if while you are calling
readdir() a directory entry gets deleted or added, the added or
removed directory entry must be returned zero or one times, and all
other directory entries exactly once.  If you use telldir() to save a
position in the readdirt() stream, and then go back to it using
seekdir(), the rules of "exactly once unless entry is added/removed in
which case it is returned zero or one times" apply.

If you are using a linear directory structure, like the old V7 Unix,
this works fine.  But if you are usig a b-tree, where an interior node
might get split, with half of the directory entries getting moved to a
newly allocated block, the POSIX readdir() semantics are extremely
difficult to implement *unless* you interate over the directory in
hash tree order.

(There are other solutions, but they are much more complicated.)

      	  		       	    	- Ted

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: ext4: Question about directory entry minor hash usage (documentation error?)
  2025-09-24 19:37 ` Theodore Ts'o
@ 2025-09-25 15:58   ` Zeno Endemann
  0 siblings, 0 replies; 3+ messages in thread
From: Zeno Endemann @ 2025-09-25 15:58 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

Theodore Ts'o wrote on 24.09.25 21:37:
> On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote:
>>
>> The documentation of hash tree directories claims that interior nodes are
>> "indexed by a minor hash".
> 
> Yes, that's incorrect.
> 
>> However from my current understanding of the code, it seems to me the node
>> splitting works somewhat like regular B-trees, and there is no re-sorting
>> with a minor hash going on. The minor hash doesn't influence at all the
>> on-disk data structure, and is only used for sorting in a kernel internal
>> rb-tree. Is this correct? If so, I could offer to write up a patch for the
>> documentation.
> 
> The hash is used as the directory cookie for the puroses of telldir(2)
> and seekdir(2) on 32-bit systems.  On 64-bit systems, we use the hash
> plus the minor_hash to form a 64-bit cookie, or "directory offset".
> This is also used for nfsv2 (which suports a 32-bit directory cookie),
> ad nfsv3 (which supports a 64-bit directory cookie).
> 
>> As a side question, I was wondering a bit why the kernel differentiates
>> between htree-indexed dirs and others when simply iterating over it (as in
>> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
>> would want to iterate over the entries in hash tree order.
> 
> The reason for this is POSIX requirements for how readdir(), telldir()
> and seekdir works.  When your use readdir() it must return each
> directory entry once and only once.  And if while you are calling
> readdir() a directory entry gets deleted or added, the added or
> removed directory entry must be returned zero or one times, and all
> other directory entries exactly once.  If you use telldir() to save a
> position in the readdirt() stream, and then go back to it using
> seekdir(), the rules of "exactly once unless entry is added/removed in
> which case it is returned zero or one times" apply.
> 
> If you are using a linear directory structure, like the old V7 Unix,
> this works fine.  But if you are usig a b-tree, where an interior node
> might get split, with half of the directory entries getting moved to a
> newly allocated block, the POSIX readdir() semantics are extremely
> difficult to implement *unless* you interate over the directory in
> hash tree order.
> 
> (There are other solutions, but they are much more complicated.)

I see, that makes sense. Thanks for the explanation.



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-09-25 15:59 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 16:04 ext4: Question about directory entry minor hash usage (documentation error?) Zeno Endemann
2025-09-24 19:37 ` Theodore Ts'o
2025-09-25 15:58   ` Zeno Endemann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).