Re: tree sorting..

Git development
 help / color / mirror / Atom feed

* Re: tree sorting..
       [not found] <Pine.LNX.4.64.0703160941290.3816@woody.linux-foundation.org>
@ 2007-03-16 23:13 ` Shawn O. Pearce
  2007-03-16 23:35   ` Linus Torvalds
  0 siblings, 1 reply; 2+ messages in thread
From: Shawn O. Pearce @ 2007-03-16 23:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

[adding in the list]
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On irc:
> 
>   spearce: personally the utf-8 commit encoding mess is less of a mess 
> 	then the "lets sort trees with / after their name, but not store 
> 	them that way!".
> 
> umm, we *do* store them that way, so I'm a bit confused?

Sorry, hasty IRC messages don't always read very well.  What I was
trying to say was that the decision to sort entries in the tree
the way that they are sorted is an annoying problem.

Entries in a tree must be unique by their name; that is I cannot have
"foo" as both a file and as a subtree.  An excellent constraint
to have.  But the position of the entry "foo" is different for a
file than for a subtree, as the subtree "foo" is sorted as though
it were "foo/" while the file "foo" is sorted by just "foo".

So there's additional complexity imposed.  I cannot just binary
search a tree and locate "foo", as "foo" could appear at one of two
different positions in that tree.  Sure I might know that the user
is asking for "foo/" (as the input was "foo/bar"); but what if I
search for "foo/" and in this particular tree there is a "foo",
but its a file?  I cannot conclude that foo doesn't exist here
without searching *again* for "foo" on its own.  Likewise if the
user asks for "foo" (no trailing slash).

The entire reason this happened is hysterical raisins.  Your first
draft of Git had trees as one flat structure, sorting complete
paths by their memcmp result.  Which is dead simple and worked.

But folks pointed out it would be more scalable to store each
directory on its own, as not every directory changes on each commit.
So you split the trees up in the object database, but you left the
index as-is.

This meant that as you were walking the index you found the subtree
"foo" where "foo/" appears, not where "foo" appears.  In my opinion
it is a *BUG* that you did not correct the sorting to remove the
trailing slash when you split the index down into individual tree
objects.

Now we are stuck with it.

And that flat index?  It worked well to get us started, sure, but
the whole cache_tree thing that Junio wedged in was specifically to
address the scability of the index on larger repositories.  We don't
need to recreate (or examine) trees for parts that didn't change.

Wasn't that the original scability argument as to *why* you broke
out the tree objects in the ODB apart?  How was the index exempt
from that?

Basically you cut a corner way back by not also splitting the
index up by subtree when you split the ODB "tree" up by subtree,
and we've been paying for it ever since.

I'm personally annoyed about it because I've fought this problem
several times.  In jgit (my Java Git implementation) I can't do a
simple binary search on a tree's contents, because of the issues
stated above.  In git-fast-import I can't do a simple binary search,
because of the issues stated above.  There I also have to resort a
tree *twice* to generate the delta, because a path name might have
changed type and thus changed positions within the tree.

Last night we found a bug where you can't do a two-head merge with
`read-tree -m` when "A" is a subdirectory in this branch but is a
file in the branch you are switching to.  This is difficult to fix
because within the index we cannot simply do a type comparsion of
"A" and "A" to see if they are the same...  Junio is trying to
patch around it, but it ain't pretty.

With the new pack v4 concept of a 6 byte fixed-width tree record,
we could actually do binary search within a tree, rather than
linear search.  But again, its uh, more difficult than it probably
should be.

We all do things that in hindsight could have been done differently.
I *definately* would rather have the Git we have today, than to
not have one at all.  I'm extremely grateful that you and several
others took time away from Linux kernel hacking to get this project
started, and kept a practical focus throughout, ensuring we would
have a useful tool.

I just get frustrated with it sometimes.  ;-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: tree sorting..
  2007-03-16 23:13 ` tree sorting Shawn O. Pearce
@ 2007-03-16 23:35   ` Linus Torvalds
  0 siblings, 0 replies; 2+ messages in thread
From: Linus Torvalds @ 2007-03-16 23:35 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On Fri, 16 Mar 2007, Shawn O. Pearce wrote:
> 
> The entire reason this happened is hysterical raisins.  Your first
> draft of Git had trees as one flat structure, sorting complete
> paths by their memcmp result.  Which is dead simple and worked.

No. 

It's *not* historical reasons. That's what I tried to explain.

It's absolutely *required*. Not because of the old flat layout, but 
because we want read-tree to translate 1:1 into the index.

And the index is a flat file. And it *needs* to be a flat file, because we 
need to open it and read it efficiently. If it wasn't a flat file, we'd 
have all the index operations taking as long as the tree operations take.

> Now we are stuck with it.

No. How you need to accept it. It's that easy. It's not "stuck". It's 
"correct".

That's why I reacted to your irc thing. You seemed to think (and still 
apparently do) that the sorting is some messy mistake. It's anything but.

So yes, "foo" is *different* from "foo/". You just need to realize that 
they are two totally different entries. They have absolutely no 
relationship what-so-ever with each other, except that you obviously must 
not add one when the other exists.

		Linus

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2007-03-16 23:35 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.0703160941290.3816@woody.linux-foundation.org>
2007-03-16 23:13 ` tree sorting Shawn O. Pearce
2007-03-16 23:35   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox