Why is the name of a blob SHA1("$type $size\0$data") and not SHA1("$data")?

* Why is the name of a blob SHA1("$type $size\0$data") and not SHA1("$data")?
From: David Srbecky @ 2009-04-30 19:55 UTC
  To: git

Hi,

First of all, congratulations on making such a great version control 
system.  I love the storage model - in comparison with other systems, it 
is just brilliantly simple and ingenious.

I started digging into the details and there is one thing that is really 
bugging me - why is the name of a blob SHA1("$type $size\0$data") and 
not SHA1("$data")?  I mean, wouldn't it be beautiful if the name of the 
blob would really just be the SHA1 of the uncompressed file content? :-)

Furthermore, is the header really necessary?  Wouldn't it be 
equally effective to put the blobs into their own subdirectory? For 
example:  .git\objects\blob\22\22a3d28c5b2fca0eae83be1a2ed619e357f6a1e6
So the blob would contain just the compressed content and nothing 
else - beautiful :-)

I would really appreciate some comments on the design decisions so that 
I can sleep well at night :-)

David

* Re: Why is the name of a blob SHA1("$type $size\0$data") and not SHA1("$data")?
From: Shawn O. Pearce @ 2009-04-30 20:02 UTC
  To: David Srbecky; +Cc: git

David Srbecky <dsrbecky@gmail.com> wrote:
>
> I started digging into the details and there is one thing that is really  
> bugging me - why is the name of a blob SHA1("$type $size\0$data") and  
> not SHA1("$data")?  I mean, wouldn't it be beautiful if the name of the  
> blob would really just be the SHA1 of the uncompressed file content? :-)

Well, a commit is stored in the same namespace as a blob (file
content).  So the type being included in the SHA1 computation helps
to break them apart and say "this is really a commit" vs. "this
is a file that just happens to have the same content as a commit".
It does help consistency checkers like `git fsck` to know that the
object is used in the right context.
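
To make the point above concrete, here is a minimal Python sketch of
the naming scheme from the subject line.  It is not Git's own code, and
the helper name git_object_name is made up for this illustration.  The
payload is arbitrary bytes (not a valid commit); the name computation
only looks at the type string, the size, and the raw bytes.

```python
import hashlib

def git_object_name(obj_type: str, data: bytes) -> str:
    """Hash "<type> <size>", a NUL byte, then the data (hypothetical helper)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

payload = b"hello world\n"

# The same bytes get three different names depending on what is hashed.
as_blob = git_object_name("blob", payload)
as_commit = git_object_name("commit", payload)
bare = hashlib.sha1(payload).hexdigest()

print("blob  :", as_blob)
print("commit:", as_commit)
print("bare  :", bare)
```

The "blob" line should agree with what `git hash-object` reports for a
file holding the same bytes, while the "commit" name differs, so the
two can never be confused inside the shared namespace.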

I can't guess what Linus had in mind when he wrote Git, but I would
wager it was something along the lines that storing everything in
a single directory structure was simpler/more elegant than having
a different directory structure per object type.  Today I would
probably have made the same design decision, but I'm biased by
Git already so who knows if I'm just mimicking Linus' brilliance or
would have arrived at the same result myself.

Including the length is overkill, yes, but it's in the header of the
data so that git can immediately allocate a properly sized memory
buffer before it inflates the rest of the object content.  It's a
performance improvement.  It's probably a historical accident that
it got included in the SHA1 computation; notice its position
between the type and the data... it likely was just easier to
include it in the SHA1 than to exclude it.
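
The buffer-sizing idea can be sketched in Python roughly as follows.
This is only an illustration under stated assumptions, not how Git
itself is implemented: the function names are invented, and the 32-byte
prefix is an arbitrary guess that is large enough to hold the
"<type> <size>" header for ordinary objects.

```python
import zlib

def encode_loose_object(obj_type: str, data: bytes) -> bytes:
    """Build the compressed form: header, NUL, data (hypothetical helper)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return zlib.compress(header + data)

def read_loose_object(compressed: bytes):
    """Inflate a small prefix, read the size, then fill a presized buffer."""
    inflater = zlib.decompressobj()
    # Ask for at most 32 bytes of output; assumed to cover the header.
    head = inflater.decompress(compressed, 32)
    nul = head.index(b"\0")
    obj_type, size_text = head[:nul].decode("ascii").split(" ")
    size = int(size_text)

    # The size is known before the bulk of the data has been inflated.
    buf = bytearray(size)
    first = head[nul + 1:]
    buf[:len(first)] = first

    rest = b""
    if not inflater.eof:
        rest = inflater.decompress(inflater.unconsumed_tail) + inflater.flush()
    buf[len(first):] = rest

    if len(buf) != size:
        raise ValueError("size in header does not match object content")
    return obj_type, bytes(buf)

obj_type, content = read_loose_object(encode_loose_object("blob", b"hello world\n"))
print(obj_type, len(content))
```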

> I would really appreciate some comments on the design decisions so that  
> I can sleep well at night :-)

Then I won't mention pack files... which aren't as simple to read
as just inflating a file on disk.  :-)

-- 
Shawn.

* Re: Why is the name of a blob SHA1("$type $size\0$data") and not SHA1("$data")?
From: Björn Steinbrink @ 2009-04-30 22:57 UTC
  To: David Srbecky; +Cc: git

On 2009.04.30 20:55:00 +0100, David Srbecky wrote:
> Hi,
>
>
> First of all, congratulations on making such a great version control  
> system.  I love the storage model - in comparison with other systems, it  
> is just brilliantly simple and ingenious.
>
>
> I started digging into the details and there is one thing that is really  
> bugging me - why is the name of a blob SHA1("$type $size\0$data") and  
> not SHA1("$data")?  I mean, wouldn't it be beautiful if the name of the  
> blob would really just be the SHA1 of the uncompressed file content? :-)
>
>
> Furthermore, is the header really necessary?  Wouldn't it be  
> equally effective to put the blobs into their own subdirectory? For  
> example:  .git\objects\blob\22\22a3d28c5b2fca0eae83be1a2ed619e357f6a1e6
> So the blob would contain just the compressed content and nothing  
> else - beautiful :-)

Yes, at least the type is pretty important. Consider just "git show
$some_object_name". If the object name was just the hash of the
contents, you could have a blob and a commit with the same name. Which
is which? And which do you mean in that command? The command line
interface would need to accept a type in addition to the object name in
a lot of places.
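
A toy model of this, as a hedged Python sketch: the dictionary `store`
and the functions `put` and `show` are invented stand-ins for the
object database and a "git show"-like lookup, not real Git interfaces.
Because the type is part of the stored header, a lookup by name alone
is enough to find out what kind of object comes back.

```python
import hashlib
import zlib

# Hypothetical in-memory stand-in for the object database: name -> bytes.
store = {}

def put(obj_type: str, data: bytes) -> str:
    """Store an object under the hash of header + data and return its name."""
    raw = f"{obj_type} {len(data)}".encode("ascii") + b"\0" + data
    name = hashlib.sha1(raw).hexdigest()
    store[name] = zlib.compress(raw)
    return name

def show(name: str):
    """Look up by name only; the type is recovered from the stored header."""
    raw = zlib.decompress(store[name])
    header, _, data = raw.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, data

# Identical bytes stored once as a blob and once as a (fake) commit.
same_bytes = b"identical content\n"
blob_name = put("blob", same_bytes)
commit_name = put("commit", same_bytes)

print(show(blob_name)[0], show(commit_name)[0])
```

With names derived from the content alone, both calls to `put` would
land on the same key and the caller would have to say which type was
meant.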

And in packs, you want the objects ordered so that you get good access
patterns, and don't read from all over the pack file. That means that
you would need the type header there, regardless of whether it is in the
loose object file.
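
For reference, each entry in a pack file does start with a small header
that carries the type and the inflated size.  The Python sketch below
decodes that header as I understand it from Git's pack format
documentation (3 type bits and 4 size bits in the first byte, then 7
size bits per continuation byte).  The function name is made up, and
the rest of the pack (signature, deltas, checksums) is left out.

```python
# Type codes used in pack entry headers.
PACK_TYPES = {
    1: "commit",
    2: "tree",
    3: "blob",
    4: "tag",
    6: "ofs_delta",
    7: "ref_delta",
}

def decode_pack_entry_header(buf: bytes, offset: int = 0):
    """Return (type name, inflated size, offset just past the header)."""
    byte = buf[offset]
    offset += 1
    type_code = (byte >> 4) & 0x07   # bits 4..6 hold the type
    size = byte & 0x0F               # low 4 bits start the size
    shift = 4
    while byte & 0x80:               # high bit set: another size byte follows
        byte = buf[offset]
        offset += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return PACK_TYPES[type_code], size, offset

# A blob of 12 bytes fits in one header byte; a 300-byte commit needs two.
print(decode_pack_entry_header(b"\x3c"))
print(decode_pack_entry_header(b"\x9c\x12"))
```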

Björn
