From: "Scott Chacon" <schacon@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Junio C Hamano" <gitster@pobox.com>, "git list" <git@vger.kernel.org>
Subject: Re: Git Community Book
Date: Sat, 6 Sep 2008 11:14:22 -0700 [thread overview]
Message-ID: <d411cc4a0809061114k6b9f01b9sdf479360d4cb4c41@mail.gmail.com> (raw)
In-Reply-To: <20080906063325.GD28035@spearce.org>
Thanks a ton for this, I'll incorporate all of this.
On Fri, Sep 5, 2008 at 11:33 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Scott Chacon <schacon@gmail.com> wrote:
>> On Fri, Sep 5, 2008 at 12:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> > "Scott Chacon" <schacon@gmail.com> writes:
>> >
>> >> Also, the last section of the book is on some of the plumbing - mostly
>> >> stuff I've found difficult to pick up with the existing documentation
>> >> while re-implementing stuff in Ruby. I would really appreciate it if
>> >> someone could proofread some of these chapters for errors:
>> >>
>> >> http://book.git-scm.com/7_the_packfile.html
>
> OK, time for me to throw in comments. ;-)
>
> I do like this book, its organized and concise. Thanks for doing it.
>
>
> http://book.git-scm.com/7_how_git_stores_objects.html:
>
> The loose object formatting of
>
> header = "#{type} #{size}#body"
> store = header + content
>
> I can't read Ruby so I'm not sure what the header value computes
> out to here. #body should be a \0. I'm also not sure that the
> prior line setting size = content.length.to_s is very clear for
> the non-Ruby people to understand how a size is formatted.
>
Sorry, the markdown thingy is translating all the '\0's to '#body' for
some freaking reason unless I write it as '\\0'. I'll fix this - it's
difficult for me to find these sometimes. As for the rest of the ruby
stuff, I think I'll add some comments.
> If the code shown here is the Ruby implementation I'm a little
> concerned about it writing directly into the loose object. If the
> write is partial then you have a partial object which is at the
> right name, but is unusable. That can give you corruption that
> is difficult to track down and fix. C Git and JGit both write
> to temporary files then atomically move the temporary file into
> position under its proper name only after it has been fully written.
That is a good idea - I don't do it that way and I certainly will
change the implementation to do so and modify these docs to reflect
that advice.
> "When objects are written to disk, it is often in the loose format,
> since that format is less expensive to access."
>
> I'm not sure that statement is true. Access from packs tends
> to scream compared to access from loose objects. The overheads
> of opening and closing the file descriptors, even on Linux, is
> what kills performance for data access. However Git writes to
> loose objects first and packs later for _safety_ not efficiency.
> Although it is a lot more efficient to write a 2 KB loose object
> and avoid rewriting a 50 MB pack, but its also less likely to fail
> and make you lose your work.
Thanks for the clarification. I write to loose objects first largely
because it's so much easier to do. But also because I don't mmap
objects, so packfile access is not faster for implementations that
can't do that very well. Also, I had originally meant "less expensive
to write", but I can see that is not clear.
> http://book.git-scm.com/7_the_git_index.html:
>
> I wouldn't say that the index stores permissions. More like it
> stores the "class" or "type" of the thing located at that path.
> There are 4 major classes:
>
> - regular file
> - executable file
> - symbolic link
> - git submodule
>
> The 5th class is the subtree, but only appears in trees and not
> in the index since the index file is actually flat.
Interesting. This documentation is actually from the User Manual -
I'll update this chapter first and if it looks better, I'll submit a
patch to the UM, too.
> http://book.git-scm.com/7_the_packfile.html:
>
> You should probably point out that the .idx file uses network byte
> order for the numeric fields like the version number and the file
> offsets.
Will do.
>
> I'd also point out that the offsets in index v1 are unsigned and
> from the start of the pack file. The offsets in index v2 are
> also unsigned, but the 1<<31 is tested in the 32 bit offset to
> see if a 64 bit offset is used. The algorithm there is:
>
> if offset32 & 1<<31:
> offset = ofs64_table[offset32 & ~(1<<31)]
> else
> offset = offset32
>
> Its also rather unclear how the fan out table can be used to limit
> the binary search. What you are missing is describing that fanout[X]
> holds the number of objects whose first byte of their SHA-1 is <= X.
> Hence fanout[0] has the number of objects whose SHA-1 starts with
> "00" and fanout[0x15] has the number of objects whose SHA-1 starts
> with "15", "14", "13", ..., "00". Thus fanout[0xff] has the total
> number of objects in the pack.
>
> In the pack file section I'd also point out the version and entry
> count are unsigned network byte order. This is not clear from the
> Ruby code, although one can guess at it if one knows the git.git
> code very very well (like I do).
>
> "After that, you get a series of packed objects, in order of thier SHAs"
>
> Aside from s/thier/their/ this is not a correct statement _AT ALL_.
>
> The ordering of objects in the packfile is very carefully planned
> by the packer to maximize data locality from most recent -> least
> recent information, making the most recent revisions of a project
> the fastest to access. This has _NOTHING_ to do with their SHA-1
> names.
>
> Technically a pack may store objects in any random order. Heck,
> you can wire up an RNG to the packer to always produce a different
> ordering each time you pack. Practically an implementation shouldn't
> be that stupid and should instead try to order objects by recency,
> like git.git and JGit both do.
>
> "At the end of the packfile is a 20-byte SHA1 sum of all the shas
> (in sorted order) in that packfile."
>
> Also incorrect. The 20-byte checksum at the end of the pack file
> is a checksum of all bytes preceeding the checksum itself. We use
> it as an end-to-end data integrity check, especially on the network
> transport to verify that every bit sent by the one side is received
> correctly on the other side.
>
I'm an idiot. I say this because I actually implemented a bunch of
this stuff (in Ruby) and ran into most of these issues when trying to
implement it. So I knew these things not 3 weeks ago, but I still
wrote it this way. Dur. Thanks for the corrections, I'll update
everything accordingly.
> BTW, can I just say, I love the graphics in this book. They are
> quite well done. Very worthwhile.
Thanks.
>
>
> http://book.git-scm.com/7_transfer_protocols.html:
>
> You might as well explain that the stream returned by upload-pack
> uses the same 4 byte line length framing to form "packets", with
> the 5th byte (really first byte of the payload) indicating the
> "stream":
>
> - stream 1 ('\001') is the PACK data
> - stream 2 ('\002') is progress data/information
> - stream 3 ('\003') is the OH S**T we are aborting, died, dead
>
> You may also want to explain that the way you know the end of the
> pack is to read the header, get the entry count, and then read that
> many objects from the stream, and then verify the pack checksum.
>
> --
> Shawn.
>
Thanks again for all the time it must have taken to review all of this
- I'll make sure it gets into the book, and where appropriate, back
into the UM or other internal git docs.
Scott
next prev parent reply other threads:[~2008-09-06 18:15 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-09-05 19:08 Git Community Book Scott Chacon
2008-09-05 19:15 ` Thomas Adam
2008-09-05 20:45 ` Scott Chacon
2008-09-05 19:41 ` Junio C Hamano
2008-09-05 21:34 ` Scott Chacon
2008-09-05 22:09 ` Felipe Contreras
2008-09-06 6:33 ` Shawn O. Pearce
2008-09-06 18:14 ` Scott Chacon [this message]
2008-09-05 20:27 ` Linus Torvalds
2008-09-06 0:48 ` Stephan Beyer
2008-09-06 18:26 ` Christos Τrochalakis
-- strict thread matches above, loose matches on Subject: below --
2008-07-29 16:20 Scott Chacon
2008-07-29 16:28 ` Miklos Vajna
2008-07-29 17:09 ` Petr Baudis
2008-07-29 18:30 ` Scott Chacon
2008-07-29 18:42 ` Junio C Hamano
2008-07-29 19:00 ` Julian Phillips
2008-07-29 19:09 ` Junio C Hamano
2008-07-29 19:34 ` Scott Chacon
2008-07-29 19:57 ` Junio C Hamano
2008-07-30 21:39 ` J. Bruce Fields
2008-07-29 17:43 ` Junio C Hamano
2008-07-29 18:25 ` Junio C Hamano
2008-07-29 19:29 ` Scott Chacon
2008-07-29 19:24 ` Scott Chacon
2008-07-29 22:34 ` Daniel Barkalow
2008-07-29 22:47 ` Junio C Hamano
2008-07-30 13:20 ` Bart Trojanowski
2008-07-30 18:27 ` Junio C Hamano
2008-07-30 13:31 ` Bart Trojanowski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d411cc4a0809061114k6b9f01b9sdf479360d4cb4c41@mail.gmail.com \
--to=schacon@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).