Re: Git Community Book - Scott Chacon

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Scott Chacon" <schacon@gmail.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: "Junio C Hamano" <gitster@pobox.com>, "git list" <git@vger.kernel.org>
Subject: Re: Git Community Book
Date: Sat, 6 Sep 2008 11:14:22 -0700	[thread overview]
Message-ID: <d411cc4a0809061114k6b9f01b9sdf479360d4cb4c41@mail.gmail.com> (raw)
In-Reply-To: <20080906063325.GD28035@spearce.org>

Thanks a ton for this, I'll incorporate all of this.

On Fri, Sep 5, 2008 at 11:33 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Scott Chacon <schacon@gmail.com> wrote:
>> On Fri, Sep 5, 2008 at 12:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> > "Scott Chacon" <schacon@gmail.com> writes:
>> >
>> >> Also, the last section of the book is on some of the plumbing - mostly
>> >> stuff I've found difficult to pick up with the existing documentation
>> >> while re-implementing stuff in Ruby.  I would really appreciate it if
>> >> someone could proofread some of these chapters for errors:
>> >>
>> >> http://book.git-scm.com/7_the_packfile.html
>
> OK, time for me to throw in comments.  ;-)
>
> I do like this book, its organized and concise.  Thanks for doing it.
>
>
> http://book.git-scm.com/7_how_git_stores_objects.html:
>
> The loose object formatting of
>
>  header = "#{type} #{size}#body"
>  store = header + content
>
> I can't read Ruby so I'm not sure what the header value computes
> out to here.  #body should be a \0.  I'm also not sure that the
> prior line setting size = content.length.to_s is very clear for
> the non-Ruby people to understand how a size is formatted.
>

Sorry, the markdown thingy is translating all the '\0's to '#body' for
some freaking reason unless I write it as '\\0'.  I'll fix this - it's
difficult for me to find these sometimes.  As for the rest of the ruby
stuff, I think I'll add some comments.

> If the code shown here is the Ruby implementation I'm a little
> concerned about it writing directly into the loose object.  If the
> write is partial then you have a partial object which is at the
> right name, but is unusable.  That can give you corruption that
> is difficult to track down and fix.  C Git and JGit both write
> to temporary files then atomically move the temporary file into
> position under its proper name only after it has been fully written.

That is a good idea - I don't do it that way and I certainly will
change the implementation to do so and modify these docs to reflect
that advice.

> "When objects are written to disk, it is often in the loose format,
> since that format is less expensive to access."
>
> I'm not sure that statement is true.  Access from packs tends
> to scream compared to access from loose objects.  The overheads
> of opening and closing the file descriptors, even on Linux, is
> what kills performance for data access.  However Git writes to
> loose objects first and packs later for _safety_ not efficiency.
> Although it is a lot more efficient to write a 2 KB loose object
> and avoid rewriting a 50 MB pack, but its also less likely to fail
> and make you lose your work.

Thanks for the clarification.  I write to loose objects first largely
because it's so much easier to do.  But also because I don't mmap
objects, so packfile access is not faster for implementations that
can't do that very well.  Also, I had originally meant "less expensive
to write", but I can see that is not clear.


> http://book.git-scm.com/7_the_git_index.html:
>
> I wouldn't say that the index stores permissions.  More like it
> stores the "class" or "type" of the thing located at that path.
> There are 4 major classes:
>
>        - regular file
>        - executable file
>        - symbolic link
>        - git submodule
>
> The 5th class is the subtree, but only appears in trees and not
> in the index since the index file is actually flat.

Interesting.  This documentation is actually from the User Manual -
I'll update this chapter first and if it looks better, I'll submit a
patch to the UM, too.

> http://book.git-scm.com/7_the_packfile.html:
>
> You should probably point out that the .idx file uses network byte
> order for the numeric fields like the version number and the file
> offsets.

Will do.

>
> I'd also point out that the offsets in index v1 are unsigned and
> from the start of the pack file.  The offsets in index v2 are
> also unsigned, but the 1<<31 is tested in the 32 bit offset to
> see if a 64 bit offset is used.  The algorithm there is:
>
>        if offset32 & 1<<31:
>                offset = ofs64_table[offset32 & ~(1<<31)]
>        else
>                offset = offset32
>
> Its also rather unclear how the fan out table can be used to limit
> the binary search.  What you are missing is describing that fanout[X]
> holds the number of objects whose first byte of their SHA-1 is <= X.
> Hence fanout[0] has the number of objects whose SHA-1 starts with
> "00" and fanout[0x15] has the number of objects whose SHA-1 starts
> with "15", "14", "13", ..., "00".  Thus fanout[0xff] has the total
> number of objects in the pack.
>
> In the pack file section I'd also point out the version and entry
> count are unsigned network byte order.  This is not clear from the
> Ruby code, although one can guess at it if one knows the git.git
> code very very well (like I do).
>
> "After that, you get a series of packed objects, in order of thier SHAs"
>
> Aside from s/thier/their/ this is not a correct statement _AT ALL_.
>
> The ordering of objects in the packfile is very carefully planned
> by the packer to maximize data locality from most recent -> least
> recent information, making the most recent revisions of a project
> the fastest to access.  This has _NOTHING_ to do with their SHA-1
> names.
>
> Technically a pack may store objects in any random order.  Heck,
> you can wire up an RNG to the packer to always produce a different
> ordering each time you pack.  Practically an implementation shouldn't
> be that stupid and should instead try to order objects by recency,
> like git.git and JGit both do.
>
> "At the end of the packfile is a 20-byte SHA1 sum of all the shas
> (in sorted order) in that packfile."
>
> Also incorrect.  The 20-byte checksum at the end of the pack file
> is a checksum of all bytes preceeding the checksum itself.  We use
> it as an end-to-end data integrity check, especially on the network
> transport to verify that every bit sent by the one side is received
> correctly on the other side.
>

I'm an idiot.  I say this because I actually implemented a bunch of
this stuff (in Ruby) and ran into most of these issues when trying to
implement it.  So I knew these things not 3 weeks ago, but I still
wrote it this way.  Dur.  Thanks for the corrections, I'll update
everything accordingly.

> BTW, can I just say, I love the graphics in this book.  They are
> quite well done.  Very worthwhile.

Thanks.

>
>
> http://book.git-scm.com/7_transfer_protocols.html:
>
> You might as well explain that the stream returned by upload-pack
> uses the same 4 byte line length framing to form "packets", with
> the 5th byte (really first byte of the payload) indicating the
> "stream":
>
>        - stream 1 ('\001') is the PACK data
>        - stream 2 ('\002') is progress data/information
>        - stream 3 ('\003') is the OH S**T we are aborting, died, dead
>
> You may also want to explain that the way you know the end of the
> pack is to read the header, get the entry count, and then read that
> many objects from the stream, and then verify the pack checksum.
>
> --
> Shawn.
>

Thanks again for all the time it must have taken to review all of this
- I'll make sure it gets into the book, and where appropriate, back
into the UM or other internal git docs.

Scott

next prev parent reply	other threads:[~2008-09-06 18:15 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-09-05 19:08 Git Community Book Scott Chacon
2008-09-05 19:15 ` Thomas Adam
2008-09-05 20:45   ` Scott Chacon
2008-09-05 19:41 ` Junio C Hamano
2008-09-05 21:34   ` Scott Chacon
2008-09-05 22:09     ` Felipe Contreras
2008-09-06  6:33     ` Shawn O. Pearce
2008-09-06 18:14       ` Scott Chacon [this message]
2008-09-05 20:27 ` Linus Torvalds
2008-09-06  0:48 ` Stephan Beyer
2008-09-06 18:26 ` Christos Τrochalakis
  -- strict thread matches above, loose matches on Subject: below --
2008-07-29 16:20 Scott Chacon
2008-07-29 16:28 ` Miklos Vajna
2008-07-29 17:09 ` Petr Baudis
2008-07-29 18:30   ` Scott Chacon
2008-07-29 18:42     ` Junio C Hamano
2008-07-29 19:00       ` Julian Phillips
2008-07-29 19:09         ` Junio C Hamano
2008-07-29 19:34       ` Scott Chacon
2008-07-29 19:57         ` Junio C Hamano
2008-07-30 21:39     ` J. Bruce Fields
2008-07-29 17:43 ` Junio C Hamano
2008-07-29 18:25   ` Junio C Hamano
2008-07-29 19:29     ` Scott Chacon
2008-07-29 19:24   ` Scott Chacon
2008-07-29 22:34 ` Daniel Barkalow
2008-07-29 22:47   ` Junio C Hamano
2008-07-30 13:20     ` Bart Trojanowski
2008-07-30 18:27       ` Junio C Hamano
2008-07-30 13:31     ` Bart Trojanowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d411cc4a0809061114k6b9f01b9sdf479360d4cb4c41@mail.gmail.com \
    --to=schacon@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).