From: Scott Chacon <schacon@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
Andreas Ericsson <ae@op5.se>, Tony Finch <dot@dotat.at>,
Johannes Sixt <j6t@kdbg.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: Comments pack protocol description in "Git Community Book" (second round)
Date: Sat, 6 Jun 2009 14:58:00 -0700
Message-ID: <d411cc4a0906061458g494d80dbwe3a5358edfd1d49e@mail.gmail.com>
In-Reply-To: <200906062338.02451.jnareb@gmail.com>
Hey,
On Sat, Jun 6, 2009 at 2:38 PM, Jakub Narebski<jnareb@gmail.com> wrote:
> There are the beginnings of a description of the git pack protocol in
> the section "Transfer Protocols"[1][2] of chapter "7. Internals and
> Plumbing" of the "Git Community Book" (http://book.git-scm.com).
>
> [1] http://book.git-scm.com/7_transfer_protocols.html
> [2] http://github.com/schacon/gitbook/blob/master/text/54_Transfer_Protocols/0_Transfer_Protocols.markdown
>
> This is the second round of my comments about this item. I'd like to have
> some more comments about the git pack protocol before trying to come up
> with a formulation which is good enough to send as a patch against the
> source of the mentioned section.
>
I can certainly fix up this chapter with these comments - I understand
the protocol a bit better now than I did when I originally wrote this.
In addition to that, I started taking a shot at putting together
RFC-formatted documentation of this protocol, as was requested. I may
have _way_ missed the mark on what you were looking for originally.
It's hard to say, not having read a lot of RFC documents. I probably
ended up writing in a more bookish format rather than a technical
spec, but whatever - maybe you'll find it helpful or can fix it up to
be more what you were expecting. I'm not done with it - some of it is
still basically unformatted comments from this previous thread, but at
least it's laid out roughly how I thought it might be useful and I
have fleshed out a lot of it. You can find the RFC text output
document here:
http://git-scm.com/gitserver.txt
And the xml doc I generated it from here:
http://github.com/schacon/gitserver-rfc
Perhaps if we're going to spend time getting this all correct, we
should get a standalone technical doc all agreed upon, then I can
relatively easily extract what's needed into that chapter of the
Community book.
Thoughts?
Scott
> The relevant parts of the above source are quoted as if they were an
> email I am replying to.
>
> I have CC-ed everybody who participated in this subthread (originally
> named "Re: Request for detailed documentation of git pack protocol").
>
> ....
>> ### Fetching Data with Upload Pack ###
>>
>> For the smarter protocols, fetching objects is much more efficient. A
>> socket is opened, either over ssh or over port 9418 (in the case of
>> the git:// protocol), and the git-fetch-pack(1) command on the client
>> begins communicating with a forked git-upload-pack(1) process on the
>> server.
>>
>> Then the server will tell the client which SHAs it has for each ref,
>> and the client figures out what it needs and responds with a list of
>> SHAs it wants and already has.
>
> It would probably be clearer here to state explicitly that there
> are two lists, i.e. "a list of SHAs it wants and a list of SHAs it
> already has".
>
>>
>> At this point, the server will generate a packfile with all the
>> objects that the client needs and begin streaming it down to the
>> client.
>
> This is a bit of an oversimplification. In the simplest case, such as a
> client using git-clone to get all objects, it is true that the server
> can generate a packfile and stream it to the client once the client has
> sent its list of wanted SHAs. In the more complicated case, however,
> there can be a series of exchanges between client and server, with the
> client sending sets of commits it has, and the server responding whether
> that is enough (or perhaps that this line of commits is
> uninteresting)... and only then arriving at the list of objects to send
> in a packfile.
>
>>
>> Let's look at an example.
>
> I think that before the example we should have a short description
> (sketch) of the whole exchange; for example the one taken from
> 'Documentation/technical/pack-protocol.txt':
>
> upload-pack (S) | fetch/clone-pack (C) protocol:
>
> # Tell the puller what commits we have and what their names are
> S: SHA1 name
> S: ...
> S: SHA1 name
> S: # flush -- it's your turn
> # Tell the pusher what commits we want, and what we have
> C: want name
> C: ..
> C: want name
> C: have SHA1
> C: have SHA1
> C: ...
> C: # flush -- occasionally ask "had enough?"
> S: NAK
> C: have SHA1
> C: ...
> C: have SHA1
> S: ACK
> C: done
> S: XXXXXXX -- packfile contents.
>
>
>>
>> The client connects and sends the request header. The clone command
>>
>> $ git clone git://myserver.com/project.git
>>
>> produces the following request:
>>
>> 0033git-upload-pack /project.git\\000host=myserver.com\\000
>
> Although fetching via the SSH protocol is, I guess, much rarer than
> fetching via the anonymous, unauthenticated git:// protocol, it _might_
> be a good idea to say here how fetching via SSH differs from the above
> sequence: instead of opening a TCP connection to port 9418, sending the
> above packet, and later reading from and writing to the socket,
> "git clone ssh://myserver.com/srv/git/project.git" calls
>
> ssh myserver.com git-upload-pack /srv/git/project.git
>
> and later reads from the standard output of that command, and writes
> to its standard input.
>
> The rest of the exchange is _identical_ for git:// and for ssh:// (and
> I guess also for the file:// pseudoprotocol).
>
>>
>> The first four bytes contain the hex length of the line (including 4
>> byte line length and trailing newline if present). Following are the
>> command and arguments. This is followed by a null byte and then the
>> host information. The request is terminated by a null byte.
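To make the layout of that first request concrete, here is a minimal
sketch in Python (not taken from git itself). The function names are
mine, and I use the hypothetical host example.com rather than
myserver.com; with that host the payload is 46 bytes, so the length
prefix comes out as 0032.

```python
def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as 4 hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def git_proto_request(path: str, host: str) -> bytes:
    """Build the initial git:// request: command, path, NUL, host info, NUL."""
    payload = (b"git-upload-pack " + path.encode()
               + b"\0host=" + host.encode() + b"\0")
    return pkt_line(payload)


# Hypothetical repository path and host, for illustration only.
request = git_proto_request("/project.git", "example.com")
print(request)
```

The 4 bytes of the length prefix count toward the length, which is why
the helper adds 4 before formatting.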
>
> I think it would be better to describe the packet (chunk) format, called
> pkt-line in git, separately from the contents of the above packet:
> either first pkt-line then the command, or first the command then
> pkt-line. Otherwise we are left describing the pkt-line format
> many times, as is done in the current version of this chapter.
>
>
> In git, the client communicates with the server using a packetized
> stream, where each line (packet, chunk) is preceded by its length
> (including the header) as a 4-byte hex number. A length of 'zero', i.e.
> the packet "0000", has a special meaning: end of stream / flush
> connection. The "# flush ..." in the description of the client--server
> exchange above is done using exactly this "0000" packet.
>
> Footnote: this format somewhat resembles the 'chunked' transfer
> encoding used in HTTP[1], although there are differences.
> [1] http://en.wikipedia.org/wiki/Chunked_transfer_encoding
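As a rough illustration of the pkt-line framing described above, here
is a small Python sketch of writing and reading such packets. The names
encode_pkt_line, read_pkt_line and FLUSH are made up for this example
and are not git's own; returning None for a flush packet is just one
way to model it.

```python
import io

FLUSH = b"0000"  # zero-length packet: flush / end of stream


def encode_pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line; the 4-byte header counts itself."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_line(stream):
    """Read one pkt-line; return its payload, or None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside a pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None
    if length < 4:
        raise ValueError("invalid pkt-line length: %r" % header)
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("stream ended inside a pkt-line payload")
    return payload


print(encode_pkt_line(b"done\n"))      # b'0009done\n'

stream = io.BytesIO(encode_pkt_line(b"NAK\n") + FLUSH)
print(read_pkt_line(stream))           # b'NAK\n'
print(read_pkt_line(stream))           # None (flush)
```

Unlike HTTP chunked encoding, the length here includes the header
itself, so the smallest packet that carries data is "0005".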
>
>>
>> The request is processed and turned into a call to git-upload-pack:
>>
>> $ git-upload-pack /path/to/repos/project.git
>
> This is an alternative place where we could describe fetching via ssh://
>
> We should probably say where the /path/to/repos prefix in front of
> /project.git comes from: it is the --base-path=/path/to/repos
> argument to git-daemon (a sort of "GIT root").
>
> BTW. (this is just a very minor nit) shouldn't we use an FHS-compliant
> path, i.e. "/srv/git" instead of "/path/to/repos" (and follow the RFCs
> in using "example.com" in place of "myserver.com")?
>
>>
>> This immediately returns information of the repo:
>>
>> 008874730d410fcb6603ace96f1dc55ea6196122532d HEAD\\000multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag\\n
>> 003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\\n
>> 003d5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/dist\\n
>> 003e7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 refs/heads/local\\n
>> 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\\n
>> 0000
>
> I have added explicit LF terminators in the form of "\\n" (which would
> render as "\n"), mainly because the "0000" flush packet _doesn't_ have
> one. I have also added "include-tag", as modern git installations
> provide this capability.
>
> Here is a dilemma: currently the example output is provided almost
> exactly as-is, only indented and with some quoting/escaping (\\000 or
> \\0 for the NUL character, \\n for LF, later \\001 and \\002 for the
> 0x01 and 0x02 bytes). To know whether a given example is what the client
> sends or what the server outputs, you have to read the narrative. An
> alternative would be to use "C: " and "S: " prefixes (perhaps with some
> extra formatting to make it clear that they are not part of the data),
> as used in the pack-protocol.txt technical documentation, and as
> proposed for describing network protocols by some RFC (I don't remember
> which, unfortunately). Which one to choose?
>
>
> At some point we will want to explain that the first line of the first
> response from the server carries, 'stuffed' behind a "\0" (NUL), a
> space-separated list of the capabilities the server supports. Those
> capabilities would have to be described somewhere: as a sidebar,
> or in a separate subsection, or in an appendix.
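A sketch of how a client might pull the capability list out of that
first line, in Python. This only illustrates the "stuffed behind NUL"
layout; parse_first_ref_line is a name I made up, and the input is the
payload of the first advertised ref from the example, with the 4-byte
length prefix already stripped.

```python
def parse_first_ref_line(payload: bytes):
    """Split b'<sha> <refname>\\0<cap> <cap> ...\\n' into its three parts."""
    ref_part, _, cap_part = payload.rstrip(b"\n").partition(b"\0")
    sha, _, name = ref_part.partition(b" ")
    # Capabilities are separated by single spaces after the NUL byte.
    caps = cap_part.decode().split()
    return sha.decode(), name.decode(), caps


line = (b"74730d410fcb6603ace96f1dc55ea6196122532d HEAD\0"
        b"multi_ack thin-pack side-band side-band-64k ofs-delta "
        b"shallow no-progress include-tag\n")

sha, name, caps = parse_first_ref_line(line)
print(name)                      # HEAD
print("side-band-64k" in caps)   # True
```

Later ref lines have no NUL, so the same function would simply return
an empty capability list for them.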
>
> Below is (for completeness) a list of git-upload-pack
> capabilities, with a short description of each:
>
> * multi_ack (for historical reasons not multi-ack)
>
> It allows the server to return "ACK $SHA1 continue" as soon as it
> finds a commit that it can use as a common base, between the
> client's wants and the client's have set.
>
> By sending this early, the server can potentially head off the
> client from walking any further down that particular branch of the
> client's repository history.
>
> See the thread for more details (posts by Shawn O. Pearce and by
> Junio C Hamano).
>
> * thin-pack
>
> Server can send thin packs, i.e. packs which do not contain base
> elements for some delta chains, if those base elements are
> available on the client side. A client has the thin-pack capability
> when it understands how to "thicken" them by adding the required delta
> bases, making those packfiles self-contained.
>
> Of course it doesn't make sense for a client to use (request) this
> capability for git-clone... But if the client does request it (and
> I think modern clients actually do request it, even in the initial
> clone case) the server won't produce a thin pack. Why? There is no
> common base, so there is no uninteresting set to omit from the
> pack. :-)
>
> * side-band
> * side-band-64k
>
> This means that the server can send, and the client understands,
> multiplexed (muxed) progress reports and error info interleaved with
> the packfile itself.
>
> These two options are mutually exclusive. A client should ask for
> only one of them, and a modern client always favors side-band-64k.
> If the client asks for both, the server uses side-band-64k.
>
> The older side-band allows only up to 1000 bytes per packet.
>
> * ofs-delta
>
> Server can send, and client understands, PACKv2 with deltas referring
> to their base by position in the pack rather than by SHA-1. Both can
> send/read OBJ_OFS_DELTA, aka type 6 in a pack file.
>
> * shallow
>
> Server can send shallow clone (git clone --depth ...).
>
> * no-progress
>
> Client should use it if it was started with "git clone -q" or
> something similar, and doesn't want sideband 2 (progress). We still
> want sideband 1 with the actual data (packfile), and sideband 3 with
> error messages.
>
> * include-tag
>
> If we pack an object to the client, and a tag points exactly at
> that object, we pack the tag too. In general this allows a client
> to get all new tags when it fetches a branch, in a single network
> connection, instead of two (separate connection for tags).
>
> This capability is not to be used when client was called with
> '--no-tags'.
>
>>
>> Each line starts with a four byte line length declaration in hex. The
>> section is terminated by a line length declaration of 0000.
>
> This repetition would not be necessary if pkt-line format had its own
> description somewhere before. We would probably still want to remind
> the reader that "0000" line length declaration means 'flush'.
>
>>
>> This is sent back to the client verbatim.
>
> Hmmm... "sent back ... verbatim"? I wonder what you wanted to say
> here...
>
>> The client responds with another request:
>>
>> 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack side-band-64k ofs-delta\\n
>> 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\\n
>> 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\\n
>> 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\\n
>> 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\\n
>> 0000
>> 0009done\\n
>
> Here again I added explicit LF terminators, and split off the "0000"
> flush packet onto a separate line, to make this request (well, two
> requests) clearer.
>
> The first line of this request contains the capabilities the client
> wants to use. It should be some subset of the capabilities the server
> supports.
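Here is a hedged Python sketch of how a cloning client could assemble
that request: capabilities go on the first "want" line only, and only
those the server advertised are kept. build_want_request, the
preference list, and "some-future-cap" are all hypothetical names for
illustration, not anything from git's source.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line (length prefix counts itself)."""
    return b"%04x" % (len(payload) + 4) + payload


def build_want_request(wants, server_caps, preferred_caps):
    """Build want lines, a flush, and 'done' (clone case: no 'have' lines)."""
    # Only request capabilities the server actually advertised.
    caps = [c for c in preferred_caps if c in server_caps]
    out = []
    for i, sha in enumerate(wants):
        line = "want " + sha
        if i == 0 and caps:
            line += " " + " ".join(caps)
        out.append(pkt_line(line.encode() + b"\n"))
    out.append(b"0000")            # flush
    out.append(pkt_line(b"done\n"))
    return b"".join(out)


server_caps = ["multi_ack", "thin-pack", "side-band", "side-band-64k",
               "ofs-delta", "shallow", "no-progress", "include-tag"]
# "some-future-cap" is a made-up name, dropped because it is not advertised.
preferred = ["multi_ack", "side-band-64k", "ofs-delta", "some-future-cap"]
wants = ["74730d410fcb6603ace96f1dc55ea6196122532d",
         "7d1665144a3a975c05f1f43902ddaf084e784dbe"]

request = build_want_request(wants, server_caps, preferred)
print(request[:4])   # b'0054'
```

Listing side-band-64k but not side-band in the preference list matches
the point above that a client should ask for only one of the two.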
>
>>
>> The is sent to the open git-upload-pack process which then streams out
>> the final response:
>
> "_The_ is send"?
>
> I would remove quotes around lines of server response below, but would
> leave explicit \n for LF, and \\001 and \\002 for bytes 0x01 and 0x02
> denoting channel.
>
>>
>> "0008NAK\n"
>
> This NAK means that the server did not find a [closed] set of common
> ancestors. It is the response to the "0000" flush line ("had enough?"
> line) from the client. As the example is about git-clone, and the
> client doesn't _have_ any commits to show the server as candidates for
> common ancestors (calculation), the client sends "done" to get the pack.
>
>> "0023\\002Counting objects: 2797, done.\n"
>
> This is a somewhat atypical example: for larger repositories like the
> Linux kernel or even the git repository, you would usually have many
> more objects, and object enumeration would actually take more time. You
> would see many
>
> "0020\\002Counting objects: 10662 \r"
> "0020\\002Counting objects: 22318 \r"
> "0020\\002Counting objects: 29506 \r"
>
> packets before
>
> "0023\\002Counting objects: 65058, done.\n"
>
>> "002b\\002Compressing objects: 0% (1/1177) \r"
>> "002c\\002Compressing objects: 1% (12/1177) \r"
>> "002c\\002Compressing objects: 2% (24/1177) \r"
>> "002c\\002Compressing objects: 3% (36/1177) \r"
>> "002c\\002Compressing objects: 4% (48/1177) \r"
>> "002c\\002Compressing objects: 5% (59/1177) \r"
>> "002c\\002Compressing objects: 6% (71/1177) \r"
>> "0053\\002Compressing objects: 7% (83/1177) \rCompressing objects: 8% (95/1177) \r"
>> ...
>> "005b\\002Compressing objects: 100% (1177/1177) \rCompressing objects: 100% (1177/1177), done.\n"
>
> Sidenote: the reason why there is sometimes more than one line sent in
> a single packet / single pkt-line is buffering between git-pack-objects,
> which writes those messages to a pipe, and git-upload-pack, which reads
> them and sends them to the client. If pack-objects can write two
> messages into the pipe buffer before upload-pack is woken to read them
> out, upload-pack might find two (or more) messages ready to read without
> blocking. These get bundled into a single packet, because, why not,
> it's easier to code it that way.
>
> Here or a little later we should probably explain (even though it is
> fairly obvious) that the final response from the server is (here) in
> pkt-line with sideband format, where the first byte of data denotes the
> channel (stream) number: 1 for data, 2 for progress info, 3 for fatal
> errors.
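To show what that demultiplexing amounts to, here is a minimal Python
sketch. It assumes the pkt-line length prefixes have already been
stripped, so each item is one packet payload whose first byte is the
channel number; demux_sideband is a name I invented for this example.

```python
def demux_sideband(payloads):
    """Split side-band payloads into (pack data, list of progress messages)."""
    pack = bytearray()
    progress = []
    for payload in payloads:
        if not payload:
            raise ValueError("empty side-band packet")
        channel, data = payload[0], payload[1:]
        if channel == 1:        # packfile data
            pack += data
        elif channel == 2:      # progress info, normally shown to the user
            progress.append(data.decode(errors="replace"))
        elif channel == 3:      # fatal error reported by the server
            raise RuntimeError("remote error: " + data.decode(errors="replace"))
        else:
            raise ValueError("unknown side-band channel %d" % channel)
    return bytes(pack), progress


payloads = [
    b"\x02Counting objects: 2797, done.\n",
    b"\x01PACK\x00\x00\x00\x02\x00\x00\x0a\xed",
]
pack, progress = demux_sideband(payloads)
print(progress)    # ['Counting objects: 2797, done.\n']
print(pack[:4])    # b'PACK'
```

A progress payload may hold several "\r"-terminated messages at once,
as in the bundled packets above, so a real client would just write the
bytes to the terminal rather than split them.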
>
>> "2004\\001PACK\\000\\000\\000\\002\\000\\000\n\\355\\225\\017x\\234\\235\\216K\n\\302"...
>> "2005\\001\\360\\204{\\225\\376\\330\\345]z\226\273"...
>
> Here I think it would be enough to show only the fragment which is
> packfile signature...
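For reference, the signature fragment decodes as follows. This is a
small Python sketch using struct; parse_pack_header is my own name. The
12 bytes are the ones from the example above ("\n" is 0x0a and "\355"
is 0xed), read as a 4-byte "PACK" signature, a 4-byte version and a
4-byte object count, both in network byte order.

```python
import struct


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile: bad signature %r" % signature)
    return version, count


header = b"PACK\x00\x00\x00\x02\x00\x00\x0a\xed"
print(parse_pack_header(header))   # (2, 2797)
```

The object count of 2797 agrees with the "Counting objects: 2797" and
"Total 2797" progress lines in the same response.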
>
>> ...
>> "0037\\002Total 2797 (delta 1799), reused 2360 (delta 1529)\n"
>> ...
>> "<\\276\\255L\\273s\\005\\001w0006\\001[0000"
>
> I think this line is broken in the wrong place. It is the tail
> end of some packet (each packet begins with a 4-character, 0-padded
> length of the chunk as a hex number; "<\\276\\255L" does not match
> 4HEXDIG), followed by a "0000" 'flush' packet (here it signals end of
> stream).
>
>>
>> See the Packfile chapter previously for the actual format of the
>> packfile data in the response.
>>
>>
> ....
> --
> Jakub Narebski
> Poland
>