From: Nicolas Pitre <nico@cam.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>,
Jeff Garzik <jeff@garzik.org>,
Git Mailing List <git@vger.kernel.org>,
mpm@selenic.com, bcrl@kvack.org
Subject: Re: cleaner/better zlib sources?
Date: Fri, 16 Mar 2007 13:06:44 -0400 (EDT) [thread overview]
Message-ID: <alpine.LFD.0.83.0703161236180.5518@xanadu.home> (raw)
In-Reply-To: <Pine.LNX.4.64.0703160913361.3816@woody.linux-foundation.org>
On Fri, 16 Mar 2007, Linus Torvalds wrote:
> The most performance-critical objects for uncompression are commits and
> trees. At least for the kernel, the average size of a tree object is 678
> bytes. And that's ignoring the fact that most of them are then deltified,
> so about 80% of them are likely just a ~60-byte delta.
This is why in pack v4 there will be an alternate tree object
representation which is not deflated at all.
In short we intend to have 3 tables where common things are factored
out:
1) the path component string table (deflated)
2) author/committer string table (deflated)
3) sorted SHA1 table (obviously not deflated)
The sorted SHA1 table will be part of the pack instead of being in the
pack index. The idea is that most SHA1's are already duplicated in the
pack already anyway within commit and tree objects. With a single table
then commit and tree objects can index into that SHA1 table rather than
providing the SHA1 value inline for the objects they refer to.
This means that a tree object record would be only 6 bytes according to
the current design: 2 bytes to index into the path component string
table (which also include the mode information), and 4 bytes to index
into the sorted SHA1 table. And similarly for commit objects.
This means that the pack index will only have a table of offsets
corresponding to the table of sorted SHA1's.
So... walking revisions will become only a matter of picking the first
commit object, using the tree index value (which is not deflated), but
instead of using it in the SHA1 table it could be used in the offset
table to find the location of the corresponding tree object directly.
Same goes for tree entries, or for locating the parent's commit object.
No deflating, no binary searching, no SHA1 comparisons. Plain straight
pointer dereference.
Then, if you want to filter tree walking on path spec, you only need to
locate the path component in the path table once and use the
corresponding index to filter tree entries instead of repeated strcmp().
Same thing if you want to filter commits based on author/committer.
One side effect of this is that you can tell straight away that a path
doesn't exist in the whole pack if one of its components cannot be found
in the table (that works only if no legacy tree representations are
present of course). That should make history walking blazingly fast.
The only thing that gets deflated is the commit message which needs to
be inflated only when displaying it.
And so far that makes for quite smaller packs too!
Nicolas
next prev parent reply other threads:[~2007-03-16 17:07 UTC|newest]
Thread overview: 85+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-03-16 1:04 cleaner/better zlib sources? Linus Torvalds
2007-03-16 1:10 ` Shawn O. Pearce
2007-03-16 1:11 ` Jeff Garzik
2007-03-16 1:14 ` Matt Mackall
2007-03-16 1:46 ` Linus Torvalds
2007-03-16 1:54 ` Linus Torvalds
2007-03-16 2:43 ` Davide Libenzi
2007-03-16 2:56 ` Linus Torvalds
2007-03-16 3:16 ` Davide Libenzi
2007-03-16 16:21 ` Linus Torvalds
2007-03-16 16:24 ` Davide Libenzi
2007-03-16 16:35 ` Linus Torvalds
2007-03-16 19:21 ` Davide Libenzi
2007-03-17 0:01 ` Linus Torvalds
2007-03-17 1:11 ` Linus Torvalds
2007-03-17 3:28 ` Nicolas Pitre
2007-03-17 5:19 ` Shawn O. Pearce
2007-03-17 17:55 ` Linus Torvalds
2007-03-17 19:40 ` Linus Torvalds
2007-03-17 19:42 ` [PATCH 1/2] Make trivial wrapper functions around delta base generation and freeing Linus Torvalds
2007-03-17 19:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-17 21:45 ` Linus Torvalds
2007-03-17 22:37 ` Junio C Hamano
2007-03-17 23:09 ` Linus Torvalds
2007-03-17 23:54 ` Linus Torvalds
2007-03-18 1:13 ` Nicolas Pitre
2007-03-18 7:47 ` Junio C Hamano
2007-03-17 23:12 ` Junio C Hamano
2007-03-17 23:24 ` Linus Torvalds
2007-03-17 23:52 ` Jon Smirl
2007-03-18 1:14 ` Morten Welinder
2007-03-18 1:29 ` Linus Torvalds
2007-03-18 1:38 ` Nicolas Pitre
2007-03-18 1:55 ` Linus Torvalds
2007-03-18 2:03 ` Nicolas Pitre
2007-03-18 2:20 ` Linus Torvalds
2007-03-18 3:00 ` Nicolas Pitre
2007-03-18 3:31 ` Linus Torvalds
2007-03-18 5:30 ` Julian Phillips
2007-03-18 17:23 ` Linus Torvalds
2007-03-18 10:53 ` Robin Rosenberg
2007-03-18 17:34 ` Linus Torvalds
2007-03-18 18:29 ` Robin Rosenberg
2007-03-18 21:25 ` Shawn O. Pearce
2007-03-19 13:16 ` David Brodsky
2007-03-20 6:35 ` Robin Rosenberg
2007-03-20 9:13 ` David Brodsky
2007-03-21 2:37 ` Linus Torvalds
2007-03-21 2:54 ` Nicolas Pitre
2007-03-18 3:06 ` [PATCH 3/2] Avoid unnecessary strlen() calls Linus Torvalds
2007-03-18 9:45 ` Junio C Hamano
2007-03-18 15:54 ` Linus Torvalds
2007-03-18 15:57 ` Linus Torvalds
2007-03-18 21:38 ` Shawn O. Pearce
2007-03-18 21:48 ` Linus Torvalds
2007-03-20 3:05 ` Johannes Schindelin
2007-03-20 3:29 ` Shawn O. Pearce
2007-03-20 3:40 ` Shawn O. Pearce
2007-03-20 4:11 ` Linus Torvalds
2007-03-20 4:18 ` Shawn O. Pearce
2007-03-20 4:45 ` Linus Torvalds
2007-03-20 5:44 ` Junio C Hamano
2007-03-20 3:16 ` Junio C Hamano
2007-03-20 4:31 ` Linus Torvalds
2007-03-20 4:39 ` Shawn O. Pearce
2007-03-20 4:57 ` Linus Torvalds
2007-03-18 1:44 ` [PATCH 2/2] Implement a simple delta_base cache Linus Torvalds
2007-03-18 6:28 ` Avi Kivity
2007-03-17 22:44 ` Linus Torvalds
2007-03-16 16:35 ` cleaner/better zlib sources? Jeff Garzik
2007-03-16 16:42 ` Matt Mackall
2007-03-16 16:51 ` Linus Torvalds
2007-03-16 17:12 ` Nicolas Pitre
2007-03-16 23:22 ` Shawn O. Pearce
2007-03-16 17:06 ` Nicolas Pitre [this message]
2007-03-16 17:51 ` Linus Torvalds
2007-03-16 18:09 ` Nicolas Pitre
2007-03-16 1:33 ` Davide Libenzi
2007-03-16 2:06 ` Davide Libenzi
-- strict thread matches above, loose matches on Subject: below --
2007-03-16 6:08 linux
2007-03-16 11:34 ` Florian Weimer
2007-03-16 15:51 ` Josef Weidendorfer
2007-03-16 16:11 ` Linus Torvalds
2007-03-16 17:39 ` linux
2007-03-16 22:45 ` Josef Weidendorfer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.0.83.0703161236180.5518@xanadu.home \
--to=nico@cam.org \
--cc=bcrl@kvack.org \
--cc=davidel@xmailserver.org \
--cc=git@vger.kernel.org \
--cc=jeff@garzik.org \
--cc=mpm@selenic.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).