From: "Shawn O. Pearce" <spearce@spearce.org>
To: git@vger.kernel.org
Cc: Nicolas Pitre <nico@cam.org>
Subject: pack v4 status
Date: Tue, 27 Feb 2007 10:50:42 -0500
Message-ID: <20070227155042.GB3230@spearce.org>
Nico's and my packv4 topic is available from my fastimport.git fork
on repo.or.cz:
gitweb: http://repo.or.cz/w/git/fastimport.git
git: git://repo.or.cz/git/fastimport.git
branch: sp/pack4
We have thus far reformatted OBJ_TREEs with a new dictionary based
compression scheme. In this scheme we pool the filenames and modes
that appear within trees into a single table within the packfile.
All trees are then converted to use a 22 byte record format:
- 2 byte network byte order index into the string pool
- 20 byte SHA-1
These trees are then stored *uncompressed* within the packfile,
but are also still stored using our standard delta system (only the
deltas for these trees are also stored uncompressed). The resulting
savings is pretty good; on linux-2.6.git we are saving ~3.8 MiB as
a result of this encoding alone:
141649022 pack2-linuxA.git
137625761 pack4-linuxB.git
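
For readers who want something concrete, here is a minimal sketch of the 22 byte tree record described above (2 byte network byte order pool index, then the 20 byte SHA-1). This is not the code from the sp/pack4 branch. The names RECORD, encode_tree and decode_tree are hypothetical, and the assumption that one pool slot holds a (mode, filename) pair is mine; the message only says that names and modes are pooled into one table.

```python
import struct

# One tree record: 2 byte big-endian (network order) index into the
# string pool, followed by the raw 20 byte SHA-1. No padding with ">".
RECORD = struct.Struct(">H20s")


def encode_tree(entries, pool_index):
    """Encode a tree as a sequence of 22 byte records.

    entries    -- list of ((mode, name), sha1) with sha1 as 20 raw bytes
    pool_index -- dict mapping (mode, name) to its pool slot
                  (assumed pool layout, not taken from the branch)
    """
    out = bytearray()
    for key, sha1 in entries:
        out += RECORD.pack(pool_index[key], sha1)
    return bytes(out)


def decode_tree(data):
    """Return a list of (pool_index, sha1) tuples from encoded records."""
    return [RECORD.unpack_from(data, off)
            for off in range(0, len(data), RECORD.size)]
```

Since the index is 2 bytes wide, a sketch like this can address at most 65536 pool entries.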
read_sha1_file() has been modified to unpack this new tree format
back into the canonical format; something that I think is very
unnecessary at runtime given how easy it is to iterate the encoded
tree, but is still critically important for tools like git-cat-file,
git-index-pack and git-verify-pack. Future plans are to iterate
the encoded tree directly, but performance is already faster despite
needing to reconvert the tree:
lh=825020c3866e7312947e17a0caa9dd1a5622bafc
git --git-dir=pack2-linux.git rev-list $lh -- include/asm-m68k
3.97 real 3.60 user 0.15 sys
3.98 real 3.60 user 0.15 sys
3.98 real 3.60 user 0.15 sys
3.98 real 3.60 user 0.15 sys
3.98 real 3.60 user 0.15 sys
git --git-dir=pack4-linux.git rev-list $lh -- include/asm-m68k
3.52 real 3.17 user 0.13 sys
3.46 real 3.17 user 0.13 sys
3.51 real 3.17 user 0.13 sys
3.52 real 3.18 user 0.13 sys
3.53 real 3.16 user 0.13 sys
I'll take nearly 500 milliseconds of savings any day, thanks! :-)
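
To illustrate the reconversion step mentioned above, here is a rough sketch of expanding encoded records back into the canonical tree format, where each entry is the octal mode, a space, the filename, a NUL byte, and the 20 byte SHA-1. It is not the read_sha1_file() change itself. The function name to_canonical and the shape of the pool (a list of (mode, filename) pairs) are assumptions made for illustration.

```python
import struct

# Same 22 byte record as in the encoded tree: 2 byte big-endian pool
# index followed by the raw 20 byte SHA-1.
RECORD = struct.Struct(">H20s")


def to_canonical(encoded, pool):
    """Expand encoded tree records into canonical tree bytes.

    encoded -- concatenated 22 byte records
    pool    -- list of (mode, name) pairs, name as bytes
               (hypothetical layout for the shared string table)
    """
    out = bytearray()
    for off in range(0, len(encoded), RECORD.size):
        index, sha1 = RECORD.unpack_from(encoded, off)
        mode, name = pool[index]
        # Canonical entry: "<octal mode> <name>\0" then the raw SHA-1.
        out += b"%o %s\0" % (mode, name)
        out += sha1
    return bytes(out)
```

A tree walker that iterates the encoded records directly would skip this expansion and look up the pool entry only for the records it actually needs.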
Nico and I have only started working on commits, so the above results
still utilize the packv2 format for OBJ_COMMIT and do not take into
account any of our proposed concepts there.
The impetus for packv4 is to format the packfile in such a way that
we can work with the data faster at runtime for common operations,
like rev-list and its builtin path limiter. We also want to make
reachability analysis (critical for packing and fsck) faster.
Any reduction in storage size is considered a bonus here, though
obviously there is some correlation between size of input data and
the time required to process it. ;-)
The patch series for this is getting large. Right now we are up to
32 patches in the series. Given where we are and where we want to
go I'm predicting this series will come out at close to 100 patches.
Of course that's partly because I'm working in fairly small units,
slowly iterating the code into the final version we want.
I am constantly rebasing the sp/pack4 topic noted above, so the
patch count is not really because I'm going back and fixing things
in later patches. It's because I'm trying to slowly iterate the
runtime side of things in digestible changes, then the packing side,
so that the system still works at every single commit in the series.
Yes, it's a *BIG* set of code changes.
Obviously this series has a heavy hand on sha1_file.c,
builtin-pack-objects.c, builtin-unpack-objects.c and index-pack.c.
But it will also start to hit less obvious places like commit.c
and tree-walk.c as we start to support walking the encoded objects
directly.
Given the huge size of the series, and the amount of effort we are
tossing into it, and the fact that I'm trying to make it pu-ready by
early next week, we would appreciate it if folks could keep changes
to the above mentioned files limited to critical bug fixes only. :)
--
Shawn.