From: Linus Torvalds <torvalds@osdl.org>
To: Keith Packard <keithp@keithp.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Repacking many disconnected blobs
Date: Wed, 14 Jun 2006 08:53:22 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0606140826200.5498@g5.osdl.org>
In-Reply-To: <1150269478.20536.150.camel@neko.keithp.com>
On Wed, 14 Jun 2006, Keith Packard wrote:
>
> parsecvs scans every ,v file and creates a blob for every revision of
> every file right up front. Once these are created, it discards the
> actual file contents and deals solely with the hash values.
>
> The problem is that while this is going on, the repository consists
> solely of disconnected objects, and I can't make git-repack put those
> into pack objects.
Ok. That's actually _easily_ rectifiable, because it turns out that your
behaviour is something that re-packing is actually really good at
handling.
The thing is, "git repack" (the wrapper function) is all about finding all
the heads of a repository, and then telling the _real_ packing logic which
objects to pack.
In other words, it literally boils down to basically
    git-rev-list --all --objects $rev_list |
        git-pack-objects --non-empty $pack_objects .tmp-pack
where "$rev_list" and "$pack_objects" are just extra flags to the two
phases that you don't really care about.
But the important point to recognize is that the pack generation itself
doesn't care about reachability or anything else AT ALL. The pack is just
a jumble of objects, nothing more. Which is exactly what you want.
> I'm assuming that if I could get these disconnected blobs all neatly
> tucked into a pack object, things might go a bit faster.
Absolutely. And it's even easy.
What you should do is to just generate a list of objects every once in a
while, and pass that list off to "git-pack-objects", which will create a
pack-file for you. Then you just move the generated pack-file (and index
file) into the .git/objects/pack directory, and then you can run the
normal "git-prune-packed", and you're done.
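As a rough sketch of those steps (not taken from git-repack itself): the helper name pack_object_list, the temporary base name under objects/, and the modern "git pack-objects" spelling are all assumptions made for illustration. Newer Git versions may also write a companion .rev file next to the pack, so the sketch moves that too if it exists.

```shell
#!/bin/sh
# Hypothetical helper: pack the objects named in a list file, install the
# pack, then drop the loose copies.
# Usage: pack_object_list <list-file> [<git-dir>]
pack_object_list() {
    list=$1
    git_dir=${2:-.git}
    tmp_base="$git_dir/objects/.tmp-pack"   # assumed temporary base name

    # pack-objects reads the object list on stdin and prints, on stdout,
    # the hash that names the files it wrote.
    name=$(git --git-dir="$git_dir" pack-objects --non-empty "$tmp_base" <"$list") || return 1

    # With --non-empty and an empty list, nothing is written or printed.
    [ -n "$name" ] || return 0

    mkdir -p "$git_dir/objects/pack"

    # Move the index last, so the pack is already in place by the time
    # its index shows up.
    for ext in pack rev idx; do
        f="$tmp_base-$name.$ext"
        if [ -f "$f" ]; then
            mv "$f" "$git_dir/objects/pack/pack-$name.$ext" || return 1
        fi
    done

    # Remove loose objects that now live in a pack.
    git --git-dir="$git_dir" prune-packed
}
```

Called as "pack_object_list objects.list", where objects.list is whatever file the importer wrote its object names to, this leaves one more pack in .git/objects/pack and no loose copies of the objects it contains.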
There's just two small subtle points to look out for:
- List the objects in "most important first" order, if you can. That
will improve locality later (the packing will try to
generate the pack so that the order you gave the objects in will be a
rough order of the result - the first objects will be together at the
beginning, the last objects will be at the end)
This is not a huge deal. If you don't have a good order, give them in
any order, and then after you're done (and you do have branches and
tag-heads), the final repack (with a regular "git repack") will fix it
all up.
You'll still get all of the size/access advantage of packfiles without
this, it just won't have the additional "nice IO patterns within the
packfile" behaviour (which mainly matters for the cold-cache case, so
you may well not care).
- Append the filename the object is associated with to the object name on
the list, if at all possible. This is what git-pack-objects will use as
part of the heuristic for finding the deltas, so this is actually a big
deal. If you forget (or mess up) the filename, packing will still
_work_ - it's just a heuristic, after all, and there are a few others
too - but the pack-file will have inferior delta chains.
(The name doesn't have to be the "real name", it really only needs to
be something unique per *,v file, but real name is probably best)
The corollary to this is that it's better to generate the pack-file
from a list of every version of a few files than it is to generate it
from a few versions of every file. Ie, if you process things one file
at a time, and create every object for that file, that is actually good
for packing, since there will be the optimal delta opportunity.
In other words, you should just feed git-pack-objects a list of objects in
the form "<sha1><space><filename>\n", and git-pack-objects will do the rest.
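To make the list format concrete, here is a minimal sketch. The function name name_objects is hypothetical, and it assumes the importer already has the blob hashes for one file available, one per line.

```shell
#!/bin/sh
# Hypothetical helper: tag every object hash read from stdin with the
# file it belongs to, producing "<sha1> <filename>" lines.
# Usage: name_objects <filename> < hashes
name_objects() {
    filename=$1
    while read -r sha; do
        # One object per line: hash, a single space, then the name hint.
        printf '%s %s\n' "$sha" "$filename"
    done
}
```

Running it once per ,v file and appending the output to a single list keeps every version of a file adjacent and under the same name, which is the case described above as good for delta finding.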
Just as a stupid example, if you were to want to pack just the _tree_ that
is the current version of a git archive, you'd do
    git-rev-list --objects HEAD^{tree} |
        git-pack-objects --non-empty .tmp-pack
which you can try on the current git tree just to see (the first line will
generate a list of all objects reachable from the current _tree_, with no
history at all; the second line will create two files under the name of
".tmp-pack-<sha1-of-object-list>.{pack|idx}").
The reason I suggest doing this for the current tree of the git archive is
simply that you can look at the git-rev-list output with "less", and see
for yourself what it actually does (and there are just a few hundred
objects there: a few tree objects, and the blob objects for every file in
the current HEAD).
So the git pack-format is actually _optimal_ for your particular case,
exactly because the pack-files don't actually care about any high-level
semantics: all they contain is a list of objects.
So in phase 1, when you generate all the objects, the simplest thing to do
is to literally just remember the last five thousand objects or so as you
generate them, and when that array of objects fills up, you just start the
"git-pack-objects" thing, and feed it the list of objects, move the
pack-file into .git/objects/pack/pack-... and do a "git prune-packed".
Then you just continue.
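The "remember a few thousand objects, then pack" loop could look roughly like this. It is only a sketch: batch_objects is a hypothetical name, and the flush command is whatever the caller passes in, assumed to accept the path of a list file as its last argument.

```shell
#!/bin/sh
# Hypothetical helper: read object-list lines from stdin and hand them
# to a flush command in batches of at most <limit> lines.
# Usage: batch_objects <limit> <flush-command> [args...]
batch_objects() {
    limit=$1
    shift
    batch=$(mktemp) || return 1
    count=0
    while IFS= read -r line; do
        printf '%s\n' "$line" >>"$batch"
        count=$((count + 1))
        if [ "$count" -ge "$limit" ]; then
            # </dev/null keeps the flush command from eating our input.
            "$@" "$batch" </dev/null || { rm -f "$batch"; return 1; }
            : >"$batch"
            count=0
        fi
    done
    # Flush whatever is left over once the input ends.
    if [ "$count" -gt 0 ]; then
        "$@" "$batch" </dev/null || { rm -f "$batch"; return 1; }
    fi
    rm -f "$batch"
}
```

With a limit of 5000 and a flush command that packs the listed objects, moves the pack into place and prunes, the importer only ever has a few thousand loose objects lying around at once.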
So this should all fit the parsecvs approach very well indeed.
Linus