* git pack-objects input list
From: Mike Hommey @ 2007-12-01 10:45 UTC
To: git

Hi,

While playing around with git-pack-objects, it seemed to me that the input it can take is not a simple list of object SHA1s. Unfortunately, the man page is not very verbose about that. While I'd happily send a patch for that, I'd prefer to actually know what kind of input it can take, and what it uses it for.

AFAICT, it can take the output of git-rev-list --all --objects (so, SHA1s followed by file names for blobs), which seems to be the same as what git-pack-objects --revs does internally, but it seems to have a strong impact on how deltas are calculated (not giving file names makes it create a smaller pack in some cases, a bigger one in other cases).

Could someone knowing the delta calculation internals enlighten me?

Thanks

Mike

* Re: git pack-objects input list
From: Linus Torvalds @ 2007-12-01 17:49 UTC
To: Mike Hommey; +Cc: git

On Sat, 1 Dec 2007, Mike Hommey wrote:
>
> While playing around with git-pack-objects, it seemed to me that the
> input it can take is not a simple list of object SHA1s.

Well, it *can* take a simple list of object SHA1's. But yes, the preferred format is a list of "SHA1 <basename>", where the basename is used as part of the heuristics on what other objects to try doing a delta against.

But if you give no basename, that heuristic just won't have the name hint, and things will still *work*, it's just more likely (but not certain) that the resulting packfile will be larger.

> Could someone knowing the delta calculation internals enlighten me?

The delta calculations simply create a small hash based on the basename, and use that to clump blobs/trees with the same basename together. That's *usually* a huge win in terms of finding good deltas, since the most likely delta is for a previous version of the same file (or tree!) and since we don't try to find deltas against *all* other blobs, but just use a sliding window, having good delta candidates close to each other is going to help a lot.

Without the basename information, the delta list will just be sorted by type and size, which works fine, but generally finds fewer deltas.

But it's all a heuristic, and it can go both ways. If you have lots of renames (which aren't just cross-directory ones, but actually change the basename), then the basename information may actually hurt.

(Btw: the hash we generate is on purpose not a very good one. It actually thinks that the last characters are "more important", so it tends to hash files that end in the same few characters together. So *.c files clump together etc. At least that's the intent).
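
To make the "SHA1 <basename>" input and the deliberately weak hash concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a copy of git's code: the names `parse_input_line` and `name_hash`, the 32-bit shift-and-add scheme, and the skipping of whitespace are all choices made for this example, and the real function in builtin-pack-objects.c may differ.

```python
from typing import Optional, Tuple


def parse_input_line(line: str) -> Tuple[str, Optional[str]]:
    """Split one input line into (sha1, name); the name hint is optional."""
    sha1, _, name = line.rstrip("\n").partition(" ")
    return sha1, (name or None)


def name_hash(name: Optional[str]) -> int:
    """Hypothetical 32-bit hash in which the last characters count the most."""
    if not name:
        return 0  # no name hint: every such object gets the same hash
    h = 0
    for ch in name:
        if ch.isspace():
            continue  # assumption: whitespace does not contribute
        # Shift the older characters down, then add the new one at the top,
        # so later characters end up in the more significant bits.
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h
```

With this scheme, changing the last character of a name moves the hash further than changing the first one: `name_hash("ab")` and `name_hash("ac")` differ by `1 << 24`, while `name_hash("ab")` and `name_hash("bb")` differ by only `1 << 22`. Sorting by such a value tends to put names with the same ending near each other, without guaranteeing it.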

See builtin-pack-objects.c:

 - type_size_sort(): this is the rule for sorting objects for deltaing.

   Type is most important (ie we always sort commits, trees, blobs separately and clump them together and effectively delta them only against objects of the same type).

   Then comes the basename hash (so that we sort objects with the same name together, and *.c files closer to each other than to *.h files, for example).

   Then comes the preferred_base (so that we sort things that already have specific delta bases together), and then the size (so that we sort files that are similar in size).

   And finally, if everything else is equal (the size will generally be identical for tree objects of the same directory with no new files but just SHA1 changes, for example) we sort by the order they were found in the history ("recency") by just comparing the pointer itself, since the original thing will be just one big array filled in by order of objects.

 - find_deltas(): this is the actual thing that does the "look through the object window and try to find good deltas", which operates on the array that was created by the type_size_sort.

Hope that clarified something.

Linus
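
The ordering and the window walk described above can be sketched in Python as follows. This is a hedged illustration, not git's implementation: `Entry`, `TYPE_ORDER`, `sort_key` and `delta_candidates` are hypothetical names, the ascending or descending direction chosen for each key is an assumption (the message only gives the priority of the keys), and the `index` field stands in for the pointer comparison used for recency.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Entry(NamedTuple):
    sha1: str
    obj_type: str         # "commit", "tree", "blob" or "tag"
    name_hash: int        # hash of the name hint, 0 if none was given
    preferred_base: bool
    size: int
    index: int            # position in the input array ("recency")


# Assumed relative order of types; only the grouping by type matters here.
TYPE_ORDER = {"commit": 0, "tree": 1, "blob": 2, "tag": 3}


def sort_key(e: Entry) -> Tuple[int, int, bool, int, int]:
    """Key priority: type, name hash, preferred_base, size, then recency."""
    # Larger objects first is an assumption made for this sketch.
    return (TYPE_ORDER[e.obj_type], e.name_hash, e.preferred_base,
            -e.size, e.index)


def delta_candidates(entries: List[Entry],
                     window: int = 10) -> Iterator[Tuple[Entry, Entry]]:
    """Yield (target, base) pairs from a sliding window over the sorted list."""
    ordered = sorted(entries, key=sort_key)
    for i, target in enumerate(ordered):
        # Only the previous `window` entries are considered as delta bases.
        for base in ordered[max(0, i - window):i]:
            if base.obj_type == target.obj_type:
                yield target, base
```

Because only neighbours inside the window are ever compared, the quality of the sort decides which deltas can be found at all; a real implementation would then try each candidate pair and keep the smallest delta.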

* Re: git pack-objects input list
From: Mike Hommey @ 2007-12-01 22:38 UTC
To: Linus Torvalds; +Cc: git

On Sat, Dec 01, 2007 at 09:49:00AM -0800, Linus Torvalds wrote:
> Hope that clarified something.

Thanks, that helped me understand my observations when trying to pack with and without file names in pack-objects input on different kinds of datasets, where some would be best packed with and others would be without.

I'll try to add some words about the pack-objects input format in the documentation. I don't know if it's worth adding information about the packing process itself in the manual page. Or maybe that should be added to a more technical document about git (a bit like "git for computer scientists").

Mike

* Re: git pack-objects input list
From: Nicolas Pitre @ 2007-12-02 2:23 UTC
To: Mike Hommey; +Cc: Linus Torvalds, git

On Sat, 1 Dec 2007, Mike Hommey wrote:

> On Sat, Dec 01, 2007 at 09:49:00AM -0800, Linus Torvalds wrote:
> > Hope that clarified something.
>
> Thanks, that helped me understand my observations when trying to pack
> with and without file names in pack-objects input on different kinds of
> datasets, where some would be best packed with and others would be without.
>
> I'll try to add some words about the pack-objects input format in the
> documentation. I don't know if it's worth adding information about the
> packing process itself in the manual page. Or maybe that should be added
> to a more technical document about git (a bit like "git for computer
> scientists")

Look at Documentation/technical/ for existing technically oriented documents. The pack format and packing heuristics have documents of their own already. If you feel like adding more documentation there, please just go ahead.

Nicolas

Thread overview: 4+ messages
2007-12-01 10:45 git pack-objects input list Mike Hommey
2007-12-01 17:49 ` Linus Torvalds
2007-12-01 22:38   ` Mike Hommey
2007-12-02  2:23     ` Nicolas Pitre