git.vger.kernel.org archive mirror
* [RFC] Packing large repositories
@ 2007-03-28  7:05 Dana How
  2007-03-28 16:53 ` Linus Torvalds
  2007-04-02  1:39 ` Sam Vilain
  0 siblings, 2 replies; 15+ messages in thread
From: Dana How @ 2007-03-28  7:05 UTC (permalink / raw)
  To: git; +Cc: danahow

Hi,

I just started experimenting with using git on
a large engineering project which has used p4 so far.
Part of a checkout is about 55GB;
after an initial commit and packing I have a 20GB+ packfile.
Of course this is unusable, since the object_entry records in an .idx
file have only 32 bits in their offset fields.  I conclude that
for such large projects, git-repack/git-pack-objects would need
new options to control the maximum packfile size.
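
To make the limit concrete, here is a small arithmetic sketch (Python, purely
illustrative). It assumes only what is stated above: a 32-bit offset field and
a pack of roughly 20GB. The names MAX_OFFSET, fits_in_index and truncated are
made up for this example and are not git identifiers.

```python
# Sketch only: why a 32-bit offset field cannot address a 20 GB packfile.
OFFSET_BITS = 32
MAX_OFFSET = (1 << OFFSET_BITS) - 1  # largest byte offset a 32-bit field can hold


def fits_in_index(offset: int) -> bool:
    """True if the byte offset is representable in a 32-bit field."""
    return 0 <= offset <= MAX_OFFSET


def truncated(offset: int) -> int:
    """What a 32-bit field would store: the offset modulo 2**32."""
    return offset & MAX_OFFSET


pack_size = 20 * 1024**3  # roughly the pack size described above
last_byte = pack_size - 1

print(MAX_OFFSET)                # the 4 GiB - 1 ceiling
print(fits_in_index(last_byte))  # False: most of the pack is unreachable
print(truncated(5 * 1024**3))    # an object at 5 GiB aliases one at 1 GiB
```

Any object stored past the 4 GiB mark would either be unrepresentable or would
alias an earlier offset, which is why the pack has to be kept below that size.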

[ I don't think this affects git-{fetch,receive,send}-pack,
since apparently only the pack is transferred, and it uses only
the variable-length size and delta base offset encodings.
(Of course, the accumulation of the 7-bit chunks in a 32-bit
 variable would need to be corrected, but at least the data
 format doesn't change.) ]
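
The accumulation issue can be sketched as follows. This is a Python model of
the variable-length size header that starts each packed object (3 type bits
and 4 size bits in the first byte, then 7 size bits per continuation byte),
not git's actual C code. The acc_bits parameter is an assumption added here to
imitate a fixed-width accumulator; the function names are hypothetical.

```python
# Sketch only: models the variable-length object header used inside a pack.
def encode_header(obj_type: int, size: int) -> bytes:
    """Encode type (3 bits) and size (4 bits, then 7 bits per extra byte)."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # high bit set: another byte follows
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def decode_header(data: bytes, acc_bits: int = 64):
    """Decode a header; acc_bits imitates the width of the accumulator."""
    mask = (1 << acc_bits) - 1
    c = data[0]
    obj_type = (c >> 4) & 0x07
    size = c & 0x0F
    shift = 4
    pos = 1
    while c & 0x80:
        c = data[pos]
        pos += 1
        size = (size + ((c & 0x7F) << shift)) & mask  # bits past acc_bits are lost
        shift += 7
    return obj_type, size, pos


big = 5 * 1024**3 + 123             # a size that needs more than 32 bits
header = encode_header(3, big)      # 3 is the blob type code in the pack format
print(decode_header(header, 64))    # recovers the full size
print(decode_header(header, 32))    # size wraps modulo 2**32
```

The encoded bytes are identical in both cases; only the width of the variable
that collects the chunks differs, which is the point made above about the data
format not needing to change.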

So I am toying with adding a --limit <size> flag to git-repack/git-pack-objects.
This cannot be used with --stdout.  If specified, e.g.
  git-repack --limit 2g
then each packfile created could be at most 2^31-1 bytes in size.
It's possible that multiple packfiles would be created in one shot.
Thus git-pack-objects could write multiple names to stdout
and git-repack would need to be updated accordingly.
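
A minimal sketch of how such a limit might split the output, in Python. It is
an illustration of the idea, not the proposed implementation: the greedy
strategy, the function names parse_limit and split_into_packs, and the
decision to raise an error for an object that cannot fit are all assumptions.
The only format facts used are the 12-byte pack header and 20-byte SHA-1
trailer.

```python
# Sketch only: greedy splitting of already-compressed objects into packs.
PACK_OVERHEAD = 12 + 20  # 12-byte pack header plus 20-byte SHA-1 trailer


def parse_limit(text: str) -> int:
    """Parse a size such as '2g' (hypothetical syntax for --limit)."""
    units = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def split_into_packs(object_sizes, max_bytes):
    """Group object sizes, in order, so no pack exceeds max_bytes."""
    packs = []
    current, used = [], PACK_OVERHEAD
    for size in object_sizes:
        if PACK_OVERHEAD + size > max_bytes:
            raise ValueError(f"object of {size} bytes cannot fit in any pack")
        if current and used + size > max_bytes:
            packs.append(current)  # close this pack and start a new one
            current, used = [], PACK_OVERHEAD
        current.append(size)
        used += size
    if current:
        packs.append(current)
    return packs


max_bytes = parse_limit("2g") - 1  # "at most 2^31-1 bytes", as described above
packs = split_into_packs([900 << 20, 900 << 20, 900 << 20], max_bytes)
print(len(packs))  # one name per pack would be written to stdout
```

With three 900 MiB objects and a 2g limit, the third object no longer fits in
the first pack, so two packs result. A real implementation would also have to
keep delta bases reachable across the split, which this sketch ignores.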

Finally, I wonder if having tree/commit/tag objects mixed into
such large packfiles would be a performance hit.
(Or maybe this will only appear once I have real history,
 not just a large initial commit.  But I can say that I now have 48K
 data blobs and 9K others.)
To find out, I may experiment with adding a --type=<types> option
to git-repack/git-pack-objects.  Thus typing
  git-repack --limit 2g --type=tree+commit+tag,blob
would cause git-pack-objects to make 2 passes over its internal
object list. On the first, it would pack tree, commit, and tag objects.
On the second, it would pack blobs. Each pass would write at
least one independent packfile (or more with --limit).  This would also
allow different incremental repacking strategies/schedules for different types.
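
The two-pass idea can be sketched like this in Python. The spec syntax is
taken from the example command above ("+" joins types within a pass, ","
separates passes); the function names and the (name, type) tuples are
hypothetical and only stand in for the internal object list.

```python
# Sketch only: one pass over the object list per comma-separated type group.
def parse_type_spec(spec: str):
    """'tree+commit+tag,blob' -> [{'tree', 'commit', 'tag'}, {'blob'}]"""
    return [set(group.split("+")) for group in spec.split(",")]


def objects_per_pass(objects, spec: str):
    """Return, for each pass, the names of the objects it would pack."""
    return [
        [name for name, obj_type in objects if obj_type in wanted]
        for wanted in parse_type_spec(spec)
    ]


# Placeholder names, not real object IDs.
objects = [
    ("c1", "commit"),
    ("t1", "tree"),
    ("b1", "blob"),
    ("b2", "blob"),
    ("g1", "tag"),
]

for names in objects_per_pass(objects, "tree+commit+tag,blob"):
    print(names)  # each pass would produce at least one packfile
```

Each inner list corresponds to one pass, so the size limit from --limit could
then be applied to each pass independently.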

Comments?

Thanks!
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

