git.vger.kernel.org archive mirror
* git clone dies (large git repository)
@ 2006-08-18 22:42 Troy Telford
  2006-08-19 10:58 ` Jakub Narebski
  2006-08-19 20:46 ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Troy Telford @ 2006-08-18 22:42 UTC (permalink / raw)
  To: git

I've got a git repository I use to manage a set of RPMs.  It's got history  
stretching back for years, and imported nicely into git.  Since it's used  
to create RPMs, the repository has a structure similar to this:
.
|--README
|-- foo
|    |--SOURCES
|    |  |--foo.tar.bz2
|    |  `--foo-build.patch
|    `--SPECS
|       `--foo.spec
`-- bar
      |--SOURCES
      |  |--bar.tar.bz2
      |  `--bar-build.patch
      `--SPECS
         `--bar.spec

The source tarballs are updated when there's a new version of the  
software; I don't need to worry about changes that are /inside/ the  
tarball-- just that the tarball itself has changed.  As you can imagine, a  
fair amount of the 'stuff' in the repository is these binary tarballs.

The total repository size (ie. the '.git' folder):  4GB

I have only one complaint (and I can work around it anyway):  I can't 'git  
clone' the repository.

If I run:
git clone git://my.server.net/git/rpms
I get the following output:

remote: Generating pack...
remote: Done counting 20971 objects.
remote: Deltifying 20971 objects.
remote:  100% (20971/20971) done
3707.885MB  (21657 kB/s)

remote: Total 20971, written 20971 (delta 9604), reused 20971 (delta 9604)
error: git-fetch-pack: unable to read from git-index-pack
error: git-index-pack died of signal 11
fetch-pack from 'git://my.server.net/git/rpms' failed.

It's interesting to note that during the pack file transfer, it stops  
incrementing at ~3700 MB; the pack file is 4.0 GB.  So either 300MB isn't  
being transferred, or it's just not updating the display for the last few  
hundred megs.

My workaround is to just use 'rsync' to copy the data (although scp works  
too), then checkout the working copy.  After that, fetch/pull and push  
work fine.

The behavior is consistent with git v1.4.1 and v1.4.2, on SLES 9, SLES 10,  
RHEL 4, and Gentoo.

The behavior is also the same whether I clone via the git daemon or the  
ssh protocol ('git clone server:/path/to/repo').

I originally had everything as loose objects.  I then ran 'git-repack -d'  
on occasion, so I had a combination of a large pack file, smaller pack  
files, and loose objects.  Finally, I tried 'git repack -a -d' and  
consolidated it all into a single 4GB pack file.  It didn't seem to make  
much difference in the output.

Am I bumping some sort of limitation within git, or have I uncovered a bug?
-- 
Troy Telford


* Re: git clone dies (large git repository)
  2006-08-18 22:42 git clone dies (large git repository) Troy Telford
@ 2006-08-19 10:58 ` Jakub Narebski
  2006-08-19 20:46 ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Jakub Narebski @ 2006-08-19 10:58 UTC (permalink / raw)
  To: git

Troy Telford wrote:

> I originally had everything as loose objects.  I then ran 'git-repack -d'  
> on occasion, so I had a combination of a large pack file, smaller pack  
> files, and loose objects.  Finally, I tried 'git repack -a -d' and  
> consolidated it all into a single 4GB pack file.  It didn't seem to make  
> much difference in the output.
> 
> Am I bumping some sort of limitation within git, or have I uncovered
> a bug? 

You _might_ have bumped into a filesystem limit on file size, or a
system limit on mmap size.

IIRC it was to be addressed (splitting packs into manageable hunks).
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: git clone dies (large git repository)
  2006-08-18 22:42 git clone dies (large git repository) Troy Telford
  2006-08-19 10:58 ` Jakub Narebski
@ 2006-08-19 20:46 ` Junio C Hamano
  2006-08-21 23:30   ` Troy Telford
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2006-08-19 20:46 UTC (permalink / raw)
  To: Troy Telford; +Cc: git

"Troy Telford" <ttelford@linuxnetworx.com> writes:

> I originally had everything as loose objects.  I then ran 'git-repack
> -d' on occasion, so I had a combination of a large pack file, smaller
> pack  files, and loose objects.  Finally, I tried 'git repack -a -d'
> and  consolidated it all into a single 4GB pack file.  It didn't seem
> to make  much difference in the output.
>
> Am I bumping some sort of limitation within git, or have I uncovered a bug?

The former.  Unfortunately this comes from an old design
decision.

Fortunately this design decision is not something irreversible
(see Chapter 1 of Documentation/ManagementStyle in the kernel
repository ;-).

The packfile is a dual-use format.  When used for network
transfer, we only send the .pack file and have the recipient
reconstruct the corresponding .idx file.  When used locally, we
need both .pack and .idx file; .pack contains the meat of the
data, and .idx allows us random access to the objects stored in
the corresponding .pack file.

What is interesting is that the .pack format does not have (as far
as I know) an inherent size limitation.  However, the .idx file has
hardcoded 32-bit offsets into the .pack -- hence, in practice, you
cannot use a .pack that is over 4GB locally.
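
To put a number on that limit: a 32-bit offset field can address at
most 2^32 - 1 bytes, which is why a .pack over 4GB cannot be indexed
(a quick arithmetic check, not part of the original message):

```shell
# Largest byte offset representable in a 32-bit field, as used by the
# .idx file for offsets into the corresponding .pack:
max_offset=$(( (1 << 32) - 1 ))
echo "$max_offset"    # prints 4294967295, i.e. just under 4GB
```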

One crude workaround that would work _today_ for your situation
without changing file formats would be to use git-fetch into an
empty repository (and do ref cloning by hand) instead of using
git-clone.  git-fetch gets .pack data over the wire and explodes
the objects contained in the stream into individual loose objects
(as opposed to git-clone, which gets .pack data, stores it as a
.pack, and tries to create the corresponding .idx, which in your
case would bust the 32-bit limit and fail).
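
A rough sketch of that workaround (untested; the refspec and branch
handling here are my assumptions, not something prescribed above):

```shell
# Sketch of the fetch-into-an-empty-repo workaround.  Fetched objects
# are exploded on the receiving end, so no giant .idx needs building.
# $1 = source repository URL or path, $2 = directory to create.
manual_clone() {
        url=$1 dir=$2
        git init -q "$dir" &&
        # fetch the remote HEAD into FETCH_HEAD
        git -C "$dir" fetch -q "$url" HEAD &&
        # "ref cloning by hand": create a branch from what we fetched
        git -C "$dir" checkout -q -b master FETCH_HEAD
}
```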

This is from a private note I sent to Linus on Jun 26 2005 when
pack & idx pairs were initially introduced.

 - Design decision.  As before, you have assumption that nothing
   is longer than 2^32 bytes.  I am not unhappy with that
   restriction with individual objects (even their uncompressed
   size limited below 4GB or even 2GB is fine --- after all we
   are talking about a source control system).  I am however
   wondering if we would regret it later to have a packed file
   also limited to 4GB by having object_entry.offset "unsigned
   long" (and fwrite htonl'ed 4 bytes).  I personally do not
   have problem with this, but I can easily see HPA frowning on
   us.  He didn't like it when I said "in GIT world, file sizes
   and offsets are of type 'unsigned long'" some time ago.

I do not have a copy of a response from Linus to this point, but
if I recall things correctly, since then, the plan always has
been (1) to limit the size of individual packfiles to fit within
the idx limit and/or (2) extend the idx format to be able to
express offset over 2^32.  The latter is possible because the idx
file is a local matter, used only for local accesses, and does
not get sent over the wire.

However, even if we revise the .idx file format, we have another
practical problem to solve.  Currently we assume that we can mmap
one packfile as a whole and do a random access into it.  This
needs to be changed so that we (perhaps optionally, only when
dealing with a huge packfile) mmap part of a .pack at a time.

I recall more recently (as opposed to the heated discussion
immediately after packfile was introduced June last year) we had
another discussion about people not being able to mmap huge
packfiles, and partial mmapping was one of the things that were
discussed there.


* Re: git clone dies (large git repository)
  2006-08-19 20:46 ` Junio C Hamano
@ 2006-08-21 23:30   ` Troy Telford
  2006-08-22  0:23     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Troy Telford @ 2006-08-21 23:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sat, 19 Aug 2006 14:46:30 -0600, Junio C Hamano <junkio@cox.net> wrote:

> What is interesting is that .pack format does not have (as far
> as I know) inherent size limitation.  However, .idx file has
> hardcoded 32-bit offsets into .pack -- hence, in practice, you
> cannot use a .pack that is over 4GB locally.

Confessing my (complete, total, frightening) ignorance about git:  Is it  
even possible to take a large pack file and split it into smaller packs?

I'm thinking of it as an option for git-repack -- the user sets the  
maximum size of any individual pack, and once that limit is reached, a  
new pack file is started (e.g. --max-size 2GB would end up with two  
packs, each 2GB in size).

That being said -- I've been able to work around it (although I haven't  
tried your suggestion yet); it's not a 'critical' problem.  I'm now just  
curious if my fantasy (above) makes sense.
-- 
Troy Telford


* Re: git clone dies (large git repository)
  2006-08-21 23:30   ` Troy Telford
@ 2006-08-22  0:23     ` Junio C Hamano
  2006-08-22  0:42       ` Jakub Narebski
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2006-08-22  0:23 UTC (permalink / raw)
  To: Troy Telford; +Cc: git

"Troy Telford" <ttelford@linuxnetworx.com> writes:

> I'm thinking of it as an option for git-repack-- that the user can set
> the maximum size of any individual pack, and after that limit is
> reached, a  new pack file is started.  (ie. --max-size 2GB) and will
> end up with two  packs, each 2GB in size.

The way I would suggest you do it is not by size but by distance
from the latest.  If you want to split the kernel history, for
example, you repack everything up to 2.6.14, and then repack the
remainder.  That way, you can optimize for size for older
(presumably less frequently used) data while optimizing for
speed for more recent stuff.

There is no wrapper support for the above splitting in
git-repack.  The low-level plumbing tools can be used this way
for example:

	name=`
                git rev-list --objects $list_old_tags_here |
                git pack-objects --window=50 --depth=50 --non-empty .tmp-pack
        ` &&
        mv -f .tmp-pack-$name.{pack,idx} .git/objects/pack/

	name=`
                git rev-list --objects --all --not $list_old_tags_here |
                git pack-objects --non-empty .tmp-pack
        ` &&
        mv -f .tmp-pack-$name.{pack,idx} .git/objects/pack/

If you are splitting into more than two, you would instead have
more than one $list_old_tags_here list, and iterate them
through, something like:

	pack_between () {
                already_done="$1"
                do_this_time="$2"
                w=${3-10}
                name=`
                        git rev-list --objects \
                                $do_this_time \
                                --not $already_done |
                        git pack-objects --window=$w --depth=$w \
                                --non-empty .tmp-pack
                ` &&
                mv -f .tmp-pack-$name.{pack,idx} .git/objects/pack/
	}

	pack_between "" "$prehistoric_tag_list" 100
	pack_between "$prehistoric_tag_list" "$more_recent_tag_list" 50
	pack_between "$more_recent_tag_list" --all

All untested, of course, so do not play with it in a precious
repository of which you do not have any other copy, but hopefully
you get the idea ;-).


* Re: git clone dies (large git repository)
  2006-08-22  0:23     ` Junio C Hamano
@ 2006-08-22  0:42       ` Jakub Narebski
  0 siblings, 0 replies; 6+ messages in thread
From: Jakub Narebski @ 2006-08-22  0:42 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

> "Troy Telford" <ttelford@linuxnetworx.com> writes:
> 
>> I'm thinking of it as an option for git-repack -- the user sets the
>> maximum size of any individual pack, and once that limit is reached,
>> a new pack file is started (e.g. --max-size 2GB would end up with two
>> packs, each 2GB in size).
> 
> The way I would suggest you do it is not by size but by distance
> from the latest.  If you want to split the kernel history, for
> example, you repack everything up to 2.6.14, and then repack the
> remainder.  That way, you can optimize for size for older
> (presumably less frequently used) data while optimizing for
> speed for more recent stuff.

If there were some enhancement to pack files allowing either limiting
the size of a pack (e.g. for filesystem limits or mmap limits) and/or
mmapping only a fragment or fragments of a pack file, then the maximal
size of the pack, or the maximal size of the mmapped fragment, should
be configurable per repository, not only as an option to some command
(git-repack for example).
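
Such a per-repository knob could be expressed through git-config; the
key names below are purely illustrative, not existing options:

```shell
# Sketch: store hypothetical per-repository pack limits in the repo's
# config.  Key names here are invented for illustration only.
# $1 = repository to configure.
set_pack_limits() {
        git -C "$1" config pack.sizeLimit 2g
        git -C "$1" config core.mmapWindowSize 32m
}
```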

This would probably need some enhancement to git-fetch/git-clone too...

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

