git.vger.kernel.org archive mirror
* Errors cloning large repo
@ 2007-03-09 19:20 Anton Tropashko
  2007-03-09 21:37 ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Tropashko @ 2007-03-09 19:20 UTC (permalink / raw)
  To: git

I managed to stuff 8.5 GB worth of files into a git repo (in two git commits, since
it was running out of memory when I gave it the -a option)

but when I'm cloning to another linux box I get:

Generating pack...
Done counting 152200 objects.
Deltifying 152200 objects.
 80% (122137/152200) done
 100% (152200/152200) done
/usr/bin/git-clone: line 321:  2072 File size limit exceededgit-fetch-pack --all -k $quiet "$repo"


It would be nice to be able to work around this somehow if the bug cannot be fixed.
1.5.0 on the server
1.4.1 on the client

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-09 19:20 Anton Tropashko
@ 2007-03-09 21:37 ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-03-09 21:37 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: git



On Fri, 9 Mar 2007, Anton Tropashko wrote:
>
> I managed to stuff 8.5 GB worth of files into a git repo (in two git commits, since
> it was running out of memory when I gave it the -a option)

Heh. Your usage scenario may not be one where git is useful. If a single 
commit generates that much data, git will likely perform horribly badly. 
But it's an interesting test-case, and I don't think anybody has really 
*tried* this before, so don't give up yet.

First off, you shouldn't really need two commits. It's true that "git 
commit -a" will probably have memory usage issues (because a single "git 
add" will keep it all in memory while it generates the objects), but it 
should be possible to just use "git add" to add even 8.5GB worth of data 
in a few chunks, and then a single "git commit" should commit it.

So you might be able to just do

	git add dir1
	git add dir2
	git add dir3
	..
	git commit

or something.
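If the top-level layout isn't known in advance, the chunking can also be
scripted; a minimal sketch (assuming the big directories all sit at the top
level of the work tree) would be

	for d in */
	do
		git add "$d"	# one top-level directory per git-add invocation
	done
	git add .		# pick up any remaining top-level files
	git commit

but the hand-written version above is just as good.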

But one caveat: git may not be the right tool for the job. May I inquire 
what the heck you're doing? We may be able to fix git even for your kinds 
of usage, but it's also possible that 
 (a) git may not suit your needs
 (b) you might be better off using git differently

Especially when it comes to that "(b)" case, please realize that git is 
somewhat different from something like CVS at a very fundamental level.

CVS in many ways can more easily track *humongous* projects, for one very 
simple reason: CVS really deep down just tracks individual files.

So people who have used CVS may get used to the notion of putting 
everything in one big repository, because in the end, it's just a ton of 
small files to CVS. CVS never really looks at the big picture - even doing 
something like merging or doing a full checkout is really just iterating 
over all the individual files.

So if you put a million files in a CVS repository, it's just going to 
basically loop over those million files, but they are still just 
individual files. There's never any operation that works on *all* of the 
files at once.

Git really is *fundamentally* different here. Git takes completely the 
opposite approach, and git never tracks individual files at all at any 
level, really. Git almost doesn't care about file boundaries (I say 
"almost", because obviously git knows about them, and they are visible in 
myriads of ways, but at the same time it's not entirely untrue to say that 
git really doesn't care).

So git scales in a very different way from CVS. Many things are tons 
faster (because git does many operations a full directory structure at a 
time, and that makes merges that only touch a few subdirectories *much* 
faster), but on the other hand, it means that git will consider everything 
to be *related* in a way that CVS never does.

So, for example, if your 8.5GB thing is something like your whole home 
directory, putting it as one git archive now ties everything together and 
that can cause issues that really aren't very nice. Tying everything 
together is very important in a software project (the "total state" is 
what matters), but in your home directory, many things are simply totally 
independent, and tying them together can be the wrong thing to do.

So I'm not saying that git won't work for you, I'm just warning that the 
whole model of operation may or may not actually match what you want to 
do. Do you really want to track that 8.5GB as *one* entity?

> but when I'm cloning to another linux box I get:
> 
> Generating pack...
> Done counting 152200 objects.
> Deltifying 152200 objects.

.. this is the part that makes me think git *should* be able to work for you. 
Having lots of smallish files is much better for git than a few DVD 
images, for example. And if those 152200 objects are just from two 
commits, you obviously have lots of files ;)

However, if it packs really badly (and without any history, that's quite 
likely), maybe the resulting pack-file is bigger than 4GB, and then you'd 
have trouble (in fact, I think you'd hit trouble at the 2GB pack-file 
mark).

Does "git repack -a -d" work for you?

> /usr/bin/git-clone: line 321:  2072 File size limit exceededgit-fetch-pack --all -k $quiet "$repo"

"File size limit exceeded" sounds like SIGXFSZ, which is either:

 - you have file limits enabled, and the resulting pack-file was just too 
   big for the limits.

 - the file size is bigger than MAX_NON_LFS (2GB-1), and we don't use 
   O_LARGEFILE.

I suspect the second case. Shawn and Nico have worked on 64-bit packfile 
indexing, so they may have a patch / git tree for you to try out.
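For what it's worth, the two cases can be told apart from the shell; a rough
check (assuming strace is available on the box that's writing the pack) is

	ulimit -f	# "unlimited" rules out the first case
	strace -f -e trace=open git repack -a -d 2>&1 | grep 'objects/pack'

and if the open() that creates the new pack shows up without O_LARGEFILE,
it's the second case.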

			Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
@ 2007-03-09 23:48 Anton Tropashko
  2007-03-10  0:54 ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Tropashko @ 2007-03-09 23:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Answers inline, prefixed with >>>>, while I'm trying to figure out how to deal
with the new "improvement" to Yahoo mail beta.

On Fri, 9 Mar 2007, Anton Tropashko wrote:
>
> I managed to stuff 8.5 GB worth of files into a git repo (in two git commits, since
> it was running out of memory when I gave it the -a option)

So you might be able to just do

    git add dir1
    git add dir2
    git add dir3
    ..
    git commit

or something.

>>>>>>>>>>>>  For some reason git add . swallowed the whole thing
>>>>>>>>>>>>  but git commit did not and I had to split it up. I trimmed the tree a bit
>>>>>>>>>>>>  since then by removing c & c++ files ;-)

But one caveat: git may not be the right tool for the job. May I inquire 
what the heck you're doing? We may be able to fix git even for your kinds 

>>>>>>>>>>>>  I dumped a rather large SDK into it. Headers, libraries,
>>>>>>>>>>>>  even crs.o from the toolchains that are part of the SDK. The idea is to keep
>>>>>>>>>>>>  the SDK versioned and be able to pull an arbitrary version once tagged.

So I'm not saying that git won't work for you, I'm just warning that the 
whole model of operation may or may not actually match what you want to 
do. Do you really want to track that 8.5GB as *one* entity?

>>>>>>>>>>>> Yes. It would be nice if I didn't have to prune pdfs, txts, and who
>>>>>>>>>>>> knows what else people put in there just to reduce the size.

> but when I'm cloning to another linux box I get:
> 
> Generating pack...
> Done counting 152200 objects.
> Deltifying 152200 objects.

.. this is the part that makes me think git *should* be able to work for you. 
Having lots of smallish files is much better for git than a few DVD 
images, for example. And if those 152200 objects are just from two 
commits, you obviously have lots of files ;)

However, if it packs really badly (and without any history, that's quite 
likely), maybe the resulting pack-file is bigger than 4GB, and then you'd 
have trouble (in fact, I think you'd hit trouble at the 2GB pack-file 
mark).

Does "git repack -a -d" work for you?

>>>>>>>>>>>> I'll tell you as soon as I get another failure. As you
>>>>>>>>>>>> might guess it takes a while :-]

> /usr/bin/git-clone: line 321:  2072 File size limit exceededgit-fetch-pack --all -k $quiet "$repo"

"File size limit exceeded" sounds like SIGXFSZ, which is either:

 - you have file limits enabled, and the resulting pack-file was just too 
   big for the limits.

 - the file size is bigger than MAX_NON_LFS (2GB-1), and we don't use 
   O_LARGEFILE.

I suspect the second case. Shawn and Nico have worked on 64-bit packfile 
indexing, so they may have a patch / git tree for you to try out.

>>>>>>>>>>>> Ok. I think you're correct:
from ulimit -a:
...
file size             (blocks, -f) unlimited
...

Good to know developers are ahead of the users.

Is there a way to get rid of pending (uncommitted) changes?
git revert does not work the same way as svn revert, as I just discovered,
and git status still reports a ton of pending deletions
(I changed my mind and need my object files back). I suppose I could move .git
out of the way, blow away all the files, move it back, and git pull or whatever
does a local checkout, but there must be a better way.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-09 23:48 Anton Tropashko
@ 2007-03-10  0:54 ` Linus Torvalds
  2007-03-10  2:03   ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10  0:54 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: git


On Fri, 9 Mar 2007, Anton Tropashko wrote:
>
> > So you might be able to just do
> > 
> >     git add dir1
> >     git add dir2
> >     git add dir3
> >     ..
> >     git commit
> > 
> > or something.
>
> For some reason git add . swallowed the whole thing
> but git commit did not and I had to split it up. I trimmed the tree a bit
> since then by removing c & c++ files ;-)

Ok, that's a bit surprising, since "git commit" actually should do less 
than "git add .", but it's entirely possible that just the status message 
generation ends up doing strange things for a repository with that many 
files in it.

I should try it out with some made-up auto-generated directory setup, but 
I'm not sure I have the energy to do it ;)

> > But one caveat: git may not be the right tool for the job. May I inquire 
> > what the heck you're doing? We may be able to fix git even for your kinds 
>
> I dumped a rather large SDK into it. Headers, libraries,
> even crs.o from the toolchains that are part of the SDK. The idea is to keep
> the SDK versioned and be able to pull an arbitrary version once tagged.

Ok. Assuming most of this doesn't change very often (ie the crs.o files 
aren't actually *generated*, but come from some external thing), git 
should do well enough once it's past the original hump.

So your usage scenario doesn't sound insane, and it's something we should 
be able to support well enough. 

> > So I'm not saying that git won't work for you, I'm just warning that the 
> > whole model of operation may or may not actually match what you want to 
> > do. Do you really want to track that 8.5GB as *one* entity?
>
> Yes. It would be nice if I didn't have to prune pdfs, txts, and who
> knows what else people put in there just to reduce the size.

Sure. 8.5GB is absolutely huge, and clearly you're hitting some problems 
here, but if we're talking things like having a whole development 
environment with big manuals etc, it might be a perfectly valid usage 
scenario.

That said, it might also be a good idea (regardless of anything else) to 
split things up, if only because it's quite possible that not everybody is 
interested in having *everything*. Forcing people to work with an 8.5GB 
repository when they might not care about it all could be a bad idea.

> >  - the file size is bigger than MAX_NON_LFS (2GB-1), and we don't use 
> >    O_LARGEFILE.
>
> Ok. I think you're correct:
> from ulimit -a:
> ...
> file size             (blocks, -f) unlimited

Ok, then it's the 2GB limit that the OS puts on you unless you tell it to 
use O_LARGEFILE.

Which is just as well, since the normal git pack-files won't index past 
that size *anyway* (ok, so it should index all the way up to 4GB, but it's 
close enough..)

> Good to know developers are ahead of the users.

Well, not "ahead enough" apparently ;)

I was seriously hoping that we could keep off the 64-bit issues for a bit 
longer, since the biggest real archive (firefox) we've seen so far was 
barely over half a gigabyte.

> Is there a way to get rid of pending (uncommitted) changes?

"git reset --hard" will do it for you. As will "git checkout -f", for that 
matter.

"git revert" will just undo an old commit (as you apparently already found 
out)
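In other words, as a quick summary (the commit name below is whatever you
want to undo):

	git reset --hard	# throw away uncommitted index and working tree changes
	git checkout -f		# force-checkout files from HEAD over local edits
	git revert <commit>	# create a *new* commit that reverses an old one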

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
@ 2007-03-10  1:21 Anton Tropashko
  2007-03-10  1:45 ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Tropashko @ 2007-03-10  1:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

> I should try it out with some made-up auto-generated directory setup, but 
> I'm not sure I have the energy to do it ;)

but your /usr should be large enough if /usr/local and /usr/local/src are not!!!
I don't think you need to generate anything.
Or are you saying that the problem is the number of files I have, not the
total size of the files? In any event there should be plenty of files in /usr.

> That said, it might also be a good idea (regardless of anything else) to 
> split things up, if only because it's quite possible that not everybody is 
> interested in having *everything*. Forcing people to work with an 8.5GB 
> repository when they might not care about it all could be a bad idea.

> "git reset --hard" will do it for you. As will "git checkout -f", for that 
> matter.

> "git revert" will just undo an old commit (as you apparently already found 
> out)

Yep. I found that checkout -f works before I got the reset alternative.

I was pleased that git did not lock me out of committing a few
deletions for *.pdf, *.doc and makefiles after repack started.
repack -a -d just finished and I started the clone again.
It's already deltifying at 6%.

Thank you.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  1:21 Anton Tropashko
@ 2007-03-10  1:45 ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10  1:45 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: Git Mailing List



On Fri, 9 Mar 2007, Anton Tropashko wrote:
> 
> but your /usr should be large enough if /usr/local and /usr/local/src 
> are not!!!

I don't like the size distribution.

My /usr has 181585 files, but is 4.0G in size, which doesn't match yours. 
Also, I've wanted to generate bogus data for a while, just for testing, so 
I wrote this silly program that I can tweak the size distribution for.

It gives me something that approaches your distribution (I ran it a few 
times, I now have 110402 files, and 5.7GB of space according to 'du').

It's totally unrealistic wrt packing, though (no deltas, and no 
compression, since the data itself is all random), and I don't know how to 
approximate that kind of detail sanely.

I'll need to call it a day for the kids' dinner etc, so I'm probably done 
for the day. I'll play with this a bit more to see if I can find various 
scalability issues (and just ignore the delta/compression problem - you 
probably don't have many deltas either, so I'm hoping that the fact 
that I only have 5.7GB will approximate your data thanks to it not being 
compressible).

		Linus

---
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/fcntl.h>

/*
 * Create a file with a random size in the range
 * 0-1MB, but with a "pink noise"ish distribution
 * (ie equally many files in the 1-2kB range as in
 * the half-meg to megabyte range).
 */
static void create_file(const char *name)
{
	int i;
	int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0666);
	static char buffer[1000];
	unsigned long size = random() % (1 << (10+(random() % 10)));

	if (fd < 0)
		return;
	for (i = 0; i < sizeof(buffer); i++)
		buffer[i] = random();
	while (size) {
		int len = sizeof(buffer);
		if (len > size)
			len = size;
		write(fd, buffer, len);
		size -= len;
	}
	close(fd);
}

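/*
 * Recursively populate "base" with randomly named files and subdirectories;
 * dir_likely/end_likely (rescaled by the *_expand factors at each level)
 * control how deep and how wide the tree gets.
 */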
static void start(const char *base,
	float dir_likely, float dir_expand,
	float end_likely, float end_expand)
{
	int len = strlen(base);
	char *name = malloc(len + 10);

	mkdir(base, 0777);

	memcpy(name, base, len);
	name[len++] = '/';

	dir_likely *= dir_expand;
	end_likely *= end_expand;

	for (;;) {
		float rand = (random() & 65535) / 65536.0;

		sprintf(name + len, "%ld", random() % 1000000);
		rand -= dir_likely;
		if (rand < 0) {
			start(name, dir_likely, dir_expand, end_likely, end_expand);
			continue;
		}
		rand -= end_likely;
		if (rand < 0)
			break;
		create_file(name);
	}
}

int main(int argc, char **argv)
{
	/*
	 * Tune the numbers to your liking..
	 *
	 * The floats are:
	 *  - dir_likely (likelihood of creating a recursive directory)
	 *  - dir_expand (how dir_likely behaves as we move down recursively)
	 *  - end_likely (likelihood of ending file creation in a directory)
	 *  - end_expand (how end_likely behaves as we move down recursively)
	 *
	 * The numbers 0.3/0.6 0.02/1.1 are totally made up, and for me
	 * generate a tree of between a few hundred files and a few tens 
	 * of thousands of files.
	 *
	 * Re-run several times to generate more files in the tree.
	 */
	srandom(time(NULL));
	start("tree",
		0.3, 0.6,
		0.02, 1.1);
	return 0;
}
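(Assuming the program above is saved as, say, maketree.c, building and running
it is just

	gcc -O2 -o maketree maketree.c
	./maketree		# populates ./tree; re-run to add more files
	find tree -type f | wc -l && du -sh tree

and the file name is of course arbitrary.)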

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  0:54 ` Linus Torvalds
@ 2007-03-10  2:03   ` Linus Torvalds
  2007-03-10  2:12     ` Junio C Hamano
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10  2:03 UTC (permalink / raw)
  To: Anton Tropashko, Junio C Hamano; +Cc: Git Mailing List



On Fri, 9 Mar 2007, Linus Torvalds wrote:
> >
> > For some reason git add . swallowed the whole thing
> > but git commit did not and I had to split it up. I trimmed the tree a bit
> > since then by removing c & c++ files ;-)
> 
> Ok, that's a bit surprising, since "git commit" actually should do less 
> than "git add .", but it's entirely possible that just the status message 
> generation ends up doing strange things for a repository with that many 
> files in it.

Ahhh. Found it.

It's indeed "git commit" that takes tons of memory, but for all the wrong 
reasons. It does a "git diff-tree" to generate the diffstat, and *that* is 
extremely expensive:

	git-diff-tree --shortstat --summary --root --no-commit-id HEAD --

I suspect we shouldn't bother with the diffstat for the initial commit. 
Just removing "--root" might be sufficient. 
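(The cost is easy to reproduce by hand on such a repository:

	time git diff-tree --shortstat --summary --root --no-commit-id HEAD --

whereas the same command without --root should return almost immediately,
since the initial commit has no parent to diff against.)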

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  2:03   ` Linus Torvalds
@ 2007-03-10  2:12     ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2007-03-10  2:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anton Tropashko, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> It's indeed "git commit" that takes tons of memory, but for all the wrong 
> reasons. It does a "git diff-tree" to generate the diffstat, and *that* is 
> extremely expensive:
>
> 	git-diff-tree --shortstat --summary --root --no-commit-id HEAD --
>
> I suspect we shouldn't bother with the diffstat for the initial commit. 
> Just removing "--root" might be sufficient. 

Yes and no.  It was added as a response to suggestions from
people in the "baby step tutorial" camp.

An option to disable the last diff-tree step, just like git-pull
has a --no-summary option, would be perfectly fine, though.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
@ 2007-03-10  2:37 Anton Tropashko
  2007-03-10  3:07 ` Shawn O. Pearce
  2007-03-10  5:10 ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Anton Tropashko @ 2007-03-10  2:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

> I suspect we shouldn't bother with the diffstat for the initial commit. 
> Just removing "--root" migth be sufficient.

My problem is git-clone, though, since for commit it's no big deal
to git commit [a-c]*, or to use xargs as a workaround.

For git clone I got this

Deltifying 144511 objects.
 100% (144511/144511) done
1625.375MB  (1713 kB/s)       
1729.057MB  (499 kB/s)       
/usr/bin/git-clone: line 321: 24360 File size limit exceededgit-fetch-pack --all -k $quiet "$repo"

again after git repack, and I don't see how to work around that aside from artificially
splitting the tree at the top or resorting to a tarball on an ftp site.
That 64-bit indexing code you previously mentioned would force me to upgrade git on both ends?
Anywhere I can pull it out from?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  2:37 Anton Tropashko
@ 2007-03-10  3:07 ` Shawn O. Pearce
  2007-03-10  5:54   ` Linus Torvalds
                     ` (2 more replies)
  2007-03-10  5:10 ` Linus Torvalds
  1 sibling, 3 replies; 25+ messages in thread
From: Shawn O. Pearce @ 2007-03-10  3:07 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: Linus Torvalds, git

Anton Tropashko <atropashko@yahoo.com> wrote:
> again after git repack, and I don't see how to work around that aside from artificially
> splitting the tree at the top or resorting to a tarball on an ftp site.
> That 64-bit indexing code you previously mentioned would force me to upgrade git on both ends?
> Anywhere I can pull it out from?

I'm shocked you were able to repack an 8.5 GiB repository.
The default git-repack script that we ship assumes you want to
combine everything into one giant packfile; this is what is also
happening during git-clone.  Clearly your system is rejecting this
packfile; and even if the OS allowed us to make that file, the
index offsets would all be wrong, as they are only 32 bits wide.
The repository becomes corrupt when those overflow.


Troy Telford (with the help of Eric Biederman) recently posted a
patch that attempts to push the index to 64 bits:

  http://thread.gmane.org/gmane.comp.version-control.git/40680/focus=40999

You can try Troy's patch.  Nico's and my 64-bit index work is *not*
ready for anyone to use.  It doesn't exist as a compilable chunk
of code.  ;-)

Just to warn you, I have (re)done some of Troy's changes and Junio
has applied them to the current 'master' branch.  So Troy's patch
would need to be applied to something that is further back, like
around 2007-02-28 (when Troy sent the patch).  But my changes alone
are not enough to get "64 bit packfiles" working.


As Linus said earlier in this thread, Nico and I are working on
pushing out the packfile limits, just not fast enough for some users'
needs apparently (sorry about that!).  Troy's patch was rejected
mainly because it is a file format change that is not backwards
compatible (once you use the 64 bit index, anything accessing that
repository *must* also support that).

Nico and I are working on other file format changes that are
more extensive than just expanding the index out to 64 bits, and
likewise are also not backwards compatible.  To help users manage
the upgrades, we want to do a single file format change in 2007,
not two.  So we are trying to be very sure that what we give Junio
for final application really is the best we can do this year.

Otherwise we would have worked with Troy to help test his patch and
get that into shape for application to the main git.git repository.


One thing that you could do is segment the repository into multiple
packfiles yourself, and then clone using rsync or http, rather than
using the native Git protocol.

For segmenting the repository, you would do something like:

	git rev-list --objects HEAD >S
	# segment S up into several files, e.g. T1, T2, T3
	for s in T*
	do
		name=$(git pack-objects tmp <$s)
		touch .git/objects/pack/pack-$name.keep
		mv tmp-$name.pack .git/objects/pack/pack-$name.pack
		mv tmp-$name.idx .git/objects/pack/pack-$name.idx
	done
	git prune-packed

The trick here is to segment S up into enough T1, T2, ... files such
that when packed they each are less than 2 GiB.  You can then clone
this repository by copying the .git directory using more standard
filesystem tools, which is what a clone with rsync or http is
(more or less) doing.
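The segmenting itself can be as crude as a fixed line-count split of S; the
chunk size below is just a guess you would tune until each resulting pack
stays under the limit:

	split -l 50000 S T	# produces Taa, Tab, ... which the T* glob picks up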

Yes, the above process is horribly tedious and has some trial
and error involved in terms of selecting the packfile segmenting.
We don't have anything that can automate this right now.
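Once the packs are split, the clone itself is just a dumb-transport copy
(hostname and path below are placeholders, and for the http case the server
needs "git update-server-info" to have been run, e.g. from the post-update
hook):

	git clone rsync://server/path/to/sdk.git sdk
	git clone http://server/path/to/sdk.git sdk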

-- 
Shawn.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  2:37 Anton Tropashko
  2007-03-10  3:07 ` Shawn O. Pearce
@ 2007-03-10  5:10 ` Linus Torvalds
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10  5:10 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: git



On Fri, 9 Mar 2007, Anton Tropashko wrote:
> 
> My problem is git-clone, though, since for commit it's no big deal
> to git commit [a-c]*, or to use xargs as a workaround.

Sure, but there were two problems.

The "git commit" problem is trivial, and in no way fundamental. The thing 
that uses tons of memory is literally just eyecandy, to show you *what* 
you're committing.

In fact, by the time it starts using tons of memory, the commit has 
literally already happened. It's just doing statistics afterwards that 
bloats it up.

> For git clone I got this

The "git clone" problem is different, in that it's due to the 2GB 
pack-file limit. It's not "fundamentally hard" either, but it's at least 
not just a small tiny silly detail.

In fact, you can just do

	git add .
	git commit -q

and the "-q" flag (or "--quiet") will mean that the diffstat is never 
done, and the commit should be almost instantaneous (all the real work is 
done by the "git add .")

So "git commit" issue really is just a small beauty wart.

> Deltifying 144511 objects.
>  100% (144511/144511) done
> 1625.375MB  (1713 kB/s)       
> 1729.057MB  (499 kB/s)       
> /usr/bin/git-clone: line 321: 24360 File size limit exceededgit-fetch-pack --all -k $quiet "$repo"
> 
> again after git repack, and I don't see how to work around that aside from artificially
> splitting the tree at the top or resorting to a tarball on an ftp site.

So the "git repack" actually worked for you? It really shouldn't have 
worked.

Is the server side perhaps 64-bit? If so, the limit ends up being 4GB 
instead of 2GB, and your 8.5GB project may actually fit.

If so, we can trivially fix it with the current index file even for a 
32-bit machine. The reason we limit pack-files to 2GB on 32-bit machines 
is purely that we don't use O_LARGEFILE. If we enable O_LARGEFILE, that 
moves the limit up from 31 bits to 32 bits, and it might be enough for 
you. No new data structures for the index necessary at all.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  3:07 ` Shawn O. Pearce
@ 2007-03-10  5:54   ` Linus Torvalds
  2007-03-10  6:01     ` Shawn O. Pearce
  2007-03-10 10:27   ` Jakub Narebski
       [not found]   ` <82B0999F-73E8-494E-8D66-FEEEDA25FB91@adacore.com>
  2 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10  5:54 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Anton Tropashko, git



On Fri, 9 Mar 2007, Shawn O. Pearce wrote:
> 
> I'm shocked you were able to repack an 8.5 GiB repository.

Side note - it would be nice to hear just how big the repository *really* 
is.

For example, if "du -sh" says 8.5GB, it doesn't necessarily mean that 
there really is 8.5GB of data there.

With a normal 4kB blocksize filesystem, and ~150,000 filesystem objects, 
you'd have an average of 300MB of just padding (roughly 2kB per file). 
Depending on the file statistics, it could be even more.
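Back-of-the-envelope, using the object count from the clone output and an
assumed 2kB of average tail-block waste per file:

	echo $(( 152200 * 2048 / 1024 / 1024 ))		# ~297 MB of padding

which is where the 300MB figure comes from.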

And if it's compressible, it's entirely possible that even without 
much delta compression, it could fit in a pack-file smaller than 4GB. At 
which point a 32-bit index file should work fine, just not with a 32-bit 
off_t.

So this really could be a situation where just small tweaks makes it work 
out for now. We'll need the full 64-bit index eventually for sure, but..

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  5:54   ` Linus Torvalds
@ 2007-03-10  6:01     ` Shawn O. Pearce
  2007-03-10 22:32       ` Martin Waitz
  0 siblings, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-03-10  6:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Anton Tropashko, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 9 Mar 2007, Shawn O. Pearce wrote:
> > 
> > I'm shocked you were able to repack an 8.5 GiB repository.
> 
> Side note - it would be nice to hear just how big the repository *really* 
> is.
> 
> For example, if "du -sh" says 8.5GB, it doesn't necessarily mean that 
> there really is 8.5GB of data there.

Oh, good point.  Thanks for reminding me of reality.

I'm just so used to not looking at repository size unless the
repository has been fully repacked first.  So I somehow just read
this thread as Anton having 8.5 GiB worth of *packed* data (where
filesystem wastage in the tail block is minimal) and not 8.5 GiB
of loose objects.

It's very likely this did fit in just under 4 GiB of packed data,
but as you said, without O_LARGEFILE we can't work with it.
 
> So this really could be a situation where just small tweaks makes it work 
> out for now. We'll need the full 64-bit index eventually for sure, but..

Yes.  ;-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  3:07 ` Shawn O. Pearce
  2007-03-10  5:54   ` Linus Torvalds
@ 2007-03-10 10:27   ` Jakub Narebski
  2007-03-11  2:00     ` Shawn O. Pearce
       [not found]   ` <82B0999F-73E8-494E-8D66-FEEEDA25FB91@adacore.com>
  2 siblings, 1 reply; 25+ messages in thread
From: Jakub Narebski @ 2007-03-10 10:27 UTC (permalink / raw)
  To: git

Shawn O. Pearce wrote:

> One thing that you could do is segment the repository into multiple
> packfiles yourself, and then clone using rsync or http, rather than
> using the native Git protocol.

By the way, it would be nice to talk about fetch / clone
support for sending (and creating) _multiple_ pack files. Besides
the situation where we must use more than one packfile because
of size limits, it would also help clone, as it could send existing
packs and pack only loose objects (trading perhaps some bandwidth
for CPU load on the server; think kernel.org).

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
       [not found]   ` <82B0999F-73E8-494E-8D66-FEEEDA25FB91@adacore.com>
@ 2007-03-10 22:21     ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10 22:21 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Shawn O. Pearce, Anton Tropashko, git



On Sat, 10 Mar 2007, Geert Bosch wrote:
> 
> Larger packs might still be sent over the network, but they
> wouldn't have an index and could be constructed on the fly,
> without ever writing any multi-gigabyte files to disk.

I have to say, I think that's a good idea. Rather than supporting a 64-bit 
index, not generating big pack-files in general is probably a great idea.

For the streaming formats, we'd obviously generate arbitrarily large 
pack-files, but as you say, they never have an index at all, and the 
receiver always re-writes them *anyway* (ie we now always run 
"git-index-pack --fix-thin" on them), so we could just modify that 
"--fix-thin" logic to also split the pack when it reaches some arbitrary 
limit.

Some similar logic in git-pack-objects would mean that we'd never generate 
bigger packs in the first place..

It's not that 64-bit index file support is "wrong", but it does seem like 
it's not really necessary.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10  6:01     ` Shawn O. Pearce
@ 2007-03-10 22:32       ` Martin Waitz
  2007-03-10 22:46         ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Martin Waitz @ 2007-03-10 22:32 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Anton Tropashko, git

[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

hoi :)

On Sat, Mar 10, 2007 at 01:01:44AM -0500, Shawn O. Pearce wrote:
> It's very likely this did fit in just under 4 GiB of packed data,
> but as you said, without O_LARGEFILE we can't work with it.

but newer git versions can cope with it:

-r--r--r-- 1 martin martin 3847536413 18. Feb 10:36 pack-ffe867679d673ea5fbfa598b28aca1e58528b8cd.pack

-- 
Martin Waitz

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10 22:32       ` Martin Waitz
@ 2007-03-10 22:46         ` Linus Torvalds
  2007-03-11 21:35           ` Martin Waitz
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2007-03-10 22:46 UTC (permalink / raw)
  To: Martin Waitz; +Cc: Shawn O. Pearce, Anton Tropashko, git



On Sat, 10 Mar 2007, Martin Waitz wrote:
> 
> On Sat, Mar 10, 2007 at 01:01:44AM -0500, Shawn O. Pearce wrote:
> > It's very likely this did fit in just under 4 GiB of packed data,
> > but as you said, without O_LARGEFILE we can't work with it.
> 
> but newer git versions can cope with it:
> 
> -r--r--r-- 1 martin martin 3847536413 18. Feb 10:36 pack-ffe867679d673ea5fbfa598b28aca1e58528b8cd.pack

Are you sure you're not just running a 64-bit process?

64-bit processes don't need O_LARGEFILE to process files larger than 2GB, 
since for them, off_t is already 64-bit.

Grepping for O_LARGEFILE shows nothing.

Oh, except we have that 

	#define _FILE_OFFSET_BITS 64

which is just a horrible hack. That's nasty. We should just use 
O_LARGEFILE rather than depend on some internal glibc thing that works 
nowhere else.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10 10:27   ` Jakub Narebski
@ 2007-03-11  2:00     ` Shawn O. Pearce
  2007-03-12 11:09       ` Jakub Narebski
  0 siblings, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-03-11  2:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> Shawn O. Pearce wrote:
> 
> > One thing that you could do is segment the repository into multiple
> > packfiles yourself, and then clone using rsync or http, rather than
> > using the native Git protocol.
> 
> By the way, it would be nice to talk about fetch / clone
> support for sending (and creating) _multiple_ pack files. Besides
> the situation where we must use more than one packfile because
> of size limits, it would also help clone, as it could send existing
> packs and pack only loose objects (trading perhaps some bandwidth
> for CPU load on the server; think kernel.org).

I've thought about adding that type of protocol extension on
more than one occasion, but have now convinced myself that it is
completely unnecessary.  Well at least until a project has more
than 2^32-1 objects anyway.

The reason is we can send any size packfile over the network; there
is no index sent so there is no limit on how much data we transfer.
We could easily just dump all existing packfiles as-is (just clip
the header/footers and generate our own for the entire stream)
and then send the loose objects on the end.

The client could easily segment that into multiple packfiles
locally using two rules:

  - if the last object was not an OBJ_COMMIT and this object is
  an OBJ_COMMIT, start a new packfile with this object.

  - if adding this object to the current packfile exceeds my local
  filesize threshold, start a new packfile.

The first rule works because we sort objects by type, and commits
appear at the front of a packfile.  So if you see a non-commit
followed by a commit, that's the packfile boundary that the
server had.

The second rule is just common sense.  But I'm not sure the first
rule is even worthwhile; the server's packfile boundaries have no
real interest for the client.


But Linus has already pointed all of this out (more or less) in a
different fork of this thread.  ;-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-10 22:46         ` Linus Torvalds
@ 2007-03-11 21:35           ` Martin Waitz
  0 siblings, 0 replies; 25+ messages in thread
From: Martin Waitz @ 2007-03-11 21:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, Anton Tropashko, git

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

hoi :)

On Sat, Mar 10, 2007 at 02:46:35PM -0800, Linus Torvalds wrote:
> Are you sure you're not just running a 64-bit process?

pretty sure, yes :-)

> 64-bit processes don't need O_LARGEFILE to process files larger than 2GB, 
> since for them, off_t is already 64-bit.

but O_LARGEFILE is a Linux-only thing, right?

> Oh, except we have that 
> 
> 	#define _FILE_OFFSET_BITS 64
> 
> which is just a horrible hack. That's nasty. We should just use 
> O_LARGEFILE rather than depend on some internal glibc thing that works 
> nowhere else.

Well, if I remember correctly the *BSD systems always use 64 bits now;
it's sad that glibc does not do the same out of the box for Linux.
_FILE_OFFSET_BITS is the documented way to get 64-bit file sizes on
glibc, so I think it is the right thing for us (even if that define
is really ugly).

-- 
Martin Waitz

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-11  2:00     ` Shawn O. Pearce
@ 2007-03-12 11:09       ` Jakub Narebski
  2007-03-12 14:24         ` Shawn O. Pearce
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Narebski @ 2007-03-12 11:09 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> Shawn O. Pearce wrote:
>> 
>>> One thing that you could do is segment the repository into multiple
>>> packfiles yourself, and then clone using rsync or http, rather than
>>> using the native Git protocol.
>> 
>> By the way, it would be nice to talk about fetch / clone
>> support for sending (and creating) _multiple_ pack files. Besides
>> the situation where we must use more than one packfile because
>> of size limits, it would also help clone, as it could send existing
>> packs and pack only loose objects (trading perhaps some bandwidth
>> for CPU load on the server; think kernel.org).
> 
> I've thought about adding that type of protocol extension on
> more than one occasion, but have now convinced myself that it is
> completely unnecessary.  Well at least until a project has more
> than 2^32-1 objects anyway.
> 
> The reason is we can send any size packfile over the network; there
> is no index sent so there is no limit on how much data we transfer.
> We could easily just dump all existing packfiles as-is (just clip
> the header/footers and generate our own for the entire stream)
> and then send the loose objects on the end.

But what would happen if a server supporting concatenated packfiles
sends such a stream to an old client? So I think some kind of protocol
extension, or at least a new request / new feature, is needed for that.

Wouldn't it be better to pack loose objects into a separate pack
(and perhaps save it, if some threshold is crossed and we have
write access to the repo), by the way?

> The client could easily segment that into multiple packfiles
> locally using two rules:
> 
>   - if the last object was not an OBJ_COMMIT and this object is
>   an OBJ_COMMIT, start a new packfile with this object.
> 
>   - if adding this object to the current packfile exceeds my local
>   filesize threshold, start a new packfile.
> 
> The first rule works because we sort objects by type, and commits
> appear at the front of a packfile.  So if you see a non-commit
> followed by a commit, that's the packfile boundary that the
> server had.
> 
> The second rule is just common sense.  But I'm not sure the first
> rule is even worthwhile; the server's packfile boundaries have no
> real interest for the client.

Without the first rule, wouldn't the client end up with a strange packfile?
Or would it have to rewrite the pack?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-12 11:09       ` Jakub Narebski
@ 2007-03-12 14:24         ` Shawn O. Pearce
  2007-03-17 13:23           ` Jakub Narebski
  0 siblings, 1 reply; 25+ messages in thread
From: Shawn O. Pearce @ 2007-03-12 14:24 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> But what would happen if a server supporting concatenated packfiles
> sends such a stream to an old client? So I think some kind of protocol
> extension, or at least a new request / new feature, is needed for that.

No, a protocol extension is not required.  The packfile format
is: 12 byte header, objects, 20 byte SHA-1 footer.  When sending
concatenated packfiles to a client the server just needs to:

  - figure out how many objects total will be sent;
  - send its own (new) header with that count;
  - initialize a SHA-1 context and update it with the header;
  - for each packfile to be sent:
    - strip the first 12 bytes of the packfile;
    - send the remaining bytes, except the last 20;
    - update the SHA-1 context with the packfile data;
  - send its own footer with the SHA-1 context.

Very simple.  Even the oldest Git clients (pre multi-ack extension)
would understand that.  That's what's great about the way the
packfile protocol and disk format are organized.  ;-)
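A rough shell sketch of that recipe (assuming bash, GNU head/od/sha1sum, and
two hypothetical input packs p1.pack and p2.pack; the result models the
stream a server would send, which the client would then feed to
git-index-pack):

	objcount() {	# big-endian object count at offset 8 of a packfile
		od -An -tu1 -j8 -N4 "$1" |
		awk '{ print $1*16777216 + $2*65536 + $3*256 + $4 }'
	}
	total=$(( $(objcount p1.pack) + $(objcount p2.pack) ))
	{
		printf 'PACK\000\000\000\002'	# magic + pack version 2
		# 4-byte big-endian total object count
		printf "$(printf '%08x' "$total" | sed 's/../\\x&/g')"
		for p in p1.pack p2.pack
		do
			tail -c +13 "$p" | head -c -20	# drop per-pack header and trailer
		done
	} >body.tmp
	# the new 20-byte trailer is the SHA-1 of everything emitted so far
	{
		cat body.tmp
		printf "$(sha1sum body.tmp | cut -c1-40 | sed 's/../\\x&/g')"
	} >combined.pack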
 
> Wouldn't it be better to pack loose objects into a separate pack
> (and perhaps save it, if some threshold is crossed and we have
> write access to the repo), by the way?

Perhaps.  Interesting food for thought, something nobody has tried
to experiment with.  Currently servers pack to update the fetching
client.  That means they may be sending a mixture of already-packed
(older) objects and loose (newer) objects.  But with the new kept
pack thing in receive-pack it's more likely that things are already
packed on the server, and not loose.  (I suspect most public open
source users are pushing >100 objects when they do push to their
server.)
 
> > The client could easily segment that into multiple packfiles
> > locally using two rules:
> > 
> >   - if the last object was not an OBJ_COMMIT and this object is
> >   an OBJ_COMMIT, start a new packfile with this object.
...
> 
> Without the first rule, wouldn't the client end up with a strange packfile?
> Or would it have to rewrite the pack?

Nope.  We don't care about the order of the objects in a packfile.
Never have.  Never will.  Even in pack v4 where we have special
object types that should only appear once in a packfile, they can
appear at any position within the packfile.  MUCH simpler code.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
@ 2007-03-12 17:39 Anton Tropashko
  2007-03-12 18:40 ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Anton Tropashko @ 2007-03-12 17:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

> For example, if "du -sh" says 8.5GB, it doesn't necessarily mean that 
> there really is 8.5GB of data there.

> It's very likely this did fit in just under 4 GiB of packed data,
> but as you said, without O_LARGEFILE we can't work with it.

.git is 3.5GB according to du -H :)

> As Linus said earlier in this thread; Nico and I are working on
> pushing out the packfile limits, just not fast enough for some users
> needs apparently (sorry about that!).  Troy's patch was rejected

No problem.
You're providing workarounds faster than I can process them :-)

> So the "git repack" actually worked for you? It really shouldn't have 
> worked.

It did not complain. I did not check the exit status, but there was not so
much as a single warning message along the lines of
"index file has overflowed, the kernel will panic shortly, please stand by..."

> Is the server side perhaps 64-bit? If so, the limit ends up being 4GB 
> instead of 2GB, and your 8.5GB project may actually fit.

both server and client are 32 bit.

> If so, we can trivially fix it with the current index file even for a 
> 32-bit machine. The reason we limit pack-files to 2GB on 32-bit machines 

Unfortunately the server machine is managed by IT. I can't install whatever
I want. The client is not, and it's against IT policy to have rogue Linux boxes
on the net ;)

> So, wouldn't the correct fix be to automatically split a pack
> file in two pieces when it would become larger than 2 GB?

Just curious why you don't use something like
PostgreSQL for data storage at this point, but then
I know nothing about git internals :)

Anyhow, I have a patch to apply now and a bash script to hone my
bashing skills on. If you have anything else for me to test just shoot me
an e-mail.

I'm glad I can keep you all busy.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-12 17:39 Anton Tropashko
@ 2007-03-12 18:40 ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-03-12 18:40 UTC (permalink / raw)
  To: Anton Tropashko; +Cc: git



On Mon, 12 Mar 2007, Anton Tropashko wrote:
> 
> > It's very likely this did fit in just under 4 GiB of packed data,
> > but as you said, without O_LARGEFILE we can't work with it.
> 
> .git is 3.5GB according to du -H :)

Ok, that's good.  That means that we really can use git without any major 
issues, and that it's apparently literally only receive-pack that has 
problems.

I didn't even realize that we have

	#define _FILE_OFFSET_BITS 64

in the header file, but not only is that a glibc-specific thing, it also 
won't really even cover all issues.

For example, if a file is opened from the shell (ie we're talking shell 
re-direction etc), that means that since the program that used 
_FILE_OFFSET_BITS wasn't the one opening, it was opened without 
O_LARGEFILE, and as such a write() will hit the LFS 31-bit limit.

That said, I'm not quite seeing why the _FILE_OFFSET_BITS trick doesn't 
help. We don't have any shell redirection in that path.

I just did an "strace -f" on a git clone on x86, and all the git opens 
seemed to use O_LARGEFILE, but that's with a very recent git.

I think you said that you had git-1.4.1 on the client, and I think that 
the _FILE_OFFSET_BITS=64 hack went in after that, and if your client just 
upgrades to the current 1.5.x release, it will all "just work" for you.

> Just curious why you don't use something like
> PostgreSQL for data storage at this point, but then
> I know nothing about git internals :)

I can pretty much guarantee that if we used a "real" database, we'd have

 - really really horrendously bad performance
 - total inability to actually recover from errors.

Other SCM projects have used databases, and it *always* boils down to that. 
Most either die off, or decide to just do their own homegrown database (eg 
switching to FSFS for SVN).

Even database people seem to have figured it out lately: relational 
databases are starting to lose ground to specialized ones. These days you 
can google for something like

	relational specialized database performance

and you'll see real papers that are actually finally being taken seriously 
about how specialized databases often have performance-advantages of 
orders of magnitude. There's a paper (the above will find it, but if you 
add "one size fits all" you'll probably find it even better) that talks 
about benchmarking specialized databases against RDBMS, and they are 
*literally* talking about three and four *orders*of*magnitude* speedups 
(ie not factors of 2 or three, but factors of _seven_hundred_).

In other words, the whole relational database hype is so seventies and 
eighties. People have since figured out that yeah, they are convenient to 
program in if you want to do Visual Basic kind of things, but they really 
are *not* a replacement for good data structures.

So git has ended up writing its own data structures, but git is a lot 
better for it.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
@ 2007-03-13  0:02 Anton Tropashko
  0 siblings, 0 replies; 25+ messages in thread
From: Anton Tropashko @ 2007-03-13  0:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

> probably don't have many deltas either, so I'm hoping that the fact 
> that I only have 5.7GB will approximate your data thanks to it not being 
> compressible).

I made a tarball of the SDK and it's 5.2GB.
I think you do have a good test set.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Errors cloning large repo
  2007-03-12 14:24         ` Shawn O. Pearce
@ 2007-03-17 13:23           ` Jakub Narebski
  0 siblings, 0 replies; 25+ messages in thread
From: Jakub Narebski @ 2007-03-17 13:23 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On Mon, 12 March 2007, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:

>> But what would happen if a server supporting concatenated packfiles
>> sends such a stream to an old client? So I think some kind of protocol
>> extension, or at least a new request / new feature, is needed for that.
> 
> No, a protocol extension is not required.  The packfile format
> is: 12 byte header, objects, 20 byte SHA-1 footer.  When sending
> concatenated packfiles to a client the server just needs to:
> 
>   - figure out how many objects total will be sent;
>   - send its own (new) header with that count;
>   - initialize a SHA-1 context and update it with the header;
>   - for each packfile to be sent:
>     - strip the first 12 bytes of the packfile;
>     - send the remaining bytes, except the last 20;
>     - update the SHA-1 context with the packfile data;
>   - send its own footer with the SHA-1 context.
> 
> Very simple.  Even the oldest Git clients (pre multi-ack extension)
> would understand that.  That's what's great about the way the
> packfile protocol and disk format is organized.  ;-)

It would be a very nice thing to have, if it is backwards compatible.
It would ease the load on the server on clone, even if packs are divided
into a large tight archive pack and perhaps a few more current packs, so
that dumb transports do not need to download everything on [incremental]
fetch.

On fetch... perhaps there should be some configuration variable which
would change the balance between server load and bandwidth used...

And automatically splitting a large pack on the client side would help if,
for example, we have a huge repository (non-compressible binaries) and the
client has a smaller filesystem limit on maximum file size than the server.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-03-17 13:20 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-13  0:02 Errors cloning large repo Anton Tropashko
  -- strict thread matches above, loose matches on Subject: below --
2007-03-12 17:39 Anton Tropashko
2007-03-12 18:40 ` Linus Torvalds
2007-03-10  2:37 Anton Tropashko
2007-03-10  3:07 ` Shawn O. Pearce
2007-03-10  5:54   ` Linus Torvalds
2007-03-10  6:01     ` Shawn O. Pearce
2007-03-10 22:32       ` Martin Waitz
2007-03-10 22:46         ` Linus Torvalds
2007-03-11 21:35           ` Martin Waitz
2007-03-10 10:27   ` Jakub Narebski
2007-03-11  2:00     ` Shawn O. Pearce
2007-03-12 11:09       ` Jakub Narebski
2007-03-12 14:24         ` Shawn O. Pearce
2007-03-17 13:23           ` Jakub Narebski
     [not found]   ` <82B0999F-73E8-494E-8D66-FEEEDA25FB91@adacore.com>
2007-03-10 22:21     ` Linus Torvalds
2007-03-10  5:10 ` Linus Torvalds
2007-03-10  1:21 Anton Tropashko
2007-03-10  1:45 ` Linus Torvalds
2007-03-09 23:48 Anton Tropashko
2007-03-10  0:54 ` Linus Torvalds
2007-03-10  2:03   ` Linus Torvalds
2007-03-10  2:12     ` Junio C Hamano
2007-03-09 19:20 Anton Tropashko
2007-03-09 21:37 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).