git.vger.kernel.org archive mirror
* git-svnimport failed and now git-repack hates me
@ 2007-01-03 23:52 Chris Lee
  2007-01-04  1:59 ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-03 23:52 UTC (permalink / raw)
  To: git

So I'm using git 1.4.1, and I have been experimenting with importing
the KDE sources from Subversion using git-svnimport.

First issue I ran into: On a machine with 4GB of RAM, when I tried to
do a full import, git-svnimport died after 309906 revisions, saying
that it couldn't fork.

Checking `top` and `ps` revealed that there were no git-svnimport
processes doing anything, but all of my 4G of RAM was still marked as
used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
to free all the RAM that the svn import had used up.

Now, after that, I tried doing `git-repack -a` because I wanted to see
how small the packed archive would be (before trying to continue
importing the rest of the revisions. There are at least another 100k
revisions that I should be able to import, eventually.)

The repack finished after about nine hours, but when I try to do a
git-verify-pack on it, it dies with this error message:

error: Packfile
.git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
SHA1 mismatch with itself

I get the same message from git-prune.

Any ideas?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-03 23:52 git-svnimport failed and now git-repack hates me Chris Lee
@ 2007-01-04  1:59 ` Linus Torvalds
  2007-01-04  2:06   ` Shawn O. Pearce
                     ` (5 more replies)
  0 siblings, 6 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-04  1:59 UTC (permalink / raw)
  To: Chris Lee, Junio C Hamano, Shawn Pearce, Sasha Khapyorsky
  Cc: Git Mailing List



On Wed, 3 Jan 2007, Chris Lee wrote:
>
> So I'm using git 1.4.1, and I have been experimenting with importing
> the KDE sources from Subversion using git-svnimport.

As one single _huge_ import? All the sub-projects together? I have to say, 
that sounds pretty horrid.

> First issue I ran into: On a machine with 4GB of RAM, when I tried to
> do a full import, git-svnimport died after 309906 revisions, saying
> that it couldn't fork.
> 
> Checking `top` and `ps` revealed that there were no git-svnimport
> processes doing anything, but all of my 4G of RAM was still marked as
> used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> to free all the RAM that the svn import had used up.

I think that was just all cached, and all ok. The reason you didn't see 
any git-svnimport was that it had died off already, and all your memory 
was just caches. You could just have left it alone, and the kernel would 
have started re-using the memory for other things even without any 
"drop_caches". 

But what you did there didn't make anything worse; it just likely had 
no real impact.

However, it does sound like git-svnimport probably acts like git-cvsimport 
used to, and just keeps too much in memory - so it's never going to act 
really nicely..

It also looks like git-svnimport never repacks the repo, which is 
absolutely horrible for performance on all levels. The CVS importer 
repacks every one thousand commits or something like that.

> Now, after that, I tried doing `git-repack -a` because I wanted to see
> how small the packed archive would be (before trying to continue
> importing the rest of the revisions. There are at least another 100k
> revisions that I should be able to import, eventually.)

I suspect you'd have been better off just re-starting, and using something 
like

	while :
	do
		git svnimport -l 1000 <...>
		.. figure out some way to decide if it's all done ..
		git repack -d
	done

which would make svnimport act a bit more sanely, and repack 
incrementally. That should make both the import much faster, _and_ avoid 
any insane big repack at the end (well, you'd still want to do a "git 
repack -a -d" at the end to turn the many smaller packs into a bigger one, 
but it would be nicer).

However, I don't know what the proper magic is for svnimport to do that 
sane "do it in chunks and tell when you're all done". Or even better - to 
just make it repack properly and not keep everything in memory.

> The repack finished after about nine hours, but when I try to do a
> git-verify-pack on it, it dies with this error message:
> 
> error: Packfile
> .git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
> SHA1 mismatch with itself

That sounds suspiciously like the bug we had in our POWER SHA1 
implementation that would generate the wrong SHA1 for any pack-file that 
was over 512MB in size, due to an overflow in 32 bits (SHA1 does some 
counting in _bits_, so 512MB is 4G _bits_).

Now, I assume you're not on POWER (and we fixed that bug anyway - and I 
think long before 1.4.1 too), but I could easily imagine the same bug in 
some other SHA1 implementation (or perhaps _another_ overflow at the 1GB 
or 2GB mark..). I assume that the pack-file you had was something horrid..

I hope this is with a 64-bit kernel and a 64-bit user space? That should 
limit _some_ of the issues. But I would still not be surprised if your 
SHA1 libraries had some 32-bit ("unsigned int") or 31-bit ("int") limits 
in them somewhere - very few people do SHA1's over huge areas, and even 
when you do SHA1 on something like a DVD image (which is easily over any 
4GB limit), that tends to be done as many smaller calls to the SHA1 
library routines.

Junio - I suspect "pack-check.c" really shouldn't try to do it as one 
single humungous "SHA1_Update()" call. It showed one bug on PPC, I 
wouldn't be surprised if it's implicated now on some other architecture. 

Shawn - does the pack-file-windowing thing already change that? I'm too 
lazy to check..

As to who knows how to fix git-svnimport to do something saner, I have no 
clue.. Sasha seems to have touched it last. Sasha?

		Linus


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
@ 2007-01-04  2:06   ` Shawn O. Pearce
  2007-01-04  2:35     ` Shawn O. Pearce
  2007-01-04  2:16   ` Chris Lee
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-04  2:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Lee, Junio C Hamano, Sasha Khapyorsky, Git Mailing List

Linus Torvalds <torvalds@osdl.org> wrote:
> Junio - I suspect "pack-check.c" really shouldn't try to do it as one 
> single humungous "SHA1_Update()" call. It showed one bug on PPC, I 
> wouldn't be surprised if it's implicated now on some other architecture. 

It used to do it as one big SHA1_Update() call...
 
> Shawn - does the pack-file-windowing thing already change that? I'm too 
> lazy to check..

But with the mmap window thing in `next` it does it in window
units only.  Which the user could configure to be huge, or could
configure to be sane.  The default when using mmap() is 32 MiB;
1 MiB when using pread() and git_mmap().

-- 
Shawn.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
  2007-01-04  2:06   ` Shawn O. Pearce
@ 2007-01-04  2:16   ` Chris Lee
  2007-01-04 17:56     ` Chris Lee
  2007-01-04  2:33   ` Eric Wong
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04  2:16 UTC (permalink / raw)
  To: Linus Torvalds

On 1/3/07, Linus Torvalds <torvalds@osdl.org> wrote:
> > So I'm using git 1.4.1, and I have been experimenting with importing
> > the KDE sources from Subversion using git-svnimport.
>
> As one single _huge_ import? All the sub-projects together? I have to say,
> that sounds pretty horrid.

Unfortunately, that's how the KDE repo is organized. (I tried arguing
against this when they were going to do the original import, but I
lost the argument.) And git-svnimport doesn't appear to have any sort
of method for splitting a gigantic svn repo into several smaller git
repos.

> > First issue I ran into: On a machine with 4GB of RAM, when I tried to
> > do a full import, git-svnimport died after 309906 revisions, saying
> > that it couldn't fork.
> >
> > Checking `top` and `ps` revealed that there were no git-svnimport
> > processes doing anything, but all of my 4G of RAM was still marked as
> > used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> > to free all the RAM that the svn import had used up.
>
> I think that was just all cached, and all ok. The reason you didn't see
> any git-svnimport was that it had died off already, and all your memory
> was just caches. You could just have left it alone, and the kernel would
> have started re-using the memory for other things even without any
> "drop_caches".
>
> But what you did there didn't make anything worse; it just likely had
> no real impact.

I got the tip about drop_caches from davej. Normally, when a process
taking up a huge amount of memory exits, it shows a bunch of free
memory in `top` and friends. I was a little bit surprised when that
didn't happen this time.

> However, it does sound like git-svnimport probably acts like git-cvsimport
> used to, and just keeps too much in memory - so it's never going to act
> really nicely..
>
> It also looks like git-svnimport never repacks the repo, which is
> absolutely horrible for performance on all levels. The CVS importer
> repacks every one thousand commits or something like that.

Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
having it automatically repack every thousand revisions or so would
probably be a pretty big win.

> > Now, after that, I tried doing `git-repack -a` because I wanted to see
> > how small the packed archive would be (before trying to continue
> > importing the rest of the revisions. There are at least another 100k
> > revisions that I should be able to import, eventually.)
>
> I suspect you'd have been better off just re-starting, and using something
> like
>
>         while :
>         do
>                 git svnimport -l 1000 <...>
>                 .. figure out some way to decide if it's all done ..
>                 git repack -d
>         done
>
> which would make svnimport act a bit  more sanely, and repack
> incrementally. That should make both the import much faster, _and_ avoid
> any insane big repack at the end (well, you'd still want to do a "git
> repack -a -d" at the end to turn the many smaller packs into a bigger one,
> but it would be nicer).
>
> However, I don't know what the proper magic is for svnimport to do that
> sane "do it in chunks and tell when you're all done". Or even better - to
> just make it repack properly and not keep everything in memory.

You can pass limits to svnimport to give it a revision to start at and
another one to end at, so that wouldn't be too bad - I was thinking
about working around it like that (so that I don't have to go poking
around in the Perl code behind the svn importer).

By default, if I had, say, one pack with the first 1000 revisions, and
I imported another 1000, running 'git-repack' on its own would leave
the first pack alone and create a new pack with just the second 1000
revisions, right?

> > The repack finished after about nine hours, but when I try to do a
> > git-verify-pack on it, it dies with this error message:
> >
> > error: Packfile
> > .git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
> > SHA1 mismatch with itself
>
> That sounds suspiciously like the bug we had in our POWER SHA1
> implementation that would generate the wrong SHA1 for any pack-file that
> was over 512MB in size, due to an overflow in 32 bits (SHA1 does some
> counting in _bits_, so 512MB is 4G _bits_).
>
> Now, I assume you're not on POWER (and we fixed that bug anyway - and I
> think long before 1.4.1 too), but I could easily imagine the same bug in
> some other SHA1 implementation (or perhaps _another_ overflow at the 1GB
> or 2GB mark..). I assume that the pack-file you had was something horrid..
>
> I hope this is with a 64-bit kernel and a 64-bit user space? That should
> limit _some_ of the issues. But I would still not be surprised if your
> SHA1 libraries had some 32-bit ("unsigned int") or 31-bit ("int") limits
> in them somewhere - very few people do SHA1's over huge areas, and even
> when you do SHA1 on something like a DVD image (which is easily over any
> 4GB limit), that tends to be done as many smaller calls to the SHA1
> library routines.

This is on a dual-CPU dual-core Opteron, running the AMD64 variant of
Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The
pack-file was around 2.3GB.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
  2007-01-04  2:06   ` Shawn O. Pearce
  2007-01-04  2:16   ` Chris Lee
@ 2007-01-04  2:33   ` Eric Wong
  2007-01-04  2:40     ` Randal L. Schwartz
  2007-01-05  2:09     ` [PATCH] git-svn: make --repack work consistently between fetch and multi-fetch Eric Wong
  2007-01-04  6:25   ` git-svnimport failed and now git-repack hates me Junio C Hamano
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Eric Wong @ 2007-01-04  2:33 UTC (permalink / raw)
  To: Chris Lee
  Cc: Linus Torvalds, Junio C Hamano, Shawn Pearce, Sasha Khapyorsky,
	Randal L. Schwartz, Git Mailing List

Linus Torvalds <torvalds@osdl.org> wrote:
> On Wed, 3 Jan 2007, Chris Lee wrote:
> > First issue I ran into: On a machine with 4GB of RAM, when I tried to
> > do a full import, git-svnimport died after 309906 revisions, saying
> > that it couldn't fork.

Managing memory with the Perl SVN libraries has been very painful in my
experience.

Part of it is Perl, which (as far as I know) never frees allocated
memory back to the OS (although Perl can reuse the allocated memory for
other things).  I'm CC-ing the resident Perl guru on this...

I'm also fairly certain that most higher-level languages have this
problem.

> I suspect you'd have been better off just re-starting, and using something 
> like
> 
> 	while :
> 	do
> 		git svnimport -l 1000 <...>
> 		.. figure out some way to decide if it's all done ..
> 		git repack -d
> 	done

> However, I don't know what the proper magic is for svnimport to do that 
> sane "do it in chunks and tell when you're all done". Or even better - to 
> just make it repack properly and not keep everything in memory.

<shameless self-promotion>
	git-svn already does this chunking internally

	Just set the repack interval to something smaller than 1000;
	(--repack=100) if you experience timeouts.
</shameless self-promotion>

-- 
Eric Wong


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:06   ` Shawn O. Pearce
@ 2007-01-04  2:35     ` Shawn O. Pearce
  2007-01-04  2:36       ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-04  2:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Lee, Junio C Hamano, Sasha Khapyorsky, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> > Junio - I suspect "pack-check.c" really shouldn't try to do it as one 
> > single humungous "SHA1_Update()" call. It showed one bug on PPC, I 
> > wouldn't be surprised if it's implicated now on some other architecture. 
> 
> It used to do it as one big SHA1_Update() call...
>  
> > Shawn - does the pack-file-windowing thing already change that? I'm too 
> > lazy to check..
> 
> But with the mmap window thing in `next` it does it in window
> units only.  Which the user could configure to be huge, or could
> configure to be sane.  The default when using mmap() is 32 MiB;
> 1 MiB when using pread() and git_mmap().

I should also point out that my git-fastimport hack that we used
on the huge Mozilla import may be helpful here.  It's _very_ fast
as it goes right to a pack file, but there's no SVN frontend for
it at this time.

-- 
Shawn.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:35     ` Shawn O. Pearce
@ 2007-01-04  2:36       ` Chris Lee
  2007-01-04  2:45         ` Shawn O. Pearce
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04  2:36 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Linus Torvalds, Junio C Hamano, Sasha Khapyorsky,
	Git Mailing List

On 1/3/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> I should also point out that my git-fastimport hack that we used
> on the huge Mozilla import may be helpful here.  It's _very_ fast
> as it goes right to a pack file, but there's no SVN frontend for
> it at this time.

I would be *really* interested in playing with that. Where do I get it?


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:33   ` Eric Wong
@ 2007-01-04  2:40     ` Randal L. Schwartz
  2007-01-04  3:13       ` Eric Wong
  2007-01-05  2:09     ` [PATCH] git-svn: make --repack work consistently between fetch and multi-fetch Eric Wong
  1 sibling, 1 reply; 55+ messages in thread
From: Randal L. Schwartz @ 2007-01-04  2:40 UTC (permalink / raw)
  To: Eric Wong
  Cc: Chris Lee, Linus Torvalds, Junio C Hamano, Shawn Pearce,
	Sasha Khapyorsky, Git Mailing List

>>>>> "Eric" == Eric Wong <normalperson@yhbt.net> writes:

Eric> Part of it is Perl, which (as far as I know) never frees allocated
Eric> memory back to the OS (although Perl can reuse the allocated memory for
Eric> other things).

It does on Linux, of all things.  That's because Linux has a smarter
malloc/free that uses mmap(2) for the large chunks.  On Linux, Perl memory
size can apparently grow and shrink nicely.  The "old school" advice about
Perl comes from sbrk(2)-driven malloc/free.

Try:

        $x[1e6] = "0";
        sleep 10; # do a ps here
        @x = ();
        sleep 30; # do a ps here

and watch the process on Linux.  If I'm right, this should show a large
process, then a smaller one.

If you're getting a growing process though, you probably have a circular data
reference.  Maybe you have a tree with backpointers, and those backpointers
should have been weakened?

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:36       ` Chris Lee
@ 2007-01-04  2:45         ` Shawn O. Pearce
  2007-01-04  2:53           ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-04  2:45 UTC (permalink / raw)
  To: Chris Lee
  Cc: Linus Torvalds, Junio C Hamano, Sasha Khapyorsky,
	Git Mailing List

Chris Lee <chris133@gmail.com> wrote:
> On 1/3/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >I should also point out that my git-fastimport hack that we used
> >on the huge Mozilla import may be helpful here.  It's _very_ fast
> >as it goes right to a pack file, but there's no SVN frontend for
> >it at this time.
> 
> I would be *really* interested in playing with that. Where do I get it?

It's a fork of git.git on repo.or.cz; the gitweb can be seen here:

  http://repo.or.cz/w/git/fastimport.git

the clone URLs are:

  git://repo.or.cz/git/fastimport.git
  http://repo.or.cz/r/git/fastimport.git

The entire code is in fast-import.c.  The input stream it consumes
comes in on STDIN and is documented in a large comment at the top
of the file.

All that's needed is to get data from SVN in a way that it can be
fed into git-fastimport.

-- 
Shawn.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:45         ` Shawn O. Pearce
@ 2007-01-04  2:53           ` Chris Lee
  2007-01-04  2:57             ` Shawn O. Pearce
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04  2:53 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Linus Torvalds, Junio C Hamano, Sasha Khapyorsky,
	Git Mailing List

On 1/3/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Chris Lee <chris133@gmail.com> wrote:
> > On 1/3/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > >I should also point out that my git-fastimport hack that we used
> > >on the huge Mozilla import may be helpful here.  It's _very_ fast
> > >as it goes right to a pack file, but there's no SVN frontend for
> > >it at this time.
> >
> > I would be *really* interested in playing with that. Where do I get it?
>
> It's a fork of git.git on repo.or.cz; the gitweb can be seen here:
>
>   http://repo.or.cz/w/git/fastimport.git
>
> the clone url is:
>
>   git://repo.or.cz/git/fastimport.git
>   http://repo.or.cz/r/git/fastimport.git
>
> The entire code is in fast-import.c.  The input stream it consumes
> comes in on STDIN and is documented in a large comment at the top
> of the file.

Neat. How do I do that?


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:53           ` Chris Lee
@ 2007-01-04  2:57             ` Shawn O. Pearce
  2007-01-04  2:58               ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-04  2:57 UTC (permalink / raw)
  To: Chris Lee; +Cc: Git Mailing List

[cc: list modified to remove folks who probably aren't immediately
 interested in git-fastimport]

Chris Lee <chris133@gmail.com> wrote:
> On 1/3/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >the clone url is:
> >
> >  git://repo.or.cz/git/fastimport.git
> >  http://repo.or.cz/r/git/fastimport.git
> >
> >The entire code is in fast-import.c.  The input stream it consumes
> >comes in on STDIN and is documented in a large comment at the top
> >of the file.
> 
> Neat. How do I do that?

I'm not sure I understand the question...

-- 
Shawn.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:57             ` Shawn O. Pearce
@ 2007-01-04  2:58               ` Chris Lee
  2007-01-04  3:05                 ` Shawn O. Pearce
  2007-01-04  3:06                 ` Chris Lee
  0 siblings, 2 replies; 55+ messages in thread
From: Chris Lee @ 2007-01-04  2:58 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Git Mailing List

Uh... somehow, it lost this part:

> All that's needed is to get data from SVN in a way that it can be
> fed into git-fastimport.

That's what I meant - I assume that someone already has the
svn-repo-to-gfi piece working? Where's that available from?


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:58               ` Chris Lee
@ 2007-01-04  3:05                 ` Shawn O. Pearce
  2007-01-04  3:06                 ` Chris Lee
  1 sibling, 0 replies; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-04  3:05 UTC (permalink / raw)
  To: Chris Lee; +Cc: Git Mailing List

Chris Lee <chris133@gmail.com> wrote:
> Uh... somehow, it lost this part:
> 
> >All that's needed is to get data from SVN in a way that it can be
> >fed into git-fastimport.
> 
> That's what I meant - I assume that someone already has the
> svn-repo-to-gfi piece working? Where's that available from?

No.  That hasn't been written.

In theory someone could take the SVN dump library (it's a chunk of
C code which parses SVN dump files) and write a tool which translates
it into git-fastimport.

One could also use the SVN client library to suck data from SVN
and pump it into git-fastimport.

Jon Smirl attempted to create a CVS-->git-fastimport program in
Python by starting with the cvs2svn codebase, but that doesn't
do anything about importing *from* SVN.  Jon was able to import
the entire Mozilla CVS repository (250k commits, about 3 GiB
input) in 2 hours using his hacked up cvs2svn and git-fastimport.
The resulting pack was ~900 MiB.  He recompressed that using
`git repack -a -d --window=50 --depth=1000` (which is insane) in
about an hour.

-- 
Shawn.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:58               ` Chris Lee
  2007-01-04  3:05                 ` Shawn O. Pearce
@ 2007-01-04  3:06                 ` Chris Lee
  1 sibling, 0 replies; 55+ messages in thread
From: Chris Lee @ 2007-01-04  3:06 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Git Mailing List

On 1/3/07, Chris Lee <chris133@gmail.com> wrote:
> Uh... somehow, it lost this part:
>
> > All that's needed is to get data from SVN in a way that it can be
> > fed into git-fastimport.
>
> That's what I meant - I assume that someone already has the
> svn-repo-to-gfi piece working? Where's that available from?

Right, and I'm an idiot! Awesome.

I obviously didn't comprehend the part where you wrote:

> I should also point out that my git-fastimport hack that we used
> on the huge Mozilla import may be helpful here.  It's _very_ fast
> as it goes right to a pack file, but there's no SVN frontend for
> it at this time.

Anyway. Thanks for the pointers, I'll see if I can't hack something up.


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:40     ` Randal L. Schwartz
@ 2007-01-04  3:13       ` Eric Wong
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Wong @ 2007-01-04  3:13 UTC (permalink / raw)
  To: Randal L. Schwartz
  Cc: Chris Lee, Linus Torvalds, Junio C Hamano, Shawn Pearce,
	Sasha Khapyorsky, Git Mailing List

"Randal L. Schwartz" <merlyn@stonehenge.com> wrote:
> >>>>> "Eric" == Eric Wong <normalperson@yhbt.net> writes:
> 
> Eric> Part of it is Perl, which (as far as I know) never frees allocated
> Eric> memory back to the OS (although Perl can reuse the allocated memory for
> Eric> other things).
> 
> It does on Linux, of all things.  That's because Linux has a smarter
> malloc/free that uses mmap(2) for the large chunks.  On Linux, Perl memory
> size can apparently grow and shrink nicely.  The "old school" advice about
> Perl comes from sbrk(2)-driven malloc/free.
> 
> Try:
> 
>         $x[1e6] = "0";
>         sleep 10; # do a ps here
>         @x = ();
>         sleep 30; # do a ps here
> 
> and watch the process on Linux.  If I'm right, this should show a large
> process,  then a smaller one.

Nope, not happening to me.  I'm using Perl 5.8.8-7 and glibc 2.3.6.ds1-8
on a Debian Etch machine.  The kernel is a vanilla 2.6.18.1 from
kernel.org.

strace shows an mmap2 call, but no corresponding munmap.  I've added a
sleep loop to the end of the above program and had it print
something every 10 seconds; but so far, there's still no munmap.

while (1) {
        print "hi\n" if ((time % 10) == 0);
        sleep 1;
}

Trying to allocate a bigger chunk (1e7) doesn't show anything different,
either.  I've also conducted similar experiments with Ruby in the past
and noticed the same things...

-- 
Eric Wong


* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2007-01-04  2:33   ` Eric Wong
@ 2007-01-04  6:25   ` Junio C Hamano
  2007-01-04  7:26     ` [PATCH] pack-check.c::verify_packfile(): don't run SHA-1 update on huge data Junio C Hamano
  2007-01-04 17:58     ` git-svnimport failed and now git-repack hates me Chris Lee
  2007-01-04 19:24   ` Chris Lee
  2007-01-04 21:31   ` Sasha Khapyorsky
  5 siblings, 2 replies; 55+ messages in thread
From: Junio C Hamano @ 2007-01-04  6:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Lee, Shawn Pearce, Sasha Khapyorsky, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> On Wed, 3 Jan 2007, Chris Lee wrote:
>>
>> So I'm using git 1.4.1, and I have been experimenting with importing
>> the KDE sources from Subversion using git-svnimport.
>
> As one single _huge_ import? All the sub-projects together? I have to say, 
> that sounds pretty horrid.

Thanks -- you said everything I should have said on this issue
while I was in bed ;-).

> Junio - I suspect "pack-check.c" really shouldn't try to do it as one 
> single humungous "SHA1_Update()" call. It showed one bug on PPC, I 
> wouldn't be surprised if it's implicated now on some other architecture. 

If Chris still has that huge .pack & .idx pair, it would be a
very good guinea pig to try a few things on, assuming that the
problem is that pack-check.c feeds a huge blob to the SHA-1
function in a single call.

 (1) Apply the attached patch on top of "master" (the patch
     should apply to 1.4.1 almost cleanly as well, except that
     we have hashcmp(a,b) instead of memcmp(a,b,20) since then),
     and see what it says about the packfile.  If your suspicion
     is correct, it should complain about your SHA-1
     implementation.

 (2) Try tip of "next" to see if its verify-pack passes the
     check.  Again, if your suspicion is correct, it should, since it
     uses Shawn's sliding mmap() stuff that will not feed the
     whole pack in one go.

 (3) I suspect that the tip of "master" should work except
     verify-pack.  It may be interesting to see how well the tip
     of "master" and "next" performs on the resulting huge pack
     (say, "time git log -p HEAD >/dev/null").  I am hoping this
     would be another datapoint to judge the runtime penalty of
     Shawn's sliding mmap() in "next" -- I suspect the penalty
     is either negligible or even negative.

diff --git a/pack-check.c b/pack-check.c
index c0caaee..738a0c5 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -29,6 +29,28 @@ static int verify_packfile(struct packed_git *p)
 	pack_base = p->pack_base;
 	SHA1_Update(&ctx, pack_base, pack_size - 20);
 	SHA1_Final(sha1, &ctx);
+
+	if (1) {
+		SHA_CTX another;
+		unsigned char *data = p->pack_base;
+		unsigned long size = pack_size - 20;
+		const unsigned long batchsize = (1u << 20);
+		unsigned char another_sha1[20];
+
+		SHA1_Init(&another);
+		while (size) {
+			unsigned long batch = size;
+			if (batchsize < batch)
+				batch = batchsize;
+			SHA1_Update(&another, data, batch);
+			size -= batch;
+			data += batch;
+		}
+		SHA1_Final(another_sha1, &another);
+		if (hashcmp(sha1, another_sha1))
+			die("Your SHA-1 implementation cannot hash %lu bytes correctly at once", pack_size - 20);
+	}
+
 	if (hashcmp(sha1, (unsigned char *)pack_base + pack_size - 20))
 		return error("Packfile %s SHA1 mismatch with itself",
 			     p->pack_name);


* [PATCH] pack-check.c::verify_packfile(): don't run SHA-1 update on huge data
  2007-01-04  6:25   ` git-svnimport failed and now git-repack hates me Junio C Hamano
@ 2007-01-04  7:26     ` Junio C Hamano
  2007-01-04 17:58     ` git-svnimport failed and now git-repack hates me Chris Lee
  1 sibling, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2007-01-04  7:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Lee, Shawn Pearce, Sasha Khapyorsky, Git Mailing List

Running the SHA1_Update() on the whole packfile in a single call
revealed an overflow problem we had in the SHA-1 implementation
on POWER architecture some time ago, which was fixed with commit
b47f509b (June 19, 2006).  Other SHA-1 implementations may have
a similar problem.

The sliding mmap() series already makes chunked calls to
SHA1_Update(), so this patch itself will become moot when it
graduates to "master", but in the meantime, run the hash
function in smaller chunks to prevent possible future problems.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 * Chris, if you have a chance could you try this on the huge
   pack you had trouble with?

   Also, whose SHA-1 implementation are you using, if this indeed
   is the problem, I wonder?

 pack-check.c |   20 +++++++++++++++-----
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/pack-check.c b/pack-check.c
index c0caaee..8e123b7 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -1,16 +1,18 @@
 #include "cache.h"
 #include "pack.h"
 
+#define BATCH (1u<<20)
+
 static int verify_packfile(struct packed_git *p)
 {
 	unsigned long index_size = p->index_size;
 	void *index_base = p->index_base;
 	SHA_CTX ctx;
 	unsigned char sha1[20];
-	unsigned long pack_size = p->pack_size;
-	void *pack_base;
 	struct pack_header *hdr;
 	int nr_objects, err, i;
+	unsigned char *packdata;
+	unsigned long datasize;
 
 	/* Header consistency check */
 	hdr = p->pack_base;
@@ -25,11 +27,19 @@ static int verify_packfile(struct packed_git *p)
 			     "while idx size expects %d", nr_objects,
 			     num_packed_objects(p));
 
+	/* Check integrity of pack data with its SHA-1 checksum */
 	SHA1_Init(&ctx);
-	pack_base = p->pack_base;
-	SHA1_Update(&ctx, pack_base, pack_size - 20);
+	packdata = p->pack_base;
+	datasize = p->pack_size - 20;
+	while (datasize) {
+		unsigned long batch = (datasize < BATCH) ? datasize : BATCH;
+		SHA1_Update(&ctx, packdata, batch);
+		datasize -= batch;
+		packdata += batch;
+	}
 	SHA1_Final(sha1, &ctx);
-	if (hashcmp(sha1, (unsigned char *)pack_base + pack_size - 20))
+
+	if (hashcmp(sha1, (unsigned char *)(p->pack_base) + p->pack_size - 20))
 		return error("Packfile %s SHA1 mismatch with itself",
 			     p->pack_name);
 	if (hashcmp(sha1, (unsigned char *)index_base + index_size - 40))

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  2:16   ` Chris Lee
@ 2007-01-04 17:56     ` Chris Lee
  2007-01-04 18:30       ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04 17:56 UTC (permalink / raw)
  To: Git Mailing List

Accidentally sent this to just Linus instead of the list...

On 1/3/07, Linus Torvalds <torvalds@osdl.org> wrote:
> > So I'm using git 1.4.1, and I have been experimenting with importing
> > the KDE sources from Subversion using git-svnimport.
>
> As one single _huge_ import? All the sub-projects together? I have to say,
> that sounds pretty horrid.

Unfortunately, that's how the KDE repo is organized. (I tried arguing
against this when they were going to do the original import, but I
lost the argument.) And git-svnimport doesn't appear to have any sort
of method for splitting a gigantic svn repo into several smaller git
repos.

> > First issue I ran into: On a machine with 4GB of RAM, when I tried to
> > do a full import, git-svnimport died after 309906 revisions, saying
> > that it couldn't fork.
> >
> > Checking `top` and `ps` revealed that there were no git-svnimport
> > processes doing anything, but all of my 4G of RAM was still marked as
> > used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> > to free all the RAM that the svn import had used up.
>
> I think that was just all cached, and all ok. The reason you didn't see
> any git-svnimport was that it had died off already, and all your memory
> was just caches. You could just have left it alone, and the kernel would
> have started re-using the memory for other things even without any
> "drop_caches".
>
> But what you did there didn't make anything worse; it just likely had
> no real impact.

I got the tip about drop_caches from davej. Normally, when a process
taking up a huge amount of memory exits, it shows a bunch of free
memory in `top` and friends. I was a little bit surprised when that
didn't happen this time.

> However, it does sound like git-svnimport probably acts like git-cvsimport
> used to, and just keeps too much in memory - so it's never going to act
> really nicely..
>
> It also looks like git-svnimport never repacks the repo, which is
> absolutely horrible for performance on all levels. The CVS importer
> repacks every one thousand commits or something like that.

Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
having it automatically repack every thousand revisions or so would
probably be a pretty big win.

> > Now, after that, I tried doing `git-repack -a` because I wanted to see
> > how small the packed archive would be (before trying to continue
> > importing the rest of the revisions. There are at least another 100k
> > revisions that I should be able to import, eventually.)
>
> I suspect you'd have been better off just re-starting, and using something
> like
>
>         while :
>         do
>                 git svnimport -l 1000 <...>
>                 .. figure out some way to decide if it's all done ..
>                 git repack -d
>         done
>
> which would make svnimport act a bit more sanely, and repack
> incrementally. That should make both the import much faster, _and_ avoid
> any insane big repack at the end (well, you'd still want to do a "git
> repack -a -d" at the end to turn the many smaller packs into a bigger one,
> but it would be nicer).
>
> However, I don't know what the proper magic is for svnimport to do that
> sane "do it in chunks and tell when you're all done". Or even better - to
> just make it repack properly and not keep everything in memory.

You can pass limits to svnimport to give it a revision to start at and
another one to end at, so that wouldn't be too bad - I was thinking
about working around it like that (so that i don't have to go poking
around in the Perl code behind the svn importer).

By default, if I had, say, one pack with the first 1000 revisions, and
I imported another 1000, running 'git-repack' on its own would leave
the first pack alone and create a new pack with just the second 1000
revisions, right?
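
A driver loop along those lines can be sketched in shell. This is
hypothetical: only the -l (max revision) flag from git-svnimport's usage
is assumed, the URL and total revision count are placeholders, and
detecting the true end of the SVN history is left out -- the loop just
echoes the commands it would run:

```shell
# Hypothetical chunked-import driver: import 1000 revisions at a time,
# repacking after each window.  git-svnimport resumes where it left off,
# so raising -l each pass advances the import.  Commands are echoed
# rather than executed so the loop itself can be tested standalone.
last=5000        # placeholder: the real total comes from the SVN server
step=1000
rev=1
while [ "$rev" -le "$last" ]; do
	end=$((rev + step - 1))
	[ "$end" -gt "$last" ] && end=$last
	echo "git-svnimport -l $end http://svn.example.org/repo"
	echo "git-repack -d"
	rev=$((end + 1))
done
echo "git-repack -a -d"   # one full repack at the very end
```

A real run would replace the echo lines with the commands themselves and
derive $last from the server's latest revision.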

> > The repack finished after about nine hours, but when I try to do a
> > git-verify-pack on it, it dies with this error message:
> >
> > error: Packfile
> > .git/objects/pack/pack-540263fe66ab9398cc796f000d52531a5c6f3df3.pack
> > SHA1 mismatch with itself
>
> That sounds suspiciously like the bug we had in our POWER SHA-1
> implementation that would generate the wrong SHA1 for any pack-file that
> was over 512MB in size, due to an overflow in 32 bits (SHA1 does some
> counting in _bits_, so 512MB is 4G _bits_),
>
> Now, I assume you're not on POWER (and we fixed that bug anyway - and I
> think long before 1.4.1 too), but I could easily imagine the same bug in
> some other SHA1 implementation (or perhaps _another_ overflow at the 1GB
> or 2GB mark..). I assume that the pack-file you had was something horrid..
>
> I hope this is with a 64-bit kernel and a 64-bit user space? That should
> limit _some_ of the issues. But I would still not be surprised if your
> SHA1 libraries had some 32-bit ("unsigned int") or 31-bit ("int") limits
> in them somewhere - very few people do SHA1's over huge areas, and even
> when you do SHA1 on something like a DVD image (which is easily over any
> 4GB limit), that tends to be done as many smaller calls to the SHA1
> library routines.

This is on a dual-CPU dual-core Opteron, running the AMD64 variant of
Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The
pack-file was around 2.3GB.
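
As an aside, the 512MB figure Linus quotes falls straight out of the
arithmetic: SHA-1 tracks message length in bits, and 512MB is exactly
2^32 bits, so an implementation that keeps that count in a 32-bit
integer wraps to zero right there. A quick shell check of the numbers
(this illustrates the arithmetic only, not any particular SHA-1
library):

```shell
# 512 MB expressed in bits exactly fills a 32-bit counter.
bytes=$((512 * 1024 * 1024))
bits=$((bytes * 8))
echo "512MB = $bits bits"                      # 4294967296, i.e. 2^32
echo "low 32 bits: $((bits & 0xFFFFFFFF))"     # wraps to 0
```

An overflow at the 1GB or 2GB mark, as speculated above, would be the
analogous wrap in a byte count kept in a 31-bit "int" or 32-bit
"unsigned int".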

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  6:25   ` git-svnimport failed and now git-repack hates me Junio C Hamano
  2007-01-04  7:26     ` [PATCH] pack-check.c::verify_packfile(): don't run SHA-1 update on huge data Junio C Hamano
@ 2007-01-04 17:58     ` Chris Lee
  2007-01-04 20:22       ` Junio C Hamano
  1 sibling, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04 17:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Shawn Pearce, Sasha Khapyorsky, Git Mailing List

> If Chris still has that huge .pack & .idx pair, it would be a
> very good guinea pig to try a few things on, assuming that this
> problem is that the pack-check.c feeds a huge blob to SHA-1
> function with a single call.

I do not still have it, but I can pretty easily regenerate it. Should
have it again in another nine hours or so. :)

>  (1) Apply the attached patch on top of "master" (the patch
>      should apply to 1.4.1 almost cleanly as well, except that
>      we have hashcmp(a,b) instead of memcmp(a,b,20) since then),
>      and see what it says about the packfile.  If your suspicion
>      is correct, it should complain about your SHA-1
>      implementation.
>
>  (2) Try tip of "next" to see if its verify-pack passes the
>      check.  Again, if your suspicion is correct, it should, since it
>      uses Shawn's sliding mmap() stuff that will not feed the
>      whole pack in one go.
>
>  (3) I suspect that the tip of "master" should work except
>      verify-pack.  It may be interesting to see how well the tip
>      of "master" and "next" performs on the resulting huge pack
>      (say, "time git log -p HEAD >/dev/null").  I am hoping this
>      would be another datapoint to judge the runtime penalty of
>      Shawn's sliding mmap() in "next" -- I suspect the penalty
>      is either negligible or even negative.

I'll try all of this after the pack is regenerated. Thanks!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 17:56     ` Chris Lee
@ 2007-01-04 18:30       ` Linus Torvalds
  2007-01-04 18:54         ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2007-01-04 18:30 UTC (permalink / raw)
  To: Chris Lee; +Cc: Git Mailing List



On Thu, 4 Jan 2007, Chris Lee wrote:
>
> Unfortunately, that's how the KDE repo is organized. (I tried arguing
> against this when they were going to do the original import, but I
> lost the argument.) And git-svnimport doesn't appear to have any sort
> of method for splitting a gigantic svn repo into several smaller git
> repos.

Well, the good news is, I think we could probably split it up from within 
git. It's not fundamentally hard, although it is pretty damn expensive 
(and it would require the subproject support to do really well).

So ignore that issue for now. I'd love to see the end result, if only 
because it sounds like you have a test-case for git that is four times 
bigger than the mozilla archive - even if it's just because of some really 
really stupid design decisions from the KDE SVN maintainers ;)

(But I would actually expect that KDE SVN uses SVN subprojects, so 
hopefully it's not _really_ one big repository. Of course, I don't know if 
SVN really does subprojects or how well it does them, so that's just a 
total guess).

The real problem with a SVN import is that I think SVN doesn't do merges 
right, so you can't import merge history properly (well, you can, if you 
decide that "properly" really means "SVN can't merge, so we can't really 
show it as merges in git either").

I think both git-svn and git-svnimport can _guess_ about merges, but it's 
just a heuristic, afaik. Whether it's a good one, I don't know.

> Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
> having it automatically repack every thousand revisions or so would
> probably be a pretty big win.

That, or making it use the same "fastimport" that the hacked-up CVS 
importer was made to use. Either way, somebody who understands SVN 
intimately (and probably perl) would need to work on it. 

That would not be me, so I can't really help ;)

> By default, if I had, say, one pack with the first 1000 revisions, and
> I imported another 1000, running 'git-repack' on its own would leave
> the first pack alone and create a new pack with just the second 1000
> revisions, right?

Yes. It's _probably_ better to do a full re-pack every once in a while 
(because if you have a lot of pack-files, eventually that ends up being 
problematic too), but as a first approximation, it's probably fine to just 
do a plain "git repack" every thousand commits, and then do a full big 
repack at the end.

The big repack will still be pretty expensive, but it should be less 
painful than having everything unpacked. And at least the import won't 
have run with millions and millions of loose objects.

So doing a "git repack -a -d" at the end is a good idea, and _maybe_ it 
could be done in the middle too for really big packs.

Again, doing what fastimport does avoids most of the whole issue, since it 
just generates a pack up-front instead. But that requires the importer to 
specifically understand about that kind of setup.

> This is on a dual-CPU dual-core Opteron, running the AMD64 variant of
> Ubuntu's Edgy release (64-bit kernel, 64-bit native userland). The
> pack-file was around 2.3GB.

Ok, that should all be fine. A 31-bit thing in OpenSSL would explain it, 
and doesn't sound unlikely. Just somebody using "int" somewhere, and it 
would never have been triggered by any sane user of SHA1_Update(). The git 
pack-check.c usage really _is_ very odd, even if it happens to make sense 
in that particular scenario.

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 18:30       ` Linus Torvalds
@ 2007-01-04 18:54         ` Chris Lee
  0 siblings, 0 replies; 55+ messages in thread
From: Chris Lee @ 2007-01-04 18:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 1/4/07, Linus Torvalds <torvalds@osdl.org> wrote:
> Well, the good news is, I think we could probably split it up from within
> git. It's not fundamentally hard, although it is pretty damn expensive
> (and it would require the subproject support to do really well).

I was hoping that'd be possible at some point. I really want to split
the submodules back out into first-class modules - one of my biggest
misgivings about the current KDE repository setup is how everything is
part of one gigantic repository.

> So ignore that issue for now. I'd love to see the end result, if only
> because it sounds like you have a test-case for git that is four times
> bigger than the mozilla archive - even if it's just because of some really
> really stupid design decisions from the KDE SVN maintainers ;)

The full on-disk size of the KDE SVN repo is about 37GB, last time I
checked. It may be up to 38 or 39GB now - I last ran rsync against the
svn repo a few weeks ago. I'm only focusing on importing the first
409k revisions at the moment, because that comprises the commits that
originally came from CVS and were imported into SVN. Almost
immediately after the CVS import, coolo made some changes - moving all
of the core KDE modules into /trunk/KDE, and their branches and tags
into /branches/KDE and /tags/KDE respectively. This, I suspect, will
end up making things "fun" for the other part of the import, which is
another 200k revisions, give or take.

So, yes, I suspect it's quite a bit larger than Mozilla. I'm doing the
conversion to git as a test so that I can show some numbers to the KDE
guys; I'm not trying to campaign for a transition to git, but I think
it's definitely worth exploring what such a world would look like. But
in order for me to try to make a compelling argument for an eventual
project move to git, the git win32 support would need to be really
good. (In KDE4, we're supporting Windows and OS X as well as X11 as
first-class platforms.)

> (But I would actually expect that KDE SVN uses SVN subprojects, so
> hopefully it's not _really_ one big repository. Of course, I don't know if
> SVN really does subprojects or how well it does them, so that's just a
> total guess).

I don't think so, but I'll ask coolo (the KDE SVN administrator).

> The real problem with a SVN import is that I think SVN doesn't do merges
> right, so you can't import merge history properly (well, you can, if you
> decide that "properly" really means "SVN can't merge, so we can't really
> show it as merges in git either").
>
> I think both git-svn and git-svnimport can _guess_ about merges, but it's
> just a heuristic, afaik. Whether it's a good one, I don't know.

Not too worried about the merges right now - as long as I have a rough
approximation of what the original looked like, I'm pretty happy.

> > Yeah. I haven't bothered hacking git-svnimport yet - but it looks like
> > having it automatically repack every thousand revisions or so would
> > probably be a pretty big win.
>
> That, or making it use the same "fastimport" that the hacked-up CVS
> importer was made to use. Either way, somebody who understands SVN
> intimately (and probably perl) would need to work on it.
>
> That would not be me, so I can't really help ;)

Well, Shawn pointed me at the fastimport stuff, and I happen to know
Perl reasonably well (I think) so I'll take a stab at trying it that
way.

> > By default, if I had, say, one pack with the first 1000 revisions, and
> > I imported another 1000, running 'git-repack' on its own would leave
> > the first pack alone and create a new pack with just the second 1000
> > revisions, right?
>
> Yes. It's _probably_ better to do a full re-pack every once in a while
> (because if you have a lot of pack-files, eventually that ends up being
> problematic too), but as a first approximation, it's probably fine to just
> do a plain "git repack" every thousand commits, and then do a full big
> repack at the end.

Sounds like a good idea. Also sounds like it would be much less
painful than the current situation, where it takes over nine hours to
pack up all these revisions. :)

> The big repack will still be pretty expensive, but it should be less
> painful than having everything unpacked. And at least the import won't
> have run with millions and millions of loose objects.
>
> So doing a "git repack -a -d" at the end is a good idea, and _maybe_ it
> could be done in the middle too for really big packs.

Okay, good to know.

> Again, doing what fastimport does avoids most of the whole issue, since it
> just generates a pack up-front instead. But that requires the importer to
> specifically understand about that kind of setup.

I'll definitely be investigating the fastimport option. Looks like
I'll get to crack open some of my Perl books - haven't had to do that
in a while. :)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
                     ` (3 preceding siblings ...)
  2007-01-04  6:25   ` git-svnimport failed and now git-repack hates me Junio C Hamano
@ 2007-01-04 19:24   ` Chris Lee
  2007-01-04 21:12     ` Linus Torvalds
  2007-01-04 21:31   ` Sasha Khapyorsky
  5 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04 19:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 1/3/07, Linus Torvalds <torvalds@osdl.org> wrote:
> > Checking `top` and `ps` revealed that there were no git-svnimport
> > processes doing anything, but all of my 4G of RAM was still marked as
> > used by the kernel. I had to do sysctl -w vm.drop_caches=3 to get it
> > to free all the RAM that the svn import had used up.
>
> I think that was just all cached, and all ok. The reason you didn't see
> any git-svnimport was that it had died off already, and all your memory
> was just caches. You could just have left it alone, and the kernel would
> have started re-using the memory for other things even without any
> "drop_caches".
>
> But what you did there didn't make anything worse; it just likely had
> no real impact.

Thought it was worth mentioning this:

When I checked top, the numbers it showed me were:
Mem:   4059332k total,  3216480k used,   842852k free,    40824k buffers
Swap:        0k total,        0k used,        0k free,    37364k cached

40MB in buffers, 37MB in cache, and 3GB used.

Seems like *something* was definitely lost there. The 'used' number
didn't go down at all when I started doing other things; it went up as
the new programs started, then they used up some RAM, and then when
they exited they'd free whatever resources they'd used. However, until
I did the drop_caches, that number stayed pretty damn big.

The system has been up since then, doing lots of things, and still
seems pretty stable, so I think it's okay, but I thought that it was
worth mentioning that something seemed to be leaky.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 17:58     ` git-svnimport failed and now git-repack hates me Chris Lee
@ 2007-01-04 20:22       ` Junio C Hamano
  2007-01-05 17:19         ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-04 20:22 UTC (permalink / raw)
  To: Chris Lee
  Cc: Junio C Hamano, Linus Torvalds, Shawn Pearce, Sasha Khapyorsky,
	Git Mailing List

"Chris Lee" <chris133@gmail.com> writes:

> I'll try all of this after the pack is regenerated. Thanks!

Thank YOU for helping to make git better.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 19:24   ` Chris Lee
@ 2007-01-04 21:12     ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-04 21:12 UTC (permalink / raw)
  To: Chris Lee; +Cc: Git Mailing List



On Thu, 4 Jan 2007, Chris Lee wrote:
> 
> Seems like *something* was definitely lost there. The 'used' number
> didn't go down at all when I started doing other things; it went up as
> the new programs started

The 'used' number basically _never_ goes down as long as there is memory 
free. The kernel simply doesn't have any reason to free any of its caches, 
even if those caches end up not being very useful.

What happened is almost certainly that with your big unpacked repository, 
the kernel ended up using a lot of memory on filename caching. In other 
words, I'd have expected that if you were to do 

	cat /proc/slabinfo

you'd have seen a _lot_ of memory being used for dentries ("dentry_cache") 
and inodes ("ext3_inode_cache" assuming you're an ext3 user).

The kernel can easily drop those caches on demand, but "free" isn't quite 
smart enough to know about them as being caches, so they will just show up 
as "used".

That said, since you didn't want them, dropping them by hand with sysctl 
certainly didn't hurt. Manual control can often be better than automatic 
heuristics..

So the reason why repacking is so useful is that it gets rid of all these 
millions of individual files. They all take up space on the disk, but they 
also do end up having a lot of caches associated with them.

Btw, you may find that despite your 4GB of RAM, you might still be 
better off with a swapfile. It gives the kernel a certain amount of 
freedom in choosing how to allocate memory, and perhaps more importantly, 
even when the kernel doesn't actively use it, it means that IF the kernel 
runs out of totally free memory (because it has decided to keep a lot of 
stuff in the dentry cache), it gives the kernel choices, and a certain 
"buffer" for making the right decision.

What often happens is that the memory management heuristics don't make the 
"perfect" choice (partly because it's theoretically impossible anyway, but 
largely just because it's just a damn hard problem to even get all that 
*close* to perfect), and having a swap partition or even a swap file just 
allows the kernel to make some mistakes without it hitting a hard wall of 
"oh, I can't do anything at all about this particular page".

So that buffer zone can be helpful in avoiding bad situations, but it can 
actually also end up improving performance - it doesn't sound like the 
case in this particular situation, but in some other loads there really 
are a lot of dirty pages that aren't all that useful and where the memory 
really could be better used for other things if the largely unused dirty 
page could just be written to disk.

			Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04  1:59 ` Linus Torvalds
                     ` (4 preceding siblings ...)
  2007-01-04 19:24   ` Chris Lee
@ 2007-01-04 21:31   ` Sasha Khapyorsky
  2007-01-04 22:04     ` Chris Lee
  5 siblings, 1 reply; 55+ messages in thread
From: Sasha Khapyorsky @ 2007-01-04 21:31 UTC (permalink / raw)
  To: Linus Torvalds, Chris Lee; +Cc: Junio C Hamano, Shawn Pearce, Git Mailing List

On 17:59 Wed 03 Jan 2007, Linus Torvalds wrote:
> 
> However, I don't know what the proper magic is for svnimport to do that 
> sane "do it in chunks and tell when you're all done". Or even better - to 
> just make it repack properly and not keep everything in memory.

> As to who knows how to fix git-svnimport to do something saner, I have no 
> clue.. Sasha seems to have touched it last. Sasha?

I guess it should not be hard to do svnimport incrementally with
repacking. Like this:


diff --git a/git-svnimport.perl b/git-svnimport.perl
index 071777b..afbbe63 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -31,12 +31,13 @@ $SIG{'PIPE'}="IGNORE";
 $ENV{'TZ'}="UTC";
 
 our($opt_h,$opt_o,$opt_v,$opt_u,$opt_C,$opt_i,$opt_m,$opt_M,$opt_t,$opt_T,
-    $opt_b,$opt_r,$opt_I,$opt_A,$opt_s,$opt_l,$opt_d,$opt_D,$opt_S,$opt_F,$opt_P);
+    $opt_b,$opt_r,$opt_I,$opt_A,$opt_s,$opt_l,$opt_d,$opt_D,$opt_S,$opt_F,
+    $opt_P,$opt_R);
 
 sub usage() {
 	print STDERR <<END;
 Usage: ${\basename $0}     # fetch/update GIT from SVN
-       [-o branch-for-HEAD] [-h] [-v] [-l max_rev]
+       [-o branch-for-HEAD] [-h] [-v] [-l max_rev] [-R repack_each_revs]
        [-C GIT_repository] [-t tagname] [-T trunkname] [-b branchname]
        [-d|-D] [-i] [-u] [-r] [-I ignorefilename] [-s start_chg]
        [-m] [-M regex] [-A author_file] [-S] [-F] [-P project_name] [SVN_URL]
@@ -44,7 +45,7 @@ END
 	exit(1);
 }
 
-getopts("A:b:C:dDFhiI:l:mM:o:rs:t:T:SP:uv") or usage();
+getopts("A:b:C:dDFhiI:l:mM:o:rs:t:T:SP:R:uv") or usage();
 usage if $opt_h;
 
 my $tag_name = $opt_t || "tags";
@@ -52,6 +53,7 @@ my $trunk_name = $opt_T || "trunk";
 my $branch_name = $opt_b || "branches";
 my $project_name = $opt_P || "";
 $project_name = "/" . $project_name if ($project_name);
+my $repack_after = $opt_R || 1000;
 
 @ARGV == 1 or @ARGV == 2 or usage();
 
@@ -938,11 +940,27 @@ if ($opt_l < $current_rev) {
     exit;
 }
 
-print "Fetching from $current_rev to $opt_l ...\n" if $opt_v;
+print "Processing from $current_rev to $opt_l ...\n" if $opt_v;
 
-my $pool=SVN::Pool->new;
-$svn->{'svn'}->get_log("/",$current_rev,$opt_l,0,1,1,\&commit_all,$pool);
-$pool->clear;
+my $from_rev;
+my $to_rev = $current_rev;
+
+while ($to_rev < $opt_l) {
+	$from_rev = $to_rev;
+	$to_rev = $from_rev + $repack_after;
+	$to_rev = $opt_l if $opt_l < $to_rev;
+	print "Fetching from $from_rev to $to_rev ...\n" if $opt_v;
+	my $pool=SVN::Pool->new;
+	$svn->{'svn'}->get_log("/",$from_rev,$to_rev,0,1,1,\&commit_all,$pool);
+	$pool->clear;
+	my $pid = fork();
+	die "Fork: $!\n" unless defined $pid;
+	unless($pid) {
+		exec("git-repack", "-d")
+			or die "Cannot repack: $!\n";
+	}
+	waitpid($pid, 0);
+}
 
 
 unlink($git_index);


Chris, it works fine for me with small repository (~9000 revisions), but
I don't have such huge one as yours. Could you try? Thanks.

Sasha

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 21:31   ` Sasha Khapyorsky
@ 2007-01-04 22:04     ` Chris Lee
  2007-01-07  0:17       ` [PATCH] git-svnimport: support for incremental import Sasha Khapyorsky
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-04 22:04 UTC (permalink / raw)
  To: Sasha Khapyorsky; +Cc: Git Mailing List

> Chris, it works fine for me with small repository (~9000 revisions), but
> I don't have such huge one as yours. Could you try? Thanks.

Patch looks like it makes sense. I can definitely try it later.

Back to work for now...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] git-svn: make --repack work consistently between fetch and multi-fetch
  2007-01-04  2:33   ` Eric Wong
  2007-01-04  2:40     ` Randal L. Schwartz
@ 2007-01-05  2:09     ` Eric Wong
  1 sibling, 0 replies; 55+ messages in thread
From: Eric Wong @ 2007-01-05  2:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Git Mailing List

Since fetch reforks itself at most every 1000 revisions, we
need to update the counter in the parent process to have a
working count if we set our repack interval to be > ~1000
revisions.  multi-fetch has always done this correctly
because of an extra process; now fetch uses the extra process
as well.

While we're at it, only compile the $sha1 regex that checks for
repacking once.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
---

I wrote:
> 	Just set the repack interval to something smaller than 1000;
> 	(--repack=100) if you experience timeouts.

Chris: you shouldn't get timeouts (at least not across HTTP(s)).
Also, don't worry about repack=100 either; there was a bug that
was triggered only in 'fetch' not 'multi-fetch' (you should use
'multi-fetch').  This patch fixes the 'fetch' bug.

 git-svn.perl |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 0fc386a..5377762 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -102,7 +102,7 @@ my %cmt_opts = ( 'edit|e' => \$_edit,
 );
 
 my %cmd = (
-	fetch => [ \&fetch, "Download new revisions from SVN",
+	fetch => [ \&cmd_fetch, "Download new revisions from SVN",
 			{ 'revision|r=s' => \$_revision, %fc_opts } ],
 	init => [ \&init, "Initialize a repo for tracking" .
 			  " (requires URL argument)",
@@ -293,6 +293,10 @@ sub init {
 	setup_git_svn();
 }
 
+sub cmd_fetch {
+	fetch_child_id($GIT_SVN, @_);
+}
+
 sub fetch {
 	check_upgrade_needed();
 	$SVN_URL ||= file_to_s("$GIT_SVN_DIR/info/url");
@@ -836,7 +840,6 @@ sub fetch_child_id {
 	my $ref = "$GIT_DIR/refs/remotes/$id";
 	defined(my $pid = open my $fh, '-|') or croak $!;
 	if (!$pid) {
-		$_repack = undef;
 		$GIT_SVN = $ENV{GIT_SVN_ID} = $id;
 		init_vars();
 		fetch(@_);
@@ -844,7 +847,7 @@ sub fetch_child_id {
 	}
 	while (<$fh>) {
 		print $_;
-		check_repack() if (/^r\d+ = $sha1/);
+		check_repack() if (/^r\d+ = $sha1/o);
 	}
 	close $fh or croak $?;
 }
@@ -1407,7 +1410,6 @@ sub git_commit {
 
 	# this output is read via pipe, do not change:
 	print "r$log_msg->{revision} = $commit\n";
-	check_repack();
 	return $commit;
 }
 
-- 
1.5.0.rc0.g0d67

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-04 20:22       ` Junio C Hamano
@ 2007-01-05 17:19         ` Chris Lee
  2007-01-05 19:05           ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-05 17:19 UTC (permalink / raw)
  To: Git Mailing List

So, first up:

Using git-verify-pack from master does not fail. It actually does
verify the pack (after a pretty decent wait.) I should have tried
master first before sending out the first mail. :)

It takes about eleven minutes for git-verify-pack to complete, but it
does run to completion. So something that changed between 1.4.1 and
master made everything great again.

I haven't tried git-prune yet, but I'll report back with the results
from that next.

Junio: Did you still want me to try those steps with that patch
anyway, even though it works on master?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 17:19         ` Chris Lee
@ 2007-01-05 19:05           ` Junio C Hamano
  2007-01-05 19:33             ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 19:05 UTC (permalink / raw)
  To: Chris Lee; +Cc: git

"Chris Lee" <chris133@gmail.com> writes:

> Using git-verify-pack from master does not fail. It actually does
> verify the pack (after a pretty decent wait.) I should have tried
> master first before sending out the first mail. :)

Depends on which "master" -- I pushed out the "chunked hashing"
fix on "master" as commit 8977c110 as part of the update last
night.

> Junio: Did you still want me to try those steps with that patch
> anyway, even though it works on master?

It would give us a confirmation that the above actually fixes
the problem, if your 1.4.1 fails to verify that same new pack
you just generated, on which you saw that the "master" (assuming
you mean the one with the above patch) works correctly.

If your "master" before 8977c110 already passes, then there is
something else going on, which would be worrisome.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 19:05           ` Junio C Hamano
@ 2007-01-05 19:33             ` Chris Lee
  2007-01-05 19:39               ` Shawn O. Pearce
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-05 19:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On 1/5/07, Junio C Hamano <junkio@cox.net> wrote:
> > Using git-verify-pack from master does not fail. It actually does
> > verify the pack (after a pretty decent wait.) I should have tried
> > master first before sending out the first mail. :)
>
> Depends on which "master" -- I pushed out the "chunked hashing"
> fix on "master" as commit 8977c110 as part of the update last
> night.

Well, that would definitely explain it. :)

I did a fresh 'git pull' on master last night before I ran the
git-verify-pack, and that was around 11PM PST.

> > Junio: Did you still want me to try those steps with that patch
> > anyway, even though it works on master?
>
> It would give us a confirmation that the above actually fixes
> the problem, if your 1.4.1 fails to verify that same new pack
> you just generated, on which you saw that the "master" (assuming
> you mean the one with the above patch) works correctly.
>
> If your "master" before 8977c110 already passes, then there is
> something else going on, which would be worrysome.

The 'master' I had definitely included 8977c110. I can try it out with
the tip from before that commit, though, if you want.

Also, 'git-prune' took about 30 minutes to run to completion. Oddly,
git-prune didn't remove the older packs - does git-prune ignore packs?
'git-repack -a -d' did remove them.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 19:33             ` Chris Lee
@ 2007-01-05 19:39               ` Shawn O. Pearce
  2007-01-05 20:48                 ` Chris Lee
  2007-01-05 21:37                 ` Junio C Hamano
  0 siblings, 2 replies; 55+ messages in thread
From: Shawn O. Pearce @ 2007-01-05 19:39 UTC (permalink / raw)
  To: Chris Lee; +Cc: Junio C Hamano, Git Mailing List

Chris Lee <chris133@gmail.com> wrote:
> Also, 'git-prune' took about 30 minutes to run to completion. Oddly,
> git-prune didn't remove the older packs - does git-prune ignore packs?
> 'git-repack -a -d' did remove them.

git-prune is expensive.  Very expensive on very large projects,
as it must iterate every object to decide what is needed, before
it can start to remove objects that aren't needed.

Yes, it doesn't deal with removing pack files.  That's what the -d
to git-repack is for.
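
The two phases Shawn describes can be sketched as a toy mark-and-sweep; the object ids, the links and the two "refs" below are invented for illustration, while real git walks commits, trees and blobs by SHA-1:

```c
#include <assert.h>

/* Toy model of what git-prune has to do.  Ids, links and refs are
 * made up; real git follows object references by SHA-1. */
#define NOBJ 6

/* each object points at one predecessor (-1 = none); object 5 is garbage */
static const int points_to[NOBJ] = { -1, 0, 1, -1, 3, -1 };
static int reachable[NOBJ];

/* mark phase: follow links from a ref, as rev-list would */
static void mark_from(int ref)
{
	for (int o = ref; o != -1; o = points_to[o])
		reachable[o] = 1;
}

/* sweep phase: every object must be visited to decide its fate */
static int sweep(void)
{
	int removed = 0;
	for (int o = 0; o < NOBJ; o++)
		if (!reachable[o])
			removed++;	/* a real prune would unlink the loose object */
	return removed;
}
```

The mark phase is what makes prune expensive on large histories: its cost grows with the number of reachable objects, not with the amount of garbage to be removed.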

-- 
Shawn.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 19:39               ` Shawn O. Pearce
@ 2007-01-05 20:48                 ` Chris Lee
  2007-01-05 21:37                 ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Chris Lee @ 2007-01-05 20:48 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Git Mailing List

On 1/5/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> git-prune is expensive.  Very expensive on very large projects,
> as it must iterate every object to decide what is needed, before
> it can start to remove objects that aren't needed.
>
> Yes, it doesn't deal with removing pack files.  That's what the -d
> to git-repack is for.

Not nearly as expensive as git-repack, that's for sure. :)

And - I originally thought that adding '-d' to git-repack just told it
to call 'git-prune' afterwards. It does more than that, which is cool.
Happily importing away - up to r320k now.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 19:39               ` Shawn O. Pearce
  2007-01-05 20:48                 ` Chris Lee
@ 2007-01-05 21:37                 ` Junio C Hamano
  2007-01-05 21:57                   ` Linus Torvalds
  2007-01-05 23:03                   ` Chris Lee
  1 sibling, 2 replies; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 21:37 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, Chris Lee, Linus Torvalds

Subject: [PATCH] builtin-prune: memory diet.

Somehow we forgot to turn save_commit_buffer off while walking
the reachable objects.  Releasing the memory for commit object
data that we do not use matters for large projects (for example,
about 90MB is saved while traversing linux-2.6 history).

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 * The linux-2.6 history number for me is inflated because I
   have grafts that connects historical archive behind the
   current v2.6.12-rc2 based history...

 builtin-prune.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/builtin-prune.c b/builtin-prune.c
index 00a53b3..b469c43 100644
--- a/builtin-prune.c
+++ b/builtin-prune.c
@@ -253,6 +253,8 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 		usage(prune_usage);
 	}
 
+	save_commit_buffer = 0;
+
 	/*
 	 * Set up revision parsing, and mark us as being interested
 	 * in all object types, not just commits.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 21:37                 ` Junio C Hamano
@ 2007-01-05 21:57                   ` Linus Torvalds
  2007-01-05 22:18                     ` alan
  2007-01-05 22:39                     ` Linus Torvalds
  2007-01-05 23:03                   ` Chris Lee
  1 sibling, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 21:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Chris Lee



On Fri, 5 Jan 2007, Junio C Hamano wrote:
> 
> Somehow we forgot to turn save_commit_buffer off while walking
> the reachable objects.  Releasing the memory for commit object
> data that we do not use matters for large projects (for example,
> about 90MB is saved while traversing linux-2.6 history).

Heh. Maybe we should just make the default the other way? It's probably 
pretty easy to find any users that suddenly start segfaulting ;)

(and just setting it in "cmd_log_init" would likely catch quite a number 
of them already).

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 21:57                   ` Linus Torvalds
@ 2007-01-05 22:18                     ` alan
  2007-01-07  0:36                       ` Eric Wong
  2007-01-05 22:39                     ` Linus Torvalds
  1 sibling, 1 reply; 55+ messages in thread
From: alan @ 2007-01-05 22:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Shawn O. Pearce, git, Chris Lee

On Fri, 5 Jan 2007, Linus Torvalds wrote:

>
>
> On Fri, 5 Jan 2007, Junio C Hamano wrote:
>>
>> Somehow we forgot to turn save_commit_buffer off while walking
>> the reachable objects.  Releasing the memory for commit object
>> data that we do not use matters for large projects (for example,
>> about 90MB is saved while traversing linux-2.6 history).
>
> Heh. Maybe we should just make the default the other way? It's probably
> pretty easy to find any users that suddenly start segfaulting ;)

I am trying to import a subversion repository and have yet to be able to 
suck down the whole thing without segfaulting.  It is a large repository. 
Works fine until about the last 10% and then runs out of memory.

open3: fork failed: Cannot allocate memory at /usr/bin/git-svn line 2711
512 at /usr/bin/git-svn line 446
         main::fetch_lib() called at /usr/bin/git-svn line 314
         main::fetch() called at /usr/bin/git-svn line 173

I need to try the "partial download" script and see if that helps.

-- 
"Invoking the supernatural can explain anything, and hence explains nothing."
                   - University of Utah bioengineering professor Gregory Clark

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 21:57                   ` Linus Torvalds
  2007-01-05 22:18                     ` alan
@ 2007-01-05 22:39                     ` Linus Torvalds
  2007-01-05 22:48                       ` Junio C Hamano
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 22:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Chris Lee



On Fri, 5 Jan 2007, Linus Torvalds wrote:
> 
> Heh. Maybe we should just make the default the other way? It's probably 
> pretty easy to find any users that suddenly start segfaulting ;)

This seems to pass all the tests, at least.

(But I didn't test the SVN stuff, since I don't have perl::SVN installed)

		Linus
---
diff --git a/builtin-branch.c b/builtin-branch.c
index d3df5a5..0b662a8 100644
--- a/builtin-branch.c
+++ b/builtin-branch.c
@@ -441,6 +441,9 @@ int cmd_branch(int argc, const char **argv, const char *prefix)
 	    (rename && force_create))
 		usage(builtin_branch_usage);
 
+	if (verbose)
+		save_commit_buffer = 1;
+
 	head = xstrdup(resolve_ref("HEAD", head_sha1, 0, NULL));
 	if (!head)
 		die("Failed to resolve HEAD as a valid ref.");
diff --git a/builtin-diff-tree.c b/builtin-diff-tree.c
index 24cb2d7..212ad59 100644
--- a/builtin-diff-tree.c
+++ b/builtin-diff-tree.c
@@ -67,6 +67,7 @@ int cmd_diff_tree(int argc, const char **argv, const char *prefix)
 	static struct rev_info *opt = &log_tree_opt;
 	int read_stdin = 0;
 
+	save_commit_buffer = 1;
 	init_revisions(opt, prefix);
 	git_config(git_default_config); /* no "diff" UI options */
 	nr_sha1 = 0;
diff --git a/builtin-fmt-merge-msg.c b/builtin-fmt-merge-msg.c
index 87d3d63..4053651 100644
--- a/builtin-fmt-merge-msg.c
+++ b/builtin-fmt-merge-msg.c
@@ -251,6 +251,7 @@ int cmd_fmt_merge_msg(int argc, const char **argv, const char *prefix)
 	unsigned char head_sha1[20];
 	const char *current_branch;
 
+	save_commit_buffer = 1;
 	git_config(fmt_merge_msg_config);
 
 	while (argc > 1) {
diff --git a/builtin-log.c b/builtin-log.c
index a59b4ac..ac95921 100644
--- a/builtin-log.c
+++ b/builtin-log.c
@@ -22,6 +22,7 @@ static void cmd_log_init(int argc, const char **argv, const char *prefix,
 {
 	int i;
 
+	save_commit_buffer = 1;
 	rev->abbrev = DEFAULT_ABBREV;
 	rev->commit_format = CMIT_FMT_DEFAULT;
 	rev->verbose_header = 1;
@@ -372,6 +373,7 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 	rev.ignore_merges = 1;
 	rev.diffopt.msg_sep = "";
 	rev.diffopt.recursive = 1;
+	save_commit_buffer = 1;
 
 	rev.extra_headers = extra_headers;
 
@@ -569,6 +571,7 @@ int cmd_cherry(int argc, const char **argv, const char *prefix)
 	const char *limit = NULL;
 	int verbose = 0;
 
+	save_commit_buffer = 1;
 	if (argc > 1 && !strcmp(argv[1], "-v")) {
 		verbose = 1;
 		argc--;
diff --git a/builtin-show-branch.c b/builtin-show-branch.c
index c67f2fa..53d1b29 100644
--- a/builtin-show-branch.c
+++ b/builtin-show-branch.c
@@ -586,6 +586,7 @@ int cmd_show_branch(int ac, const char **av, const char *prefix)
 	int dense = 1;
 	int reflog = 0;
 
+	save_commit_buffer = 1;
 	git_config(git_show_branch_config);
 
 	/* If nothing is specified, try the default first */
diff --git a/commit.c b/commit.c
index 2a58175..660d365 100644
--- a/commit.c
+++ b/commit.c
@@ -4,7 +4,7 @@
 #include "pkt-line.h"
 #include "utf8.h"
 
-int save_commit_buffer = 1;
+int save_commit_buffer = 0;
 
 struct sort_node
 {
diff --git a/merge-recursive.c b/merge-recursive.c
index bac16f5..b98ed1a 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1286,6 +1286,7 @@ int main(int argc, char *argv[])
 	const char *branch1, *branch2;
 	struct commit *result, *h1, *h2;
 
+	save_commit_buffer = 1;
 	git_config(git_default_config); /* core.filemode */
 	original_index_file = getenv(INDEX_ENVIRONMENT);
 
diff --git a/revision.c b/revision.c
index 6e4ec46..aa10088 100644
--- a/revision.c
+++ b/revision.c
@@ -737,6 +737,7 @@ static void add_grep(struct rev_info *revs, const char *ptn, enum grep_pat_token
 		opt->pattern_tail = &(opt->pattern_list);
 		opt->regflags = REG_NEWLINE;
 		revs->grep_filter = opt;
+		save_commit_buffer = 1;
 	}
 	append_grep_pattern(revs->grep_filter, ptn,
 			    "command line", 0, what);

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 22:39                     ` Linus Torvalds
@ 2007-01-05 22:48                       ` Junio C Hamano
  2007-01-05 23:00                         ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 22:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git, Chris Lee

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 5 Jan 2007, Linus Torvalds wrote:
>> 
>> Heh. Maybe we should just make the default the other way? It's probably 
>> pretty easy to find any users that suddenly start segfaulting ;)
>
> This seems to pass all the tests, at least.
>
> (But I didn't test the SVN stuff, since I don't have perl::SVN installed)

I do not think we have too many branch refs (builtin-branch and
builtin-show-branch) for this patch to make any practical
difference, but I wonder why this is needed...

> diff --git a/merge-recursive.c b/merge-recursive.c
> index bac16f5..b98ed1a 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -1286,6 +1286,7 @@ int main(int argc, char *argv[])
>  	const char *branch1, *branch2;
>  	struct commit *result, *h1, *h2;
>  
> +	save_commit_buffer = 1;
>  	git_config(git_default_config); /* core.filemode */
>  	original_index_file = getenv(INDEX_ENVIRONMENT);

Ah, there are those annoying "using this as the merge base whose
commit log is..." business.  I wonder if anybody is actually
reading them (I once considered squelching that output).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 22:48                       ` Junio C Hamano
@ 2007-01-05 23:00                         ` Linus Torvalds
  2007-01-05 23:02                           ` Linus Torvalds
  2007-01-05 23:44                           ` Junio C Hamano
  0 siblings, 2 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 23:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Chris Lee



On Fri, 5 Jan 2007, Junio C Hamano wrote:
> 
> I do not think we have too many branch refs (builtin-branch and
> builtin-show-branch) for this patch to make any practical
> difference

Yeah, it's mainly a "safety thing" - have the default be the "don't waste 
memory".

> but I wonder why this is needed...
> 
> > diff --git a/merge-recursive.c b/merge-recursive.c
> > index bac16f5..b98ed1a 100644
> > --- a/merge-recursive.c
> > +++ b/merge-recursive.c
> > @@ -1286,6 +1286,7 @@ int main(int argc, char *argv[])
> >  	const char *branch1, *branch2;
> >  	struct commit *result, *h1, *h2;
> >  
> > +	save_commit_buffer = 1;
> >  	git_config(git_default_config); /* core.filemode */
> >  	original_index_file = getenv(INDEX_ENVIRONMENT);
> 
> Ah, there are those annoying "using this as the merge base whose
> commit log is..." business.  I wonder if anybody is actually
> reading them (I once considered squelching that output).

"output_commit_title()" used it. Not just for the merge base, but for the 
regular "merging X and Y" messages, I think.

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:00                         ` Linus Torvalds
@ 2007-01-05 23:02                           ` Linus Torvalds
  2007-01-05 23:44                           ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 23:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Chris Lee



On Fri, 5 Jan 2007, Linus Torvalds wrote:
> 
> Yeah, it's mainly a "safety thing" - have the default be the "don't waste 
> memory".

Btw, I'm not at all certain whether it's necessary or a good thing. I just 
decided to see how many people really seem to use the commit messages at 
all. So feel free to throw the patch away if you don't think this is 
worthwhile, I won't push it.

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 21:37                 ` Junio C Hamano
  2007-01-05 21:57                   ` Linus Torvalds
@ 2007-01-05 23:03                   ` Chris Lee
  2007-01-05 23:09                     ` Junio C Hamano
  1 sibling, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-05 23:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Linus Torvalds

On 1/5/07, Junio C Hamano <junkio@cox.net> wrote:
> Subject: [PATCH] builtin-prune: memory diet.
>
> Somehow we forgot to turn save_commit_buffer off while walking
> the reachable objects.  Releasing the memory for commit object
> data that we do not use matters for large projects (for example,
> about 90MB is saved while traversing linux-2.6 history).

Is git-verify-pack supposed to mmap the entire packfile? Because the
version I have maps 2.3GB into RAM and keeps it there until it's done.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:03                   ` Chris Lee
@ 2007-01-05 23:09                     ` Junio C Hamano
  2007-01-05 23:17                       ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 23:09 UTC (permalink / raw)
  To: Chris Lee; +Cc: Shawn O. Pearce, git, Linus Torvalds

"Chris Lee" <chris133@gmail.com> writes:

> On 1/5/07, Junio C Hamano <junkio@cox.net> wrote:
>> Subject: [PATCH] builtin-prune: memory diet.
>>
>> Somehow we forgot to turn save_commit_buffer off while walking
>> the reachable objects.  Releasing the memory for commit object
>> data that we do not use matters for large projects (for example,
>> about 90MB is saved while traversing linux-2.6 history).
>
> Is git-verify-pack supposed to mmap the entire packfile? Because the
> version I have maps 2.3GB into RAM and keeps it there until it's done.

Yes -- we need to hash the whole thing as well as do other
checks on it.  The sliding mmap() in "next" will map it in chunks
of 32MB or 1GB, but the need to read every byte of it does
not change.

The problem Linus pointed out was that your SHA1_Update()
implementations may not be prepared to hash the whole 2.3GB in
one go.  The one in "master" (and "maint", although I haven't
done a v1.4.4.4 maintenance release yet) calls SHA1_Update()
in chunks to work around that potential issue.
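
A minimal sketch of that chunking workaround, with `hash_update()` standing in for the real SHA1_Update() and a made-up chunk size:

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SZ (8 * 1024)	/* made-up; the point is only that it is bounded */

static unsigned long long bytes_fed;
static unsigned calls;
static unsigned char demo[20000];

/* stand-in for SHA1_Update(&ctx, buf, len): assume len must stay small */
static void hash_update(const unsigned char *buf, size_t len)
{
	(void)buf;
	bytes_fed += len;
	calls++;
}

/* feed an arbitrarily large (e.g. mmap'ed) buffer in safe chunks */
static void hash_in_chunks(const unsigned char *buf, unsigned long long len)
{
	while (len) {
		size_t n = len < CHUNK_SZ ? (size_t)len : CHUNK_SZ;
		hash_update(buf, n);
		buf += n;
		len -= n;
	}
}
```

The point is that no single call ever sees a length larger than CHUNK_SZ, so an underlying implementation whose length parameter is a 32-bit quantity is never overflowed by a multi-GB pack.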

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:09                     ` Junio C Hamano
@ 2007-01-05 23:17                       ` Linus Torvalds
  2007-01-05 23:58                         ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 23:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Shawn O. Pearce, git



On Fri, 5 Jan 2007, Junio C Hamano wrote:
> 
> The problem Linus pointed out was that your SHA1_Update()
> implementations may not be prepared to hash the whole 2.3GB in
> one go.  The one in "master" (and "maint", although I haven't
> done a v1.4.4.4 maintenance release yet) calls SHA1_Update()
> in chunks to work around that potential issue.

Well, I think Chris is worried about having it all mapped at the same 
time.

It does actually end up forcing the kernel to do more work (it's harder to 
re-use a mapped page than it is to reuse one that isn't), and in that 
sense, if you have less than <n> GB of RAM and can't just keep it all in 
memory at the same time, doing one large mmap is possibly more expensive 
than chunking things up.

That said, I doubt it's a huge problem. If you can't fit the whole file in 
memory, your real performance issue is going to be the IO, not the fact 
that the kernel has to work a bit harder at unmapping pages ;)

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:00                         ` Linus Torvalds
  2007-01-05 23:02                           ` Linus Torvalds
@ 2007-01-05 23:44                           ` Junio C Hamano
  2007-01-05 23:59                             ` Linus Torvalds
  2007-01-06  0:06                             ` Johannes Schindelin
  1 sibling, 2 replies; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 23:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git, Chris Lee

Linus Torvalds <torvalds@osdl.org> writes:

>> Ah, there are those annoying "using this as the merge base whose
>> commit log is..." business.  I wonder if anybody is actually
>> reading them (I once considered squelching that output).
>
> "output_commit_title()" used it. Not just for the merge base, but for the 
> regular "merging X and Y" messages, I think.

Yes, what I really was wondering was (1) whether the messages are
useful, and (2) if so, whether that belongs in git-merge, not
git-merge-recursive.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:17                       ` Linus Torvalds
@ 2007-01-05 23:58                         ` Junio C Hamano
  2007-01-06  0:11                           ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-05 23:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chris Lee, Shawn O. Pearce, git

Linus Torvalds <torvalds@osdl.org> writes:

> It does actually end up forcing the kernel to do more work (it's harder to 
> re-use a mapped page than it is to reuse one that isn't), and in that 
> sense, if you have less than <n> GB of RAM and can't just keep it all in 
> memory at the same time, doing one large mmap is possibly more expensive 
> than chunking things up.

Even if it is a read-only private mapping?  Would MAP_SHARED
help?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:44                           ` Junio C Hamano
@ 2007-01-05 23:59                             ` Linus Torvalds
  2007-01-06  0:06                             ` Johannes Schindelin
  1 sibling, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-05 23:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn O. Pearce, git, Chris Lee



On Fri, 5 Jan 2007, Junio C Hamano wrote:
> 
> Yes, what I really was wondering were (1) if the messages are
> useful, and (2) if so should that belong to git-merge not
> git-merge-recursive.

I kind of like them, but I don't really look _too_ much at them, so .. 

I guess it would make more sense to do that at a higher level, and have 
the low-level merger just do the actual merge itself.

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:44                           ` Junio C Hamano
  2007-01-05 23:59                             ` Linus Torvalds
@ 2007-01-06  0:06                             ` Johannes Schindelin
  1 sibling, 0 replies; 55+ messages in thread
From: Johannes Schindelin @ 2007-01-06  0:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Shawn O. Pearce, git, Chris Lee

Hi,

On Fri, 5 Jan 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> >> Ah, there are those annoying "using this as the merge base whose
> >> commit log is..." business.  I wonder if anybody is actually
> >> reading them (I once considered squelching that output).
> >
> > "output_commit_title()" used it. Not just for the merge base, but for the 
> > regular "merging X and Y" messages, I think.
> 
> Yes, what I really was wondering were (1) if the messages are
> useful, and (2) if so should that belong to git-merge not
> git-merge-recursive.

Since recursive merge performs possibly more than one merge, it belongs 
in merge-recursive.c, _if_ we want that message.

I found it helpful for "debugging" failed _recursive_ merges. I.e. I knew 
which of the recursive merges introduced the many, many conflicts. But I 
cannot remember off-hand if that was a test merge, and whether it was
before or after I sorted the merge bases by date.

Since the conflict markers now say which commit the conflicts came from, I 
am okay with removing the message, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 23:58                         ` Junio C Hamano
@ 2007-01-06  0:11                           ` Linus Torvalds
  2007-01-06  0:15                             ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2007-01-06  0:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Shawn O. Pearce, git



On Fri, 5 Jan 2007, Junio C Hamano wrote:
> 
> Even if it is a read-only private mapping?  Would MAP_SHARED
> help?

mmap is mmap, and it all boils down to having to remove it from the page 
tables.

But it really shouldn't be a problem. 

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-06  0:11                           ` Linus Torvalds
@ 2007-01-06  0:15                             ` Linus Torvalds
  2007-01-06  0:23                               ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2007-01-06  0:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Shawn O. Pearce, git



On Fri, 5 Jan 2007, Linus Torvalds wrote:
> 
> But it really shouldn't be a problem. 

Basically, this boils down to the same old issue: if you have a fixed 
access pattern (like SHA1_Update() over the whole buffer), you're actually 
likely to perform better with a loop of read() calls than with mmap.

So if we ONLY did the SHA1 thing, we shouldn't do mmap, we should just 
chunk things up into 16kB buffers or something, and read them.

But the mmap in pack-check _also_ ends up being for the subsequent object 
checking (with unpacking etc), so the mmap here actually is probably the 
right thing to do. I really wouldn't worry, unless we get people who 
report real problems (and I think the problems with svn-import of the huge 
KDE repos are all elsewhere, notably in the SVN import itself, not in any 
pack handling ;)
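
A rough sketch of that read() loop, assuming a made-up demo file and a plain byte-sum standing in for the real SHA-1:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (16 * 1024)	/* the "16kB buffers or something" */

/* hypothetical demo file, written only so the sketch is self-contained */
static const char *demo_path = "chunked-read-demo.bin";

static void write_demo(long nbytes)
{
	FILE *f = fopen(demo_path, "wb");
	for (long i = 0; i < nbytes; i++)
		fputc(1, f);
	fclose(f);
}

/* sequential read() loop: no mmap, so no pages to unmap afterwards */
static long checksum_file(const char *path)
{
	unsigned char buf[CHUNK];
	long sum = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		for (ssize_t i = 0; i < n; i++)
			sum += buf[i];	/* stand-in for SHA1_Update(&ctx, buf, n) */
	close(fd);
	return sum;
}
```

For a strictly sequential one-pass consumer like this, the fixed-size read() buffer gives the kernel an easy reuse pattern, which is the contrast with mmap being drawn above.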

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-06  0:15                             ` Linus Torvalds
@ 2007-01-06  0:23                               ` Junio C Hamano
  2007-01-06  1:22                                 ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2007-01-06  0:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chris Lee, Shawn O. Pearce, git

Linus Torvalds <torvalds@osdl.org> writes:

> On Fri, 5 Jan 2007, Linus Torvalds wrote:
>> 
>> But it really shouldn't be a problem. 
>
> Basically, this boils down to the same old issue: if you have a fixed 
> access pattern (like SHA1_Update() over the whole buffer), you're actually 
> likely to perform better with a loop of read() calls than with mmap.
>
> So if we ONLY did the SHA1 thing, we shouldn't do mmap, we should just 
> chunk things up into 16kB buffers or something, and read them.

While I have your attention, there is a patch for the sliding
mmap() thing that raises the mmap window to 1GB (which means a
pack smaller than that is mmap'ed in its entirety, while a 2.3GB
pack will be mapped as perhaps three separate chunks) and the
total mmap window to 8GB (and any overflows we LRU out) on
places where sizeof(void*) == 8 (i.e. git compiled for 64-bit).

Currently these limits are 32MB and 256MB respectively on
platforms with real mmap().

Do you have any comments on it?
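
A toy model of that sliding-window LRU, with made-up sizes (two 100-byte windows standing in for the 32MB-or-1GB windows and the total mapping limit):

```c
#include <assert.h>

/* Toy model of the sliding-window scheme: the pack is mapped in
 * fixed-size windows and, once all slots are taken, the least-
 * recently-used window is dropped.  Sizes and bookkeeping are
 * invented; real git keys windows per packfile and actually calls
 * mmap()/munmap(). */

#define WINDOW_SZ   100		/* stands in for the 32MB (or 1GB) window */
#define MAX_WINDOWS 2		/* stands in for the total mapping limit */

struct window { long start; long stamp; int in_use; };
static struct window win[MAX_WINDOWS];
static long tick;

/* touch the window covering offset; returns 1 if a new map was needed */
static int use_offset(long offset)
{
	long start = offset - offset % WINDOW_SZ;
	int lru = 0;

	for (int i = 0; i < MAX_WINDOWS; i++) {
		if (win[i].in_use && win[i].start == start) {
			win[i].stamp = ++tick;	/* hit: refresh LRU stamp */
			return 0;
		}
		if (!win[i].in_use ||
		    (win[lru].in_use && win[i].stamp < win[lru].stamp))
			lru = i;
	}
	win[lru].start = start;		/* evict the LRU slot and "map" */
	win[lru].stamp = ++tick;
	win[lru].in_use = 1;
	return 1;
}
```

A pack smaller than one window stays mapped in its entirety; anything larger cycles through the slots, which is exactly the behavior the 1GB/8GB (or 32MB/256MB) limits tune.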

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: git-svnimport failed and now git-repack hates me
  2007-01-06  0:23                               ` Junio C Hamano
@ 2007-01-06  1:22                                 ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2007-01-06  1:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Shawn O. Pearce, git



On Fri, 5 Jan 2007, Junio C Hamano wrote:
>
> While I have your attention, there is a patch for the sliding
> mmap() thing that raises the mmap window to 1GB (which means a
> pack smaller than that is mmap'ed in its entirety, whle 2.3GB
> pack will be mapped perhaps as three separate chunks) and the
> total mmap window to 8GB (and any overflows we LRU out) on
> places where sizeof(void*) == 8 (i.e. git compiled for 64-bit).
> 
> Currently these limits are 32MB and 256MB respectively on
> platforms with real mmap().
> 
> Do you have any comments on it?

I think it's fine. Most "normal" mmap users hopefully will only use a 
small portion of the mapped space, and if they use it all, it means that 
they needed it all, so..

		Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] git-svnimport: support for incremental import
  2007-01-04 22:04     ` Chris Lee
@ 2007-01-07  0:17       ` Sasha Khapyorsky
  2007-01-07 18:12         ` Chris Lee
  0 siblings, 1 reply; 55+ messages in thread
From: Sasha Khapyorsky @ 2007-01-07  0:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Git Mailing List

This adds the ability to do the import "in chunks" (1000 revisions by
default); after each chunk the git repo will be repacked. The -R option
is used to change the chunk size (that is, how often the repository
will be repacked).

Signed-off-by: Sasha Khapyorsky <sashak@voltaire.com>
---

Chris reported successful test with this patch.

 Documentation/git-svnimport.txt |   10 +++++++++-
 git-svnimport.perl              |   32 +++++++++++++++++++++++++-------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/Documentation/git-svnimport.txt b/Documentation/git-svnimport.txt
index 2c7c7da..b166cf3 100644
--- a/Documentation/git-svnimport.txt
+++ b/Documentation/git-svnimport.txt
@@ -15,7 +15,7 @@ SYNOPSIS
 		[ -b branch_subdir ] [ -T trunk_subdir ] [ -t tag_subdir ]
 		[ -s start_chg ] [ -m ] [ -r ] [ -M regex ]
 		[ -I <ignorefile_name> ] [ -A <author_file> ]
-		[ -P <path_from_trunk> ]
+		[ -R <repack_each_revs>] [ -P <path_from_trunk> ]
 		<SVN_repository_URL> [ <path> ]
 
 
@@ -108,6 +108,14 @@ repository without -A.
 Formerly, this option controlled how many revisions to pull,
 due to SVN memory leaks. (These have been worked around.)
 
+-R <repack_each_revs>::
+	Specify how often the git repository should be repacked.
++
+The default value is 1000. git-svnimport will import in chunks of 1000
+revisions, and after each chunk the git repository will be repacked. To
+disable this behavior, specify a value here larger than the number of
+revisions to import.
+
 -P <path_from_trunk>::
 	Partial import of the SVN tree.
 +
diff --git a/git-svnimport.perl b/git-svnimport.perl
index 071777b..afbbe63 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -31,12 +31,13 @@ $SIG{'PIPE'}="IGNORE";
 $ENV{'TZ'}="UTC";
 
 our($opt_h,$opt_o,$opt_v,$opt_u,$opt_C,$opt_i,$opt_m,$opt_M,$opt_t,$opt_T,
-    $opt_b,$opt_r,$opt_I,$opt_A,$opt_s,$opt_l,$opt_d,$opt_D,$opt_S,$opt_F,$opt_P);
+    $opt_b,$opt_r,$opt_I,$opt_A,$opt_s,$opt_l,$opt_d,$opt_D,$opt_S,$opt_F,
+    $opt_P,$opt_R);
 
 sub usage() {
 	print STDERR <<END;
 Usage: ${\basename $0}     # fetch/update GIT from SVN
-       [-o branch-for-HEAD] [-h] [-v] [-l max_rev]
+       [-o branch-for-HEAD] [-h] [-v] [-l max_rev] [-R repack_each_revs]
        [-C GIT_repository] [-t tagname] [-T trunkname] [-b branchname]
        [-d|-D] [-i] [-u] [-r] [-I ignorefilename] [-s start_chg]
        [-m] [-M regex] [-A author_file] [-S] [-F] [-P project_name] [SVN_URL]
@@ -44,7 +45,7 @@ END
 	exit(1);
 }
 
-getopts("A:b:C:dDFhiI:l:mM:o:rs:t:T:SP:uv") or usage();
+getopts("A:b:C:dDFhiI:l:mM:o:rs:t:T:SP:R:uv") or usage();
 usage if $opt_h;
 
 my $tag_name = $opt_t || "tags";
@@ -52,6 +53,7 @@ my $trunk_name = $opt_T || "trunk";
 my $branch_name = $opt_b || "branches";
 my $project_name = $opt_P || "";
 $project_name = "/" . $project_name if ($project_name);
+my $repack_after = $opt_R || 1000;
 
 @ARGV == 1 or @ARGV == 2 or usage();
 
@@ -938,11 +940,27 @@ if ($opt_l < $current_rev) {
     exit;
 }
 
-print "Fetching from $current_rev to $opt_l ...\n" if $opt_v;
+print "Processing from $current_rev to $opt_l ...\n" if $opt_v;
 
-my $pool=SVN::Pool->new;
-$svn->{'svn'}->get_log("/",$current_rev,$opt_l,0,1,1,\&commit_all,$pool);
-$pool->clear;
+my $from_rev;
+my $to_rev = $current_rev;
+
+while ($to_rev < $opt_l) {
+	$from_rev = $to_rev;
+	$to_rev = $from_rev + $repack_after;
+	$to_rev = $opt_l if $opt_l < $to_rev;
+	print "Fetching from $from_rev to $to_rev ...\n" if $opt_v;
+	my $pool=SVN::Pool->new;
+	$svn->{'svn'}->get_log("/",$from_rev,$to_rev,0,1,1,\&commit_all,$pool);
+	$pool->clear;
+	my $pid = fork();
+	die "Fork: $!\n" unless defined $pid;
+	unless($pid) {
+		exec("git-repack", "-d")
+			or die "Cannot repack: $!\n";
+	}
+	waitpid($pid, 0);
+}
 
 
 unlink($git_index);
-- 
1.5.0.rc0.g2484-dirty
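In Python terms, the chunking loop added by this patch does roughly the following; `fetch_range` and `repack` are hypothetical stand-ins for the SVN get_log()/commit_all machinery and the forked "git-repack -d":

```python
def import_in_chunks(current_rev, last_rev, fetch_range, repack, chunk=1000):
    """Mirror of the patch's loop: fetch `chunk` revisions, then repack.

    fetch_range(from_rev, to_rev) and repack() are hypothetical stand-ins
    for SVN::get_log()/commit_all and the forked "git-repack -d".
    """
    to_rev = current_rev
    while to_rev < last_rev:
        from_rev = to_rev
        to_rev = min(from_rev + chunk, last_rev)
        fetch_range(from_rev, to_rev)
        repack()

# With current_rev=0, last_rev=2500, chunk=1000 the fetched ranges are
# (0, 1000), (1000, 2000), (2000, 2500) - note the shared edge revisions,
# which a follow-up message in this thread adjusts.
ranges = []
import_in_chunks(0, 2500, lambda a, b: ranges.append((a, b)), lambda: None)
print(ranges)
```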


* Re: git-svnimport failed and now git-repack hates me
  2007-01-05 22:18                     ` alan
@ 2007-01-07  0:36                       ` Eric Wong
  0 siblings, 0 replies; 55+ messages in thread
From: Eric Wong @ 2007-01-07  0:36 UTC (permalink / raw)
  To: alan; +Cc: git

alan <alan@clueserver.org> wrote:
> I am trying to import a subversion repository and have yet to be able to 
> suck down the whole thing without segfaulting.  It is a large repository. 
> Works fine until about the last 10% and then runs out of memory.
> 
> open3: fork failed: Cannot allocate memory at /usr/bin/git-svn line 2711
> 512 at /usr/bin/git-svn line 446
>         main::fetch_lib() called at /usr/bin/git-svn line 314
>         main::fetch() called at /usr/bin/git-svn line 173
> 
> I need to try the "partial download" script and see if that helps.

Which version of git-svn is this?  If it's a public repository I'd
like to have a look.

git-svn memory usage should be bounded by:
	max(max(commit-message size),
	    max(number of files changed per revision))
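Read literally, that bound could be computed like this (the revision data below is made up purely for illustration):

```python
def memory_bound(revisions):
    """revisions: iterable of (commit_message, changed_paths) pairs.

    Returns the larger of the biggest commit-message size (in bytes) and
    the biggest per-revision changed-file count, per the bound quoted above.
    """
    return max(max(len(msg) for msg, _ in revisions),
               max(len(paths) for _, paths in revisions))

# Hypothetical sample: a 50-byte log message dominates the file counts.
revs = [("fix build", ["Makefile"]), ("x" * 50, ["a.c", "b.c"])]
print(memory_bound(revs))  # 50
```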

I'm not sure if the size of the files changed per-revision or if the
size of the deltas is an issue with git-svn.  But if you have a repo
with big files and big changes to them, let me know so I can take a
look.

Can you also try lowering $inc in git-svn to something lower (perhaps
100)? (my $inc = 1000; in the fetch_lib function) and see if that helps
things?  Thanks.

-- 
Eric Wong


* Re: [PATCH] git-svnimport: support for incremental import
  2007-01-07  0:17       ` [PATCH] git-svnimport: support for incremental import Sasha Khapyorsky
@ 2007-01-07 18:12         ` Chris Lee
  2007-01-07 18:59           ` Sasha Khapyorsky
  0 siblings, 1 reply; 55+ messages in thread
From: Chris Lee @ 2007-01-07 18:12 UTC (permalink / raw)
  To: Sasha Khapyorsky; +Cc: Junio C Hamano, Git Mailing List

On 1/6/07, Sasha Khapyorsky <sashak@voltaire.com> wrote:
> This adds the ability to import "in chunks" (1000 revisions by default);
> after each chunk the git repo will be repacked. The -R option changes the
> chunk size (i.e. how often the repository will be repacked).

Actually, I just noticed an issue with this - it appears to be
double-importing the edge revisions.

So if I started with -s 349000 and tell it to repack every 1000
revisions, it's now importing every thousandth revision twice.

Off-by-one?


* Re: [PATCH] git-svnimport: support for incremental import
  2007-01-07 18:12         ` Chris Lee
@ 2007-01-07 18:59           ` Sasha Khapyorsky
  2007-01-08  2:22             ` [PATCH] git-svnimport: fix edge revisions double importing Sasha Khapyorsky
  0 siblings, 1 reply; 55+ messages in thread
From: Sasha Khapyorsky @ 2007-01-07 18:59 UTC (permalink / raw)
  To: Chris Lee; +Cc: Junio C Hamano, Git Mailing List

On 10:12 Sun 07 Jan, Chris Lee wrote:
> On 1/6/07, Sasha Khapyorsky <sashak@voltaire.com> wrote:
> >This adds the ability to import "in chunks" (1000 revisions by default);
> >after each chunk the git repo will be repacked. The -R option changes the
> >chunk size (i.e. how often the repository will be repacked).
> 
> Actually, I just noticed an issue here with this - it appears to be
> double-importing the edge revisions.
> 
> So if I started with -s 349000 and tell it to repack every 1000
> revisions, it's now importing every thousandth revision twice.

Indeed. Here is the fix:


diff --git a/git-svnimport.perl b/git-svnimport.perl
index afbbe63..f1f1a7d 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -943,10 +943,10 @@ if ($opt_l < $current_rev) {
 print "Processing from $current_rev to $opt_l ...\n" if $opt_v;
 
 my $from_rev;
-my $to_rev = $current_rev;
+my $to_rev = $current_rev - 1;
 
 while ($to_rev < $opt_l) {
-	$from_rev = $to_rev;
+	$from_rev = $to_rev + 1;
 	$to_rev = $from_rev + $repack_after;
 	$to_rev = $opt_l if $opt_l < $to_rev;
 	print "Fetching from $from_rev to $to_rev ...\n" if $opt_v;


Sasha
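The effect of the fix can be checked with a small sketch of the loop; the revision numbers are just the example from earlier in this thread, and get_log's range is assumed inclusive at both ends:

```python
def chunk_ranges(start, last, step, fixed):
    """Inclusive (from_rev, to_rev) ranges the import loop would fetch."""
    to_rev = start - 1 if fixed else start
    ranges = []
    while to_rev < last:
        from_rev = to_rev + 1 if fixed else to_rev
        to_rev = min(from_rev + step, last)
        ranges.append((from_rev, to_rev))
    return ranges

# Before the fix, each chunk re-fetches the previous chunk's last revision:
print(chunk_ranges(349000, 351000, 1000, fixed=False))
# [(349000, 350000), (350000, 351000)]

# After the fix, the chunks are disjoint:
print(chunk_ranges(349000, 351000, 1000, fixed=True))
# [(349000, 350000), (350001, 351000)]
```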


* [PATCH] git-svnimport: fix edge revisions double importing
  2007-01-07 18:59           ` Sasha Khapyorsky
@ 2007-01-08  2:22             ` Sasha Khapyorsky
  0 siblings, 0 replies; 55+ messages in thread
From: Sasha Khapyorsky @ 2007-01-08  2:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Chris Lee, Git Mailing List

This fixes a newly introduced bug where the edge revisions of each
incremental import chunk were imported twice.

Signed-off-by: Sasha Khapyorsky <sashak@voltaire.com>
---
 git-svnimport.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/git-svnimport.perl b/git-svnimport.perl
index afbbe63..f1f1a7d 100755
--- a/git-svnimport.perl
+++ b/git-svnimport.perl
@@ -943,10 +943,10 @@ if ($opt_l < $current_rev) {
 print "Processing from $current_rev to $opt_l ...\n" if $opt_v;
 
 my $from_rev;
-my $to_rev = $current_rev;
+my $to_rev = $current_rev - 1;
 
 while ($to_rev < $opt_l) {
-	$from_rev = $to_rev;
+	$from_rev = $to_rev + 1;
 	$to_rev = $from_rev + $repack_after;
 	$to_rev = $opt_l if $opt_l < $to_rev;
 	print "Fetching from $from_rev to $to_rev ...\n" if $opt_v;
-- 
1.5.0.rc0.g2484-dirty


end of thread, other threads:[~2007-01-08  2:15 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-03 23:52 git-svnimport failed and now git-repack hates me Chris Lee
2007-01-04  1:59 ` Linus Torvalds
2007-01-04  2:06   ` Shawn O. Pearce
2007-01-04  2:35     ` Shawn O. Pearce
2007-01-04  2:36       ` Chris Lee
2007-01-04  2:45         ` Shawn O. Pearce
2007-01-04  2:53           ` Chris Lee
2007-01-04  2:57             ` Shawn O. Pearce
2007-01-04  2:58               ` Chris Lee
2007-01-04  3:05                 ` Shawn O. Pearce
2007-01-04  3:06                 ` Chris Lee
2007-01-04  2:16   ` Chris Lee
2007-01-04 17:56     ` Chris Lee
2007-01-04 18:30       ` Linus Torvalds
2007-01-04 18:54         ` Chris Lee
2007-01-04  2:33   ` Eric Wong
2007-01-04  2:40     ` Randal L. Schwartz
2007-01-04  3:13       ` Eric Wong
2007-01-05  2:09     ` [PATCH] git-svn: make --repack work consistently between fetch and multi-fetch Eric Wong
2007-01-04  6:25   ` git-svnimport failed and now git-repack hates me Junio C Hamano
2007-01-04  7:26     ` [PATCH] pack-check.c::verify_packfile(): don't run SHA-1 update on huge data Junio C Hamano
2007-01-04 17:58     ` git-svnimport failed and now git-repack hates me Chris Lee
2007-01-04 20:22       ` Junio C Hamano
2007-01-05 17:19         ` Chris Lee
2007-01-05 19:05           ` Junio C Hamano
2007-01-05 19:33             ` Chris Lee
2007-01-05 19:39               ` Shawn O. Pearce
2007-01-05 20:48                 ` Chris Lee
2007-01-05 21:37                 ` Junio C Hamano
2007-01-05 21:57                   ` Linus Torvalds
2007-01-05 22:18                     ` alan
2007-01-07  0:36                       ` Eric Wong
2007-01-05 22:39                     ` Linus Torvalds
2007-01-05 22:48                       ` Junio C Hamano
2007-01-05 23:00                         ` Linus Torvalds
2007-01-05 23:02                           ` Linus Torvalds
2007-01-05 23:44                           ` Junio C Hamano
2007-01-05 23:59                             ` Linus Torvalds
2007-01-06  0:06                             ` Johannes Schindelin
2007-01-05 23:03                   ` Chris Lee
2007-01-05 23:09                     ` Junio C Hamano
2007-01-05 23:17                       ` Linus Torvalds
2007-01-05 23:58                         ` Junio C Hamano
2007-01-06  0:11                           ` Linus Torvalds
2007-01-06  0:15                             ` Linus Torvalds
2007-01-06  0:23                               ` Junio C Hamano
2007-01-06  1:22                                 ` Linus Torvalds
2007-01-04 19:24   ` Chris Lee
2007-01-04 21:12     ` Linus Torvalds
2007-01-04 21:31   ` Sasha Khapyorsky
2007-01-04 22:04     ` Chris Lee
2007-01-07  0:17       ` [PATCH] git-svnimport: support for incremental import Sasha Khapyorsky
2007-01-07 18:12         ` Chris Lee
2007-01-07 18:59           ` Sasha Khapyorsky
2007-01-08  2:22             ` [PATCH] git-svnimport: fix edge revisions double importing Sasha Khapyorsky
