git.vger.kernel.org archive mirror
* Excruciatingly slow git-svn imports
@ 2008-04-24 18:54 Geert Bosch
  2008-04-24 19:57 ` Steven Grimm
  2008-04-29  7:03 ` Eric Wong
  0 siblings, 2 replies; 9+ messages in thread
From: Geert Bosch @ 2008-04-24 18:54 UTC (permalink / raw)
  To: git@vger.kernel.org List

I'm trying to import a 9.7G, 130K revision svn repository,
but it seems to import only about 6K revisions per day on fast hardware
using a recent git (1.5.5).

This means about 20 days, or more if things slow down as the repo gets
bigger. Are there any tips/tricks on how to most efficiently convert
large repos? I'm using the svn+ssh protocol for accessing the
repository, but the slowness seems due to local inefficiency. An
"strace -fcp <pid>" during one minute gives the following results:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  52.46   21.392640       17607      1215           clone
  47.47   19.358882        3983      4860      3645 execve
   0.05    0.019571          16      1216           wait4
   0.01    0.003944           0     14582      1215 open
   0.01    0.002458           0     14580     12150 access
   0.00    0.000797           0      8500           write
   0.00    0.000694           0     26013           read
   0.00    0.000574           0      3693           munmap
   0.00    0.000513           0     20659           close
   0.00    0.000452           0     21918           mmap
   0.00    0.000353           0      1215           stat
   0.00    0.000234           0     12158      1215 lseek
   0.00    0.000155           0     17013           fstat
   0.00    0.000077           0      6075           mprotect
   0.00    0.000076           0      8511           rt_sigaction
   0.00    0.000074           0      6078      6078 ioctl
   0.00    0.000049           0      2432           unlink
   0.00    0.000033           0      2430           dup2
   0.00    0.000033           0      7293           fcntl
   0.00    0.000022           0      3681           brk
   0.00    0.000022           0      1215           getppid
   0.00    0.000019           0      1215           uname
   0.00    0.000019           0      1215           arch_prctl
   0.00    0.000000           0      1215           lstat
   0.00    0.000000           0      1216           pipe
   0.00    0.000000           0        22           mremap
   0.00    0.000000           0      2431           dup
   0.00    0.000000           0      1215           getcwd
   0.00    0.000000           0      2430           getdents64
------ ----------- ----------- --------- --------- ----------------
100.00   40.781691                196296     24303 total

So, 99.93% of the time seems to be spent in clone/execve
(including actual work done by the forked programs).

In another trace, I found the following execve calls were made:
      22 execve("/homes/bosch/x86_64-linux/bin/git",
       2 execve("/homes/bosch/x86_64-linux/bin/git-commit-tree",
    2842 execve("/homes/bosch/x86_64-linux/bin/git-hash-object",
      22 execve("/opt/gnu/bin/git",
       2 execve("/opt/gnu/bin/git-commit-tree",
    2842 execve("/opt/gnu/bin/git-hash-object",
      22 execve("/opt/local/bin/git",
       2 execve("/opt/local/bin/git-commit-tree",
    2842 execve("/opt/local/bin/git-hash-object",
      22 execve("/opt/local/sbin/git",
       2 execve("/opt/local/sbin/git-commit-tree",
    2842 execve("/opt/local/sbin/git-hash-object",

I don't have git installed in any of /opt/gnu/bin, /opt/local/bin,
or /opt/local/sbin. These three directories just happen to come before
the one containing git in my PATH:

bosch:~/git$ echo $PATH
/opt/gnu/bin:/opt/local/bin:/opt/local/sbin:/homes/bosch/x86_64-linux/bin ...
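The repeated execve attempts in /opt/gnu/bin and friends come from the
$PATH search that exec does for every spawned subprocess: each dead
directory listed before the real git is probed first. A rough sketch of
that search (the helper name here is mine, purely for illustration):

```shell
# Probe each $PATH entry in order, the way execvp() does; every dead
# entry before the real binary adds failed syscalls to every spawn.
find_in_path() (
  IFS=:
  for dir in $PATH; do
    if [ -x "$dir/$1" ]; then
      printf '%s\n' "$dir/$1"
      exit 0
    fi
  done
  exit 1
)
```

With thousands of git-hash-object spawns per run, trimming dead entries
from the front of $PATH removes that per-spawn overhead entirely.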

Before trying to brush up on my Perl and propose patches for this
(I doubt the extra execve calls take much time at all), I was wondering
why we don't open a single stream to git-fast-import and have it do
the heavy lifting. Are there fundamental issues with this?
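For reference, fast-import does consume whole batches of commits over a
single pipe, with no per-object fork+exec. A minimal self-contained
sketch (throwaway repository; the path, date, and message are
hypothetical):

```shell
# One `git fast-import` process receives commits on stdin; each `data`
# line gives the exact byte count of the payload that follows.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" fast-import --quiet <<'EOF'
commit refs/heads/master
committer Geert Bosch <bosch@adacore.com> 1209063240 +0000
data 15
Imported r1234
M 100644 inline hello.txt
data 6
hello
EOF
msg=$(git -C "$repo" log -1 --format=%s master)
echo "$msg"
rm -rf "$repo"
```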

   -Geert

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Excruciatingly slow git-svn imports
  2008-04-24 18:54 Excruciatingly slow git-svn imports Geert Bosch
@ 2008-04-24 19:57 ` Steven Grimm
  2008-04-29  7:11   ` Eric Wong
  2008-04-29  7:03 ` Eric Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Steven Grimm @ 2008-04-24 19:57 UTC (permalink / raw)
  To: Geert Bosch; +Cc: git@vger.kernel.org List

On Apr 24, 2008, at 11:54 AM, Geert Bosch wrote:

> I'm trying to import a 9.7G, 130K revision svn repository
> but it seems to only import about 6K revisions per day on fast
> hardware using a recent git (1.5.5).

I've found that git-svn gets slower as it runs. Try interrupting the  
clone and running "git svn fetch" -- it should pick up where it left  
off and will be MUCH faster if my experience is any indication. When I  
clone the big svn repository at work I usually restart it every 1000  
revisions or so and it finishes in a fraction of the time it takes if  
I let it do everything in a single run.
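That interrupt-and-resume workflow can be scripted; a hedged sketch
(the `retry` wrapper below is my own, not a git command — in practice
you kill the clone and rerun `git svn fetch` until it reaches HEAD):

```shell
# Rerun a command until it exits successfully; `git svn fetch` resumes
# from the last imported revision each time, so no work is lost.
retry() {
  until "$@"; do
    echo "restarting: $*" >&2
    sleep 1
  done
}

# Example (assumption: resuming an interrupted git-svn clone):
# retry git svn fetch
```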

-Steve


* Re: Excruciatingly slow git-svn imports
  2008-04-24 18:54 Excruciatingly slow git-svn imports Geert Bosch
  2008-04-24 19:57 ` Steven Grimm
@ 2008-04-29  7:03 ` Eric Wong
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Wong @ 2008-04-29  7:03 UTC (permalink / raw)
  To: Geert Bosch; +Cc: git@vger.kernel.org List, Adam Roben

Geert Bosch <bosch@adacore.com> wrote:

> Before trying to brush up my Perl and propose patching fixes for this
> (I doubt the extra execve's take much time at all), I was wondering why
> we don't open a single stream to git-fast-import and have it do
> the heavy lifting. Are there fundamental issues with this?

Last I checked, fast-import doesn't allow rereading freshly imported
objects before that particular fast-import instance is finished running.

Since git-svn imports deltas from SVN instead of full files, it often
needs to reread objects it imported in the same run to make use of
those deltas.

However, Adam Roben's been working on some improvements to git-cat-file
that allow git-svn to avoid many fork+exec calls.  The tests and some
code have some outstanding issues, but the code appears to work, and I'm
sure Adam would love to have you test it more for him :)

 http://thread.gmane.org/gmane.comp.version-control.git/80240
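The idea can be seen with git-cat-file's --batch mode: one long-lived
process answers many object lookups over a single pipe instead of one
fork+exec per object. A self-contained sketch using a throwaway
repository (file name and contents are hypothetical):

```shell
# Feed object names on stdin; read "<oid> <type> <size>" plus the
# object body back, all from a single `git cat-file --batch` process.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
echo hello > "$repo/f"
git -C "$repo" add f
git -C "$repo" -c user.name=test -c user.email=test@example.com \
  commit -qm init
oid=$(git -C "$repo" rev-parse HEAD:f)
out=$(printf '%s\n' "$oid" | git -C "$repo" cat-file --batch)
echo "$out"
rm -rf "$repo"
```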

-- 
Eric Wong


* Re: Excruciatingly slow git-svn imports
  2008-04-24 19:57 ` Steven Grimm
@ 2008-04-29  7:11   ` Eric Wong
  2008-05-05  4:29     ` Geert Bosch
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2008-04-29  7:11 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Geert Bosch, git@vger.kernel.org List

Steven Grimm <koreth@midwinter.com> wrote:
> On Apr 24, 2008, at 11:54 AM, Geert Bosch wrote:
> 
> >I'm trying to import a 9.7G, 130K revision svn repository
> >but it seems to only import about 6K revisions per day on fast
> >hardware using a recent git (1.5.5).
> 
> I've found that git-svn gets slower as it runs. Try interrupting the  
> clone and running "git svn fetch" -- it should pick up where it left  
> off and will be MUCH faster if my experience is any indication. When I  
> clone the big svn repository at work I usually restart it every 1000  
> revisions or so and it finishes in a fraction of the time it takes if  
> I let it do everything in a single run.

That's really strange to hear...  The git-svn process itself does not
store much state other than the current revision and the log information
for the next 100 or so revisions it needs to import.

Are you packing the repository?  Which SVN protocol are you using?  Does
memory usage of git-svn stay stable throughout the run?

-- 
Eric Wong


* Re: Excruciatingly slow git-svn imports
  2008-04-29  7:11   ` Eric Wong
@ 2008-05-05  4:29     ` Geert Bosch
  2008-05-06  3:28       ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Geert Bosch @ 2008-05-05  4:29 UTC (permalink / raw)
  To: Eric Wong; +Cc: Steven Grimm, git@vger.kernel.org List


On Apr 29, 2008, at 03:11, Eric Wong wrote:

>> I've found that git-svn gets slower as it runs. Try interrupting the
>> clone and running "git svn fetch" -- it should pick up where it left
>> off and will be MUCH faster if my experience is any indication. When
>> I clone the big svn repository at work I usually restart it every
>> 1000 revisions or so and it finishes in a fraction of the time it
>> takes if I let it do everything in a single run.
>
> That's really strange to hear...  The git-svn process itself does not
> store much state other than the current revision and the log
> information for the next 100 or so revisions it needs to import.
>
> Are you packing the repository?  Which SVN protocol are you using?
> Does memory usage of git-svn stay stable throughout the run?

I found the same. After about 5 days (with maybe 10 break/restarts),
I had a converted repository with all 135K commits and a total size of
just under 1 GB. The last 100K commits took (much?) less than a day;
almost all the time was spent on the earlier ones. Those commits all
seemed to have thousands of files, even though most were probably the
same. I'm sure this repository, which covers 15 years of development
of a multi-million line project, has a lot of tags, and it seemed it
just had to chew through many copies of the complete set of files to
find out that they're all the same.

It's great that git-svn can be restarted so well and doesn't get confused by
uncleanly terminated runs. My final repository is fast and small.
I'm still struggling with how to properly synchronize branches, but that
probably is mostly a matter of user education.

Thanks all for these great tools.

   -Geert


* Re: Excruciatingly slow git-svn imports
  2008-05-05  4:29     ` Geert Bosch
@ 2008-05-06  3:28       ` Eric Wong
  2008-05-06  3:56         ` Avery Pennarun
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2008-05-06  3:28 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Steven Grimm, git@vger.kernel.org List

Geert Bosch <bosch@adacore.com> wrote:
> On Apr 29, 2008, at 03:11, Eric Wong wrote:
> 
> >>I've found that git-svn gets slower as it runs. Try interrupting the
> >>clone and running "git svn fetch" -- it should pick up where it left
> >>off and will be MUCH faster if my experience is any indication.
> >>When I clone the big svn repository at work I usually restart it
> >>every 1000 revisions or so and it finishes in a fraction of the time
> >>it takes if I let it do everything in a single run.
> >
> >That's really strange to hear...  The git-svn process itself does not
> >store much state other than the current revision and the log
> >information for the next 100 or so revisions it needs to import.
> >
> >Are you packing the repository?  Which SVN protocol are you using?
> >Does memory usage of git-svn stay stable throughout the run?
> 
> I found the same. After about 5 days (with maybe 10 break/restarts),
> I had a converted repository with all 135K commits and a total size
> of just under 1 GB. The last 100K commits took (much?) less than a
> day; almost all the time was spent on the earlier ones. Those commits
> all seemed to have thousands of files, even though most were probably
> the same. I'm sure this repository, which covers 15 years of
> development of a multi-million line project, has a lot of tags, and
> it seemed it just had to chew through many copies of the complete set
> of files to find out that they're all the same.

Interesting.  By "These commits seemed all to have thousands of files",
do you mean the first 35K that took up most of the time?  If so, yes,
that's definitely a problem...

git-svn requests a log from SVN containing a list of all paths modified
in each revision.  By default, git-svn only requests log entries for up
to 100 revisions at a time to reduce memory usage.  However, having
thousands of files modified for each revision would still be
problematic, as would having insanely long commit messages.
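That windowing is simple arithmetic; a sketch of the idea (the helper
below is hypothetical — in git-svn the window is set by the real
--log-window-size option, default 100, and each range feeds one SVN log
request):

```shell
# Split first..last into windows of at most `chunk` revisions so the
# log for a huge history never has to be held in memory all at once.
chunked_ranges() {
  start=$1 last=$2 chunk=${3:-100}
  while [ "$start" -le "$last" ]; do
    end=$((start + chunk - 1))
    [ "$end" -le "$last" ] || end=$last
    echo "$start:$end"
    start=$((end + 1))
  done
}

# chunked_ranges 1 250 prints 1:100, 101:200, 201:250
```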

Is this repository public by any chance?  I'd like to be able to take
a look at it in case I have time and have access to decent hardware.
Also, what command-line arguments did you use?

> It's great git-svn can be restarted so well and doesn't get confused
> by uncleanly terminated runs. My final repository is fast and small.
> I'm still struggling with how to properly synchronize branches, but
> that probably is mostly a matter of user education.
> 
> Thanks all for these great tools.

You're welcome, thanks for the feedback!  Restartability in git-svn
is one of the things I focused on from the beginning.

-- 
Eric Wong


* Re: Excruciatingly slow git-svn imports
  2008-05-06  3:28       ` Eric Wong
@ 2008-05-06  3:56         ` Avery Pennarun
  2008-05-06  4:25           ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Avery Pennarun @ 2008-05-06  3:56 UTC (permalink / raw)
  To: Eric Wong; +Cc: Geert Bosch, Steven Grimm, git@vger.kernel.org List

On 5/5/08, Eric Wong <normalperson@yhbt.net> wrote:
> Interesting.  By  "These commits seemed all to have thousands of files",
>  you mean the first 35K that took up most of the time?  If so, yes,
>  that's definitely a problem...
>
>  git-svn requests a log from SVN containing a list of all paths modified
>  in each revision.  By default, git-svn only requests log entries for up
>  to 100 revisions at a time to reduce memory usage.  However, having
>  thousands of files modified for each revision would still be
>  problematic, as would having insanely long commit messages.

On my system, any branch that was created using "svn cp" of a toplevel
directory seems to cause git-svn to (rather slowly) download every
single file in the entire branch for the first commit on that branch,
giving a symptom that sounds a lot like the above "commits with
thousands of files".  I assumed this was just an intentional design
decision in git-svn, to be slow and safe instead of fast and loose.
Is it actually supposed to do something smarter than that?

Thanks,

Avery


* Re: Excruciatingly slow git-svn imports
  2008-05-06  3:56         ` Avery Pennarun
@ 2008-05-06  4:25           ` Eric Wong
  2008-05-06 11:23             ` Geert Bosch
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2008-05-06  4:25 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Geert Bosch, Steven Grimm, git@vger.kernel.org List

Avery Pennarun <apenwarr@gmail.com> wrote:
> On 5/5/08, Eric Wong <normalperson@yhbt.net> wrote:
> > Interesting.  By  "These commits seemed all to have thousands of files",
> >  you mean the first 35K that took up most of the time?  If so, yes,
> >  that's definitely a problem...
> >
> >  git-svn requests a log from SVN containing a list of all paths modified
> >  in each revision.  By default, git-svn only requests log entries for up
> >  to 100 revisions at a time to reduce memory usage.  However, having
> >  thousands of files modified for each revision would still be
> >  problematic, as would having insanely long commit messages.
> 
> On my system, any branch that was created using "svn cp" of a toplevel
> directory seems to cause git-svn to (rather slowly) download every
> single file in the entire branch for the first commit on that branch,
> giving a symptom that sounds a lot like the above "commits with
> thousands of files".  I assumed this was just an intentional design
> decision in git-svn, to be slow and safe instead of fast and loose.
> Is it actually supposed to do something smarter than that?

When using "svn cp" on a top-level directory, it *should*
just show up as a single file change in the log entry.
Something like:

  A /project/branch/my-new-branch (from /project/trunk:1234)

This would not take much memory at all.
However, I've also occasionally seen stuff like this:

  A /project/branch/my-new-branch
  A /project/branch/my-new-branch/file1 (from /project/trunk/file1:1234)
  A /project/branch/my-new-branch/file2 (from /project/trunk/file2:1234)
  A /project/branch/my-new-branch/file3 (from /project/trunk/file3:1234)
  .... many more files and directories along the same lines ...

This is what I suspect Geert is seeing in his repository and causing
problems.  Perhaps something caused by cvs2svn importing those tags into
SVN originally?
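A rough way to tell those two log shapes apart from `svn log -v`
output: a clean root copy has exactly one "(from ...)" entry, a
cvs2svn-style explosion has one per file. A sketch (function name and
sample lines are hypothetical):

```shell
# Count "(from ...)" copyfrom entries in a revision's changed-path
# list; exactly one at the branch root means a clean `svn cp`.
classify_copy() {
  n=$(printf '%s\n' "$1" | grep -c '(from ')
  if [ "$n" -eq 1 ]; then
    echo clean-root-copy
  else
    echo per-file-copies
  fi
}
```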


But the symptom you're seeing, with git-svn downloading every file,
seems to be the result of using a pre-1.4.3 version of the Perl SVN
bindings, which lacked a working do_switch() function.  I fall back to
using do_update() and checking out a new tree for SVN 1.4.2 and before.
So yes, I'm definitely safe, slow and _lazy_ by falling back to
do_update() instead of doing something fancy to work around something
that's already fixed in SVN :)

-- 
Eric Wong


* Re: Excruciatingly slow git-svn imports
  2008-05-06  4:25           ` Eric Wong
@ 2008-05-06 11:23             ` Geert Bosch
  0 siblings, 0 replies; 9+ messages in thread
From: Geert Bosch @ 2008-05-06 11:23 UTC (permalink / raw)
  To: Eric Wong; +Cc: Avery Pennarun, Steven Grimm, git@vger.kernel.org List


On May 6, 2008, at 00:25, Eric Wong wrote:

> When using "svn cp" on a top-level directory, it *should*
> just show up as a single file change in the log entry.
> Something like:
>
>  A /project/branch/my-new-branch (from /project/trunk:1234)
>
> This would not take much memory at all.
> However, I've also occasionally seen stuff like this:
>
>  A /project/branch/my-new-branch
>  A /project/branch/my-new-branch/file1 (from /project/trunk/file1:1234)
>  A /project/branch/my-new-branch/file2 (from /project/trunk/file2:1234)
>  A /project/branch/my-new-branch/file3 (from /project/trunk/file3:1234)
>  .... many more files and directories along the same lines ...

This is exactly what I'm experiencing.

> This is what I suspect Geert is seeing in his repository and causing
> problems.  Perhaps something caused by cvs2svn importing those tags
> into SVN originally?

Yes, most likely. BTW, the conversion from CVS to SVN took about a day
on slightly slower hardware.

   -Geert

