svn to git, N-squared?


* svn to git, N-squared?
@ 2006-06-12  2:02 Jon Smirl
  2006-06-12  3:31 ` Linus Torvalds
  2006-06-12  4:29 ` Eric Wong
  0 siblings, 2 replies; 23+ messages in thread
From: Jon Smirl @ 2006-06-12  2:02 UTC (permalink / raw)
  To: git

I have Mozilla CVS in a SVN repository. I've been using git-svnimport
to import it. This time I am letting it run to completion; but the
import has been running for four days now and it is only up to 2004.
The import task is stable at 570MB and it is using about 50% of my
CPU. It is constantly spawning off git write-tree, read-tree,
hash-object, update-index. It is not doing excessive disk activity.

The import seems to be getting n-squared slower. It is still making
forward progress but the progress seems to be getting slower and
slower.

It looks like it is doing write-tree, read-tree, hash-object,
update-index once or more per change set. If these commands are
n-proportional and they are getting run n times, then this is an
n-squared process. Projecting this out, the import may take 10 days or
more to completely finish.
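
As a rough illustration of the reasoning above (a sketch only, not taken from git-svnimport; `total_cost` is a hypothetical helper): if the work for change set k is proportional to k, the total work for n change sets is 1 + 2 + ... + n = n(n+1)/2, so doubling the number of change sets roughly quadruples the total.

```shell
# total_cost N: total work if step k costs k units, i.e. 1 + 2 + ... + N
total_cost() {
    n=$1
    echo $(( n * (n + 1) / 2 ))
}

total_cost 1000   # prints 500500
total_cost 2000   # prints 2001000, about four times as much
```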

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: svn to git, N-squared?
  2006-06-12  2:02 Jon Smirl
@ 2006-06-12  3:31 ` Linus Torvalds
  2006-06-12  3:39   ` Jon Smirl
  2006-06-12 16:18   ` Randal L. Schwartz
  2006-06-12  4:29 ` Eric Wong
  1 sibling, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12  3:31 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git



On Sun, 11 Jun 2006, Jon Smirl wrote:
>
> I have Mozilla CVS in a SVN repository. I've been using git-svnimport
> to import it. This time I am letting it run to completion; but the
> import has been running for four days now and it is only up to 2004.
> The import task is stable at 570MB and it is using about 50% of my
> CPU. It is constantly spawning off git write-tree, read-tree,
> hash-object, update-index. It is not doing excessive disk activity.

This sounds like _exactly_ what happens if you don't repack occasionally. 
Especially if you are using a filesystem without hashed filename lookup, 
but it's true to some degree even with that - the filesystem tends to end 
up spending tons of time in kernel space, trying to find a place to put 
new objects.

I don't think git-svnimport has the repack logic in it, so that would be 
it.

You can probably stop it with ^Z, do a "git repack -a -d", and then let it 
continue.

(The only reason for stopping it is actually to let "git repack" remove 
most of the object directories - many filesystems, including ext3, don't 
even speed up all that much if the directories are emptied after they've 
grown big, and it's much better if the object directories get totally 
removed and re-created)

			Linus


* Re: svn to git, N-squared?
  2006-06-12  3:31 ` Linus Torvalds
@ 2006-06-12  3:39   ` Jon Smirl
  2006-06-12  4:02     ` Linus Torvalds
  2006-06-12 16:18   ` Randal L. Schwartz
  1 sibling, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12  3:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On 6/11/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Sun, 11 Jun 2006, Jon Smirl wrote:
> >
> > I have Mozilla CVS in a SVN repository. I've been using git-svnimport
> > to import it. This time I am letting it run to completion; but the
> > import has been running for four days now and it is only up to 2004.
> > The import task is stable at 570MB and it is using about 50% of my
> > CPU. It is constantly spawning off git write-tree, read-tree,
> > hash-object, update-index. It is not doing excessive disk activity.
>
> This sounds like _exactly_ what happens if you don't repack occasionally.
> Especially if you are using a filesystem without hashed filename lookup,
> but it's true to some degree even with that - the filesystem tends to end
> up spending tons of time in kernel space, trying to find a place to put
> new objects.
>
> I don't think git-svnimport has the repack logic in it, so that would be
> it.
>
> You can probably stop it with ^Z, do a "git repack -a -d", and then let it
> continue.

I have it stopped and I am running the repack.
There are 1.27M files in my .git directory

I ordered 2GB more RAM which should be here Tuesday.

> (The only reason for stopping it is actually to let "git repack" remove
> most of the object directories - many filesystems, including ext3, don't
> even speed up all that much if the directories are emptied after they've
> grown big, and it's much better if the object directories get totally
> removed and re-created)
>
>                         Linus
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12  3:39   ` Jon Smirl
@ 2006-06-12  4:02     ` Linus Torvalds
  2006-06-12 19:04       ` Yakov Lerner
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12  4:02 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

On Sun, 11 Jun 2006, Jon Smirl wrote:
> 
> I have it stopped and I am running the repack.
> There are 1.27M files in my .git directory

Yeah, that would do it. That's ~5000 files per object directory, so I 
assume that your directories are 200+kB in size, and for every new object 
added, you'll basically have to traverse the old directory fully in order 
to find an empty place for it (and without hashing, you'll traverse it 
_twice_ - first to look for it, then to look for the empty space).
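
The ~5000 figure matches 1.27M files spread over the 256 two-hex-digit fan-out directories under .git/objects (1270000 / 256 is about 4960), assuming nearly all of those files are loose objects. A minimal sketch for checking the actual distribution; `count_loose_objects` is a hypothetical helper, not a git command:

```shell
# count_loose_objects [OBJECTS_DIR]
# Prints "<file count> <dir name>" for each two-hex-digit fan-out directory.
# Other directories (pack, info) do not match the pattern and are skipped.
count_loose_objects() {
    objdir=${1:-.git/objects}
    for d in "$objdir"/[0-9a-f][0-9a-f]; do
        [ -d "$d" ] || continue
        n=$(find "$d" -type f | wc -l | tr -d ' ')
        echo "$n ${d##*/}"
    done
}

# Usage, from the top of a repository:
#   count_loose_objects .git/objects | sort -n | tail
#   echo $(( 1270000 / 256 ))   # prints 4960
```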

Btw, after repacking, if it still has lots of loose objects, and you still 
have several directories that are huge (because there are pending objects 
for a commit that didn't happen yet when you ^Z'd the svnimport), you'll 
literally get better performance if you do something like

	for i in ??
	do
		cp -r $i $i.new
		rm -rf $i
		mv $i.new $i
	done

in your .git/objects/ directory (CAREFUL! Any script that does "rm -rf" 
should be double- and triple-checked for sanity! ;)

That should make sure that you don't still have huge directories.

(And yes, this is a real problem at least with ext3).

The git cvsimporter ends up repacking the archive every thousand commits. 
That's just a random number, but it's indicative of what we did there to 
handle large imports. I don't think anybody has done a large import using 
the git-svnimport before, so you're in new territory which explains some 
of the teething problems.
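
The periodic repack described above could be sketched roughly like this. This is not the actual git-cvsimport logic; `should_repack` and `commits_done` are hypothetical names, and 1000 is just the interval mentioned above.

```shell
# should_repack COMMIT_COUNT [INTERVAL]
# Succeeds (exit status 0) when COMMIT_COUNT is a positive multiple of
# INTERVAL (default 1000).
should_repack() {
    count=$1
    interval=${2:-1000}
    [ "$count" -gt 0 ] && [ $(( count % interval )) -eq 0 ]
}

# Hypothetical use inside an importer's per-commit loop:
#   if should_repack "$commits_done"; then
#       git repack -a -d
#   fi
```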

		Linus


* Re: svn to git, N-squared?
  2006-06-12  2:02 Jon Smirl
  2006-06-12  3:31 ` Linus Torvalds
@ 2006-06-12  4:29 ` Eric Wong
  1 sibling, 0 replies; 23+ messages in thread
From: Eric Wong @ 2006-06-12  4:29 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> I have Mozilla CVS in a SVN repository. I've been using git-svnimport
> to import it. This time I am letting it run to completion; but the
> import has been running for four days now and it is only up to 2004.
> The import task is stable at 570MB and it is using about 50% of my
> CPU. It is constantly spawning off git write-tree, read-tree,
> hash-object, update-index. It is not doing excessive disk activity.

SVN itself seems to get much slower as you get towards newer revisions
in a repository (FSFS) with lots of history.  I've been experimenting a
bit with a local copy of the gcc repo from November and git-svn SUCKED
at importing it (it took over a week and I cancelled it out of
frustration).  I started repacking too, but it didn't help.  Much
of the performance deficiency was the svn subprocess being extremely
slow at updating.

I also tried git-svnimport, of course, but I only had 512M on that
machine and the machine became unusable due to heavy swapping.

> The import seems to be getting n-squared slower. It is still making
> forward progress but the progress seems to be getting slower and
> slower.
> 
> It looks like it is doing write-tree, read-tree, hash-object,
> update-index once or more per change set. If these commands are
> n-proportional and they are getting run n times, then this is an
> n-squared process. Projecting this out, the import may take 10 days or
> more to completely finish.

I'm working on some improvements to git-svn to make it a bit more
spiffy.

-- 
Eric Wong


* Re: svn to git, N-squared?
@ 2006-06-12  4:39 linux
  2006-06-12 15:32 ` Jon Smirl
  0 siblings, 1 reply; 23+ messages in thread
From: linux @ 2006-06-12  4:39 UTC (permalink / raw)
  To: git, jonsmirl, torvalds

>	for i in ??
>	do
>		cp -r $i $i.new
>		rm -rf $i
>		mv $i.new $i
>	done
>
> in your .git/objects/ directory (CAREFUL! Any script that does "rm -rf" 
> should be double- and triple-checked for sanity! ;)

Insanity is copying the data rather than just the file name.  Git is
good about not reading unnecessary files, and anything necessary should
be cached, so on-disk fragmentation is not a concern.

rmdir --ignore-fail-on-non-empty ??	# Probably unnecessary.
for i in ??
do
	mkdir $i.new
	mv $i/* $i.new
	rmdir $i
	mv $i.new $i
done
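
A variation on the loop above, wrapped in a function so it can be tried on one directory at a time. This is only a sketch and not what was posted: `rebuild_dir` is a hypothetical name, variables are quoted, and entries are moved one by one so that a directory with thousands of files does not produce an overlong `mv` argument list.

```shell
# rebuild_dir DIR
# Recreate DIR as a fresh directory: move its entries (names only, no data
# copy) into a new sibling, remove the old directory, rename the sibling.
rebuild_dir() {
    dir=$1
    new="$dir.new"
    mkdir "$new" || return 1
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # unmatched glob: DIR was empty
        mv "$f" "$new"/ || return 1
    done
    rmdir "$dir" || return 1         # fails if anything was left behind
    mv "$new" "$dir"
}

# Usage, in .git/objects/ with the import stopped:
#   for i in ??; do rebuild_dir "$i"; done
```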


* Re: svn to git, N-squared?
  2006-06-12  4:39 svn to git, N-squared? linux
@ 2006-06-12 15:32 ` Jon Smirl
  2006-06-12 15:45   ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 15:32 UTC (permalink / raw)
  To: linux@horizon.com; +Cc: git, torvalds

On 12 Jun 2006 00:39:49 -0400, linux@horizon.com <linux@horizon.com> wrote:
> Insanity is copying the data rather than just the file name.  Git is
> good about not reading unnecessary files, and anything necessary should
> be cached, so on-disk fragmentation is not a concern.

I've run a pack and I moved the objects to new directories. Directory
is 746M with 64K files now.

I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
any way to tell what it is doing in the kernel for so much time?

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0 599128  24712  38196 247008    0    0     0     0  451   382 12 39 48  0  0
 1  0 599128  24836  38196 246872    0    0     0     4  413   354 15 38 48  0  0
 1  0 599128  24960  38212 246856    0    0     0    64  453   390 15 37 48  0  0
 1  0 599128  24960  38212 246856    0    0     0     0  414   367 12 40 49  0  0
 1  0 599128  23504  38212 248216    0    0     0     0  448   365 13 39 48  0  0
 1  0 599128  24156  38212 247604    0    0     0     0  407   355 13 39 49  0  0
 1  0 599128  25240  38212 246652    0    0     0     0  446   390 13 39 48  0  0
 1  0 599128  25240  38224 246572    0    0     4    48  415   418 12 40 47  0  0
 1  0 599128  25116  38232 246496    0    0     0    12  452   432 12 40 48  0  0

Still doesn't seem to be making much forward progress.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 15:32 ` Jon Smirl
@ 2006-06-12 15:45   ` Linus Torvalds
  2006-06-12 15:55     ` Jon Smirl
  2006-06-12 16:16     ` Jon Smirl
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 15:45 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git

On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
> I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> any way to tell what it is doing in the kernel for so much time?

oprofile will tell you.

I don't see why it would spend a lot of time in the kernel, unless it's 
the SVN part that does a ton of reads or something. git should have almost 
no kernel footprint apart from the individual objects creation/reading, so 
once it's repacked, I generally see very little system time.

What does top say? (Ie can you see _which_ process spends time in the 
kernel?)
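
For a single long-lived process on Linux, the user/kernel split can also be read from /proc, assuming /proc is mounted: fields 14 and 15 of /proc/<pid>/stat hold the user and system CPU time in clock ticks. A minimal sketch (`cpu_split` is a hypothetical helper; it does not account for short-lived child processes):

```shell
# cpu_split: read one /proc/<pid>/stat line on stdin and print the user
# and system CPU time, both in clock ticks.
cpu_split() {
    IFS= read -r line || [ -n "$line" ] || return 1
    # The command name (field 2) is parenthesised and may contain spaces,
    # so drop everything up to the last ") " before splitting on blanks.
    rest=${line##*") "}
    set -- $rest
    # $1 is now field 3 (state), so field 14 is ${12} and field 15 is ${13}.
    echo "utime=${12} stime=${13}"
}

# Usage, with the PID shown by top:
#   cpu_split < /proc/14525/stat
```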

		Linus


* Re: svn to git, N-squared?
  2006-06-12 15:45   ` Linus Torvalds
@ 2006-06-12 15:55     ` Jon Smirl
  2006-06-12 16:12       ` Linus Torvalds
  2006-06-12 16:16     ` Jon Smirl
  1 sibling, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 15:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> > I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> > any way to tell what it is doing in the kernel for so much time?
>
> oprofile will tell you.
>
> I don't see why it would spend a lot of time in the kernel, unless it's
> the SVN part that does a ton of reads or something. git should have almost
> no kernel footprint apart from the individual objects creation/reading, so
> once it's repacked, I generally see very little system time.
>
> What does top say? (Ie can you see _which_ process spends time in the
> kernel?)

top - 11:54:32 up 4 days,  1:27,  5 users,  load average: 1.85, 1.74, 1.55
Tasks: 135 total,   2 running, 133 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.7% us, 35.3% sy,  0.0% ni, 49.3% id,  0.0% wa,  0.2% hi,  0.5% si,  0
Mem:   1035740k total,  1020836k used,    14904k free,    18368k buffers
Swap: 118222276k total,   645124k used, 117577152k free,   183172k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
20864 jonsmirl  16   0  2120 1024  788 R    1  0.1   0:00.08 top
 2436 root      15   0 71184  28m 6100 S    0  2.8 119:13.55 Xorg
    1 root      16   0  1992  340  312 S    0  0.0   0:00.79 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:01.42 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 15:55     ` Jon Smirl
@ 2006-06-12 16:12       ` Linus Torvalds
  2006-06-12 16:22         ` Jon Smirl
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 16:12 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git

On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport

Hard to tell, it's obviously got short-lived processes there too that it's 
not showing, but equally obviously that svnimport script itself is 
spending an alarming amount of CPU time. I don't think it should do that 
much processing, but since it's written in perl, I can't read it.

Are there any other directories that seem to be growing (eg some temp-file 
directory where the old files aren't cleaned away?). I can't imagine what 
else it could be doing in kernel space than simply some silly filesystem 
operation, but dang it all, Linux filesystems are usually very efficient 
indeed, unless we're talking huge directories (and if it's not the git 
object directory any more, it must be something else).

At least with the cvs importer I have _some_ clue what it's doing, since I 
wrote an earlier version myself (very different, but at least I know what 
the operations are). SVN has always just confused me, and I have no idea 
what svnimport does, so I think I'll have to defer to somebody who 
actually knows the code.

Smurf, have you looked at any larger repositories?

		Linus


* Re: svn to git, N-squared?
  2006-06-12 15:45   ` Linus Torvalds
  2006-06-12 15:55     ` Jon Smirl
@ 2006-06-12 16:16     ` Jon Smirl
  1 sibling, 0 replies; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 16:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> > I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> > any way to tell what it is doing in the kernel for so much time?
>
> oprofile will tell you.

I don't have profiling turned on in the kernel. I've turned it on so
I'll pick it up next time I reboot.
I'll kill everything and restart when my new RAM arrives tomorrow.

Hopefully the SVN import will finish before then but it doesn't look likely.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12  3:31 ` Linus Torvalds
  2006-06-12  3:39   ` Jon Smirl
@ 2006-06-12 16:18   ` Randal L. Schwartz
  2006-06-12 16:25     ` Randal L. Schwartz
  1 sibling, 1 reply; 23+ messages in thread
From: Randal L. Schwartz @ 2006-06-12 16:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git

>>>>> "Linus" == Linus Torvalds <torvalds@osdl.org> writes:

Linus> This sounds like _exactly_ what happens if you don't repack
Linus> occasionally.  Especially if you are using a filesystem without hashed
Linus> filename lookup, but it's true to some degree even with that - the
Linus> filesystem tends to end up spending tons of time in kernel space,
Linus> trying to find a place to put new objects.

I'm using git-svn to do a similar thing with an 11K-commit history.  It's now 4
days running, and yes, I'm repacking and deleting empty dirs every 200-300
commits, but I'm only up to commit 4000 or so.  At this rate, I *may* finish
by sometime next week. :(

However, I notice one thing that can't be good: .git/git-svn/revs has one file
per revision.  Yes, I'll end up with 11000 files in a single directory.  Ugh.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


* Re: svn to git, N-squared?
  2006-06-12 16:12       ` Linus Torvalds
@ 2006-06-12 16:22         ` Jon Smirl
  2006-06-12 16:32           ` Jon Smirl
  2006-06-12 16:41           ` Linus Torvalds
  0 siblings, 2 replies; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 16:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> > 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
>
> Hard to tell, it's obviously got short-lived processes there too that it's
> not showing, but equally obviously that svnimport script itself is
> spending an alarming amount of CPU time. I don't think it should do that
> much processing, but since it's written in perl, I can't read it.
>
> Are there any other directories that seem to be growing (eg some temp-file
> directory where the old files aren't cleaned away?). I can't imagine what
> else it could be doing in kernel space than simply some silly filesystem
> operation, but dang it all, Linux filesystems are usually very efficient
> indeed, unless we're talking huge directories (and if it's not the git
> object directory any more, it must be something else).

64 files in tmp.
But the SVN repository itself has 411,000 files in it. Split between
two directories.

Is there some pack equivalent for svn that I haven't found yet?

> At least with the cvs importer I have _some_ clue what it's doing, since I
> wrote an earlier version myself (very different, but at least I know what
> the operations are). SVN has always just confused me, and I have no idea
> what svnimport does, so I think I'll have to defer to somebody who
> actually knows the code.
>
> Smurf, have you looked at any larger repositories?
>
>                 Linus
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 16:18   ` Randal L. Schwartz
@ 2006-06-12 16:25     ` Randal L. Schwartz
  0 siblings, 0 replies; 23+ messages in thread
From: Randal L. Schwartz @ 2006-06-12 16:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git

>>>>> "Randal" == Randal L Schwartz <merlyn@stonehenge.com> writes:

>>>>> "Linus" == Linus Torvalds <torvalds@osdl.org> writes:
Linus> This sounds like _exactly_ what happens if you don't repack
Linus> occasionally.  Especially if you are using a filesystem without hashed
Linus> filename lookup, but it's true to some degree even with that - the
Linus> filesystem tends to end up spending tons of time in kernel space,
Linus> trying to find a place to put new objects.

Randal> I'm using git-svn to do a similar thing with an 11K-commit history.  It's now 4
Randal> days running, and yes, I'm repacking and deleting empty dirs every 200-300
Randal> commits, but I'm only up to commit 4000 or so.  At this rate, I *may* finish
Randal> by sometime next week. :(

Randal> However, I notice one thing that can't be good: .git/git-svn/revs has one file
Randal> per revision.  Yes, I'll end up with 11000 files in a single directory.  Ugh.

Another contributing factor is that there's 2500 files in the repo (at
revision 3931).  I was recording 20 commits a minute in the early part of the
cycle, and now I'm down to 1 commit every two minutes.  Doing a bit of
back-of-the-scribbled-on-envelope calcs, I won't be finished for
another two weeks or so. :(

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


* Re: svn to git, N-squared?
  2006-06-12 16:22         ` Jon Smirl
@ 2006-06-12 16:32           ` Jon Smirl
  2006-06-12 16:57             ` Linus Torvalds
  2006-06-12 16:41           ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 16:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> >
> >
> > On Mon, 12 Jun 2006, Jon Smirl wrote:
> > >
> > >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> > > 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
> >
> > Hard to tell, it's obviously got short-lived processes there too that it's
> > not showing, but equally obviously that svnimport script itself is
> > spending an alarming amount of CPU time. I don't think it should do that
> > much processing, but since it's written in perl, I can't read it.
> >
> > Are there any other directories that seem to be growing (eg some temp-file
> > directory where the old files aren't cleaned away?). I can't imagine what
> > else it could be doing in kernel space than simply some silly filesystem
> > operation, but dang it all, Linux filesystems are usually very efficient
> > indeed, unless we're talking huge directories (and if it's not the git
> > object directory any more, it must be something else).
>
> 64 files in tmp.
> But the SVN repository itself has 411,000 files in it. Split between
> two directories.

I'm doing all of this on ext3. I have plenty of free disk space so I
can make another partition and switch to a new file system after I
install the new RAM. What would be the best one to try? Doing that
would provide a data point to determine if this is a problem with file
system performance or the misuse of file systems.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 16:22         ` Jon Smirl
  2006-06-12 16:32           ` Jon Smirl
@ 2006-06-12 16:41           ` Linus Torvalds
  2006-06-12 16:44             ` Jon Smirl
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 16:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git

On Mon, 12 Jun 2006, Jon Smirl wrote:
>
> 64 files in tmp.
> But the SVN repository itself has 411,000 files in it. Split between
> two directories.

Ouch. That sounds like it. 

> Is there some pack equivalent for svn that I haven't found yet?

Is this literally what SVN does normally? That's just insane. I mean, even 
git tried to at least hash out the files (and yeah, admittedly even that 
worked less well than I was hoping for, but I at least fixed it within 
just a few weeks through the pack mechanism).

Or is that 411,000 files a result of how git-svnimport does things, rather 
than some basic SVN approach to life: does it perhaps end up checking out 
each file under an individual temporary name?

			Linus


* Re: svn to git, N-squared?
  2006-06-12 16:41           ` Linus Torvalds
@ 2006-06-12 16:44             ` Jon Smirl
  2006-06-12 17:08               ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 16:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > Is there some pack equivalent for svn that I haven't found yet?
>
> Is this literally what SVN does normally? That's just insane. I mean, even
> git tried to at least hash out the files (and yeah, admittedly even that
> worked less well than I was hoping for, but I at least fixed it within
> just a few weeks through the pack mechanism).
>
> Or is that 411,000 files a result of how git-svnimport does things, rather
> than some basic SVN approach to life: does it perhaps end up checking out
> each file under an individual temporary name?

The svn repository was built by cvs2svn, none of the git tools were involved.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 16:32           ` Jon Smirl
@ 2006-06-12 16:57             ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 16:57 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git



On Mon, 12 Jun 2006, Jon Smirl wrote:
> > 
> > 64 files in tmp.
> > But the SVN repository itself has 411,000 files in it. Split between
> > two directories.
> 
> I'm doing all of this on ext3. I have plenty of free disk space so I
> can make another partition and switch to a new file system after I
> install the new RAM. What would be the best one to try? Doing that
> would provide a data point to determine if this is a problem with file
> system performance or the misuse of file systems.

I'm sure there are better filesystems to try for this kind of insane 
scenario, but at the same time, I really cannot imagine that the 411,000 
files is a "normal" thing. There _must_ be some way to have SVN not do 
that in the first place (or git-svnimport).

Is this what happened when the SVN people started using fsfs? 

			Linus


* Re: svn to git, N-squared?
  2006-06-12 16:44             ` Jon Smirl
@ 2006-06-12 17:08               ` Linus Torvalds
  2006-06-12 18:06                 ` Jon Smirl
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 17:08 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git

On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
> The svn repository was built by cvs2svn, none of the git tools were involved.

Ok, so that part is purely a SVN issue.

Having that many files in a single directory (or two) is a total disaster. 
That said, it works well enough if you don't create new files very often 
(and _preferably_ don't look them up either, although that is effectively 
helped by indexing). I _suspect_ that 

 - the "cvs->svn" import process was probably optimized so that it did one 
   file at a time (your "eight stages" description certainly sounds as if 
   it could do it), and in that case it's entirely possible that that can 
   be done efficiently (ie you still do file creates and lookups in an 
   increasingly big directory, but you do it only _once_ per file, rather 
   than look up old files all the time). So your lookup ratio would be 1:1 
   with the files.

   Doing a git-svnimport would then do basically random lookups in that 
   _huge_ directory, and instead of reading the files one at a time (and 
   fully) and never again, I assume it opens them, reads one revision, 
   closes it, and then goes on to the next revision, so it will have a 
   much higher lookup ratio (you'd look up every file several times).

 - I suspect the SVN people must be hurting for performance themselves. I 
   guess they don't expect to be able to do 5-10 commits per second, the 
   way git was designed to do. So they optimized the cvs import part, but 
   their actual regular live usage is probably hitting this same directory 
   inefficiency.

Of course, the old SVN Berkeley DB usage was probably even worse (not in 
system time, but I'd expect the access patterns within the BDB file to be 
pretty nasty, and probably a lot of user time spent seeking around it). 
But in this particular case, it might even have been better.

Maybe we could teach the SVN people about pack-files? ;)

			Linus


* Re: svn to git, N-squared?
  2006-06-12 17:08               ` Linus Torvalds
@ 2006-06-12 18:06                 ` Jon Smirl
  2006-06-12 19:00                   ` Jon Smirl
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 18:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Having that many files in a single directory (or two) is a total disaster.
> That said, it works well enough if you don't create new files very often
> (and _preferably_ don't look them up either, although that is effectively
> helped by indexing). I _suspect_ that

I posted to the svn list; they said that 220K files is normal. They told
me to turn on the ext2 dir_index option. Checking my system, I see that
none of my partitions have it turned on, so it must not be the default
for FC5.

I have to unmount the drive to convert existing directories. I can
try doing the file move trick while the process is running, since
new directories will use it.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: svn to git, N-squared?
  2006-06-12 18:06                 ` Jon Smirl
@ 2006-06-12 19:00                   ` Jon Smirl
  0 siblings, 0 replies; 23+ messages in thread
From: Jon Smirl @ 2006-06-12 19:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git

On 6/12/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > Having that many files in a single directory (or two) is a total disaster.
> > That said, it works well enough if you don't create new files very often
> > (and _preferably_ don't look them up either, although that is effectively
> > helped by indexing). I _suspect_ that
>
> I posted to the svn list; they said that 220K files is normal. They told
> me to turn on the ext2 dir_index option. Checking my system, I see that
> none of my partitions have it turned on, so it must not be the default
> for FC5.
>
> I have to unmount the drive to convert existing directories. I can
> try doing the file move trick while the process is running, since
> new directories will use it.

I converted the ext3 directories to dir_index on the fly using the
move trick. Switching the directory index makes it look like it is
spending even more time in the kernel.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0 636188  22380  19176 157200    0    0     0    52  436   415 13 40 48  0  0
 1  0 636188  22504  19176 157200    0    0     0     0  430   373 13 38 49  0  0
 1  0 636188  22628  19176 157064    0    0     0     0  433   380 12 39 49  0  0
 1  0 636188  22628  19184 157056    0    0     0    20  434   390 12 38 49  0  0
 1  0 636188  22628  19184 156920    0    0     0     0  431   376 11 40 49  0  0
 1  0 636188  22752  19192 156912    0    0     0    48  437   376 12 40 49  0  0
 1  0 636188  22876  19192 156912    0    0     0     0  430   386 11 40 49  0  0
 1  0 636188  22752  19192 156776    0    0     0     0  431   370 10 41 49  0  0
 1  0 636188  23016  19192 156776    0    0     8     0  422   500 22 40 37  2  0
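
To put a number on "time in the kernel", the nine vmstat samples above
can be averaged. This is only a rough sketch: the tuples are the us, sy,
id, wa and st columns copied by hand from the output, and the helper
name column_mean is made up for illustration.

```python
# (us, sy, id, wa, st) percentages from the nine vmstat samples above.
CPU_SAMPLES = [
    (13, 40, 48, 0, 0),
    (13, 38, 49, 0, 0),
    (12, 39, 49, 0, 0),
    (12, 38, 49, 0, 0),
    (11, 40, 49, 0, 0),
    (12, 40, 49, 0, 0),
    (11, 40, 49, 0, 0),
    (10, 41, 49, 0, 0),
    (22, 40, 37, 2, 0),
]


def column_mean(samples, index):
    """Average one CPU column (0 = us, 1 = sy, ...) over all samples."""
    return sum(row[index] for row in samples) / len(samples)


mean_user = column_mean(CPU_SAMPLES, 0)
mean_system = column_mean(CPU_SAMPLES, 1)
# Ratio of kernel time to user time over the sampled interval.
system_to_user = mean_system / mean_user

print(f"user {mean_user:.1f}%  system {mean_system:.1f}%  "
      f"ratio {system_to_user:.2f}")
```

In these samples system time is roughly three times user time.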

The size of the svn directories went from 3.2MB to 4.4MB after they
were converted to ext3 indexed mode.

I'll get oprofile running when I do a reboot.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
  2006-06-12  4:02     ` Linus Torvalds
@ 2006-06-12 19:04       ` Yakov Lerner
  2006-06-12 19:17         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Yakov Lerner @ 2006-06-12 19:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Sun, 11 Jun 2006, Jon Smirl wrote:
> > I have it stopped and I am running the repack.
> > There are 1.27M files in my .git directory
> Yeah, that would do it. That's ~5000 files per object directory, so I
> assume that your directories are 200+kB in size, and for every new object
> added, you'll basically have to traverse the old directory fully in order
> to find an empty place for it

Is this related to the 1-level dir tree for objects (12/object)
vs. a 2-level dir tree (12/34/object)? Does git employ more levels
in the object tree for large projects?

Yakov

* Re: svn to git, N-squared?
  2006-06-12 19:04       ` Yakov Lerner
@ 2006-06-12 19:17         ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2006-06-12 19:17 UTC (permalink / raw)
  To: Yakov Lerner; +Cc: Jon Smirl, git

On Mon, 12 Jun 2006, Yakov Lerner wrote:
> 
> Is this related to the 1-level dir tree for objects (12/object)
> vs. a 2-level dir tree (12/34/object)? Does git employ more levels
> in the object tree for large projects?

The "more levels" approach was certainly an option early on, when we 
discussed how the objects should be spread out.

It was basically made a non-issue by the pack-files. These days, the rule 
is really more along the lines of "if you ever have more than a few 
thousand files, you've not repacked properly".
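
As a rough illustration of the one-level layout: a loose object lives in
a directory named after the first two hex digits of its SHA-1, so there
are at most 256 object directories. The sketch below assumes nearly all
of the 1.27M files reported earlier in the thread are loose objects; the
function names are hypothetical and not part of git.

```python
import hashlib

# Two hex digits of fan-out give 16 * 16 = 256 object directories.
FANOUT_DIRS = 16 ** 2


def loose_object_path(object_id: str) -> str:
    """Map a 40-character hex object id to its loose-object path."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def files_per_directory(total_objects: int) -> float:
    """Average directory population if ids are spread uniformly."""
    return total_objects / FANOUT_DIRS


# Git hashes "<type> <size>\0<content>"; this is the empty blob.
empty_blob_id = hashlib.sha1(b"blob 0\0").hexdigest()

print(loose_object_path(empty_blob_id))
print(round(files_per_directory(1_270_000)))
```

With 1.27M unpacked objects this gives about 5000 files per directory,
which is the figure quoted above; after a repack the same objects sit in
a pack file and the per-directory count drops back toward zero.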

The git-svnimport script obviously doesn't do it right, but it should be 
trivial to fix. For the git cvsimporter, the fix was literally to just do

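	# Count every commit the importer creates.
	$commitcount++;
	..
	# Every 1024 commits, fold the loose objects into a single pack.
	if (($commitcount & 1023) == 0) {
		system("git repack -a -d");
	}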

when committing and that was it. It doesn't get much simpler than that, 
but the svnimporter just hasn't done it yet.
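
The same idea, as a minimal sketch in Python rather than the importer's
Perl. The function names, the repo_dir parameter and the injectable run
argument are assumptions made for illustration; only the
"git repack -a -d" command and the 1024-commit interval come from the
snippet above.

```python
import subprocess

# Same interval as the ($commitcount & 1023) == 0 test.
REPACK_INTERVAL = 1024


def should_repack(commit_count: int, interval: int = REPACK_INTERVAL) -> bool:
    """True on every interval-th commit (1024, 2048, ...)."""
    return commit_count > 0 and commit_count % interval == 0


def after_commit(commit_count: int, repo_dir: str, run=subprocess.run) -> bool:
    """Call once per imported commit; repacks when the interval is hit.

    run is injectable so the logic can be exercised without a real
    repository. Returns True if a repack was started.
    """
    if not should_repack(commit_count):
        return False
    # -a packs everything into one pack, -d deletes redundant packs.
    run(["git", "repack", "-a", "-d"], cwd=repo_dir, check=True)
    return True
```

An importer loop would increment its own counter after each commit and
pass it to after_commit(), so the number of loose objects stays bounded
by what 1024 commits can create.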

		Linus
