* [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
@ 2006-05-22 11:38 Martin Langhoff
2006-05-23 2:28 ` Linus Torvalds
From: Martin Langhoff @ 2006-05-22 11:38 UTC (permalink / raw)
To: git, junkio, Johannes.Schindelin, torvalds, spyderous, smurf
Cc: Martin Langhoff
Signed-off-by: Martin Langhoff <martin@catalyst.net.nz>
---
This is ugly, but while I work on cleaning up the leak
that seems to be somewhere in the commit() sub, we may
as well set up a workaround.
I am not 100% happy with including this in git.git.
In any case, I hope we can revert it soon.
---
git-cvsimport.perl | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
64ea3c83d8cd176ee972055bd1d11f398655dad8
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index c0ae00b..c1923d1 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -29,7 +29,7 @@ use IPC::Open2;
$SIG{'PIPE'}="IGNORE";
$ENV{'TZ'}="UTC";
-our($opt_h,$opt_o,$opt_v,$opt_k,$opt_u,$opt_d,$opt_p,$opt_C,$opt_z,$opt_i,$opt_P, $opt_s,$opt_m,$opt_M,$opt_A,$opt_S);
+our($opt_h,$opt_o,$opt_v,$opt_k,$opt_u,$opt_d,$opt_p,$opt_C,$opt_z,$opt_i,$opt_P, $opt_s,$opt_m,$opt_M,$opt_A,$opt_S,$opt_L);
my (%conv_author_name, %conv_author_email);
sub usage() {
@@ -85,7 +85,7 @@ sub write_author_info($) {
close ($f);
}
-getopts("hivmkuo:d:p:C:z:s:M:P:A:S:") or usage();
+getopts("hivmkuo:d:p:C:z:s:M:P:A:S:L:") or usage();
usage if $opt_h;
@ARGV <= 1 or usage();
@@ -716,6 +716,7 @@ my $commit = sub {
}
};
+my $commitcount = 1;
while(<CVS>) {
chomp;
if($state == 0 and /^-+$/) {
@@ -849,6 +850,9 @@ # VERSION:1.96->1.96.2.1
} elsif($state == 9 and /^\s*$/) {
$state = 10;
} elsif(($state == 9 or $state == 10) and /^-+$/) {
+ if ($opt_L && $commitcount++ >= $opt_L) {
+ last;
+ }
&$commit();
$state = 1;
} elsif($state == 11 and /^-+$/) {
--
1.3.2.g82000
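For readers skimming the patch, the guard can be modelled outside Perl. A minimal Python sketch of the same stop-after-N-commits pattern (hypothetical names, not the actual cvsimport parser; note it counts from zero, whereas the patch initialises `$commitcount` to 1, which as written appears to stop one commit short of the limit):

```python
def commit(changeset):
    """Stand-in for the real commit() sub in git-cvsimport.perl."""
    pass

def import_commits(changesets, limit=None):
    """Process changesets, stopping once `limit` commits have been made.

    Models the -L guard: check the counter before committing and bail
    out of the loop (Perl's `last`) once the limit is reached.
    """
    committed = 0
    for cs in changesets:
        if limit is not None and committed >= limit:
            break
        commit(cs)
        committed += 1
    return committed
```

The point of the workaround is simply that a bounded run can be restarted: each invocation imports at most `limit` commits, and the leaked memory is reclaimed when the process exits.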
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-22 11:38 [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks Martin Langhoff
@ 2006-05-23 2:28 ` Linus Torvalds
2006-05-23 3:15 ` Martin Langhoff (CatalystIT)
` (2 more replies)
From: Linus Torvalds @ 2006-05-23 2:28 UTC (permalink / raw)
To: Martin Langhoff
Cc: Git Mailing List, Junio C Hamano, Johannes.Schindelin, spyderous,
smurf
This stupid patch on top of yours seems to make git happier. It's
disgusting, I know, but it just repacks things every kilo-commit.
I actually think that I found a real ext3 performance bug from trying to
determine why git sometimes slows down ridiculously when the tree has been
allowed to go too long without a repack.
Linus
---
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index fb56278..c141f5e 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -853,10 +853,14 @@ # VERSION:1.96->1.96.2.1
} elsif($state == 9 and /^\s*$/) {
$state = 10;
} elsif(($state == 9 or $state == 10) and /^-+$/) {
- if ($opt_L && $commitcount++ >= $opt_L) {
+ $commitcount++;
+ if ($opt_L && $commitcount > $opt_L) {
last;
}
commit();
+ if (($commitcount & 1023) == 0) {
+ system("git repack -a -d");
+ }
$state = 1;
} elsif($state == 11 and /^-+$/) {
$state = 1;
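The `($commitcount & 1023) == 0` test is a cheap modulus: 1023 is 2**10 - 1, so the expression is zero exactly at multiples of 1024, i.e. every kilo-commit. A small Python sketch of the same trigger (the actual repack invocation is gated behind a flag here, since it shells out to git):

```python
import subprocess

REPACK_INTERVAL_MASK = 1023  # 2**10 - 1: mask is zero at multiples of 1024

def maybe_repack(commitcount, run=False):
    """Return True when a repack is due (every 1024 commits).

    With run=True this would actually invoke `git repack -a -d`,
    as in the patch above; left off by default in this sketch.
    """
    due = (commitcount & REPACK_INTERVAL_MASK) == 0
    if due and run:
        subprocess.run(["git", "repack", "-a", "-d"], check=True)
    return due
```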
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-23 2:28 ` Linus Torvalds
@ 2006-05-23 3:15 ` Martin Langhoff (CatalystIT)
2006-05-23 15:36 ` Theodore Tso
2006-05-26 0:42 ` Martin Langhoff
From: Martin Langhoff (CatalystIT) @ 2006-05-23 3:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Git Mailing List, Junio C Hamano, Johannes.Schindelin, spyderous,
smurf
Linus Torvalds wrote:
>
> This stupid patch on top of yours seems to make git happier. It's
> disgusting, I know, but it just repacks things every kilo-commit.
>
> I actually think that I found a real ext3 performance bug from trying to
> determine why git sometimes slows down ridiculously when the tree has been
> allowed to go too long without a repack.
>
Acked (in case anyone cares for such an obvious one), and thanks! I
thought of doing that last night together with that exact patch, but I
was focussing on the leak.
cheers,
m
--
-----------------------------------------------------------------------
Martin @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
OFFICE: +64(4)916-7224 MOB: +64(21)364-017
Make things as simple as possible, but no simpler - Einstein
-----------------------------------------------------------------------
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-23 2:28 ` Linus Torvalds
2006-05-23 3:15 ` Martin Langhoff (CatalystIT)
@ 2006-05-23 15:36 ` Theodore Tso
2006-05-23 16:05 ` Linus Torvalds
2006-05-26 0:42 ` Martin Langhoff
From: Theodore Tso @ 2006-05-23 15:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On Mon, May 22, 2006 at 07:28:37PM -0700, Linus Torvalds wrote:
>
>
> This stupid patch on top of yours seems to make git happier. It's
> disgusting, I know, but it just repacks things every kilo-commit.
>
> I actually think that I found a real ext3 performance bug from trying to
> determine why git sometimes slows down ridiculously when the tree has been
> allowed to go too long without a repack.
Do you have dir_index (the hashed btree) feature enabled by any chance?
- Ted
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-23 15:36 ` Theodore Tso
@ 2006-05-23 16:05 ` Linus Torvalds
2006-05-23 16:25 ` Linus Torvalds
From: Linus Torvalds @ 2006-05-23 16:05 UTC (permalink / raw)
To: Theodore Tso
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On Tue, 23 May 2006, Theodore Tso wrote:
> On Mon, May 22, 2006 at 07:28:37PM -0700, Linus Torvalds wrote:
> >
> > I actually think that I found a real ext3 performance bug from trying to
> > determine why git sometimes slows down ridiculously when the tree has been
> > allowed to go too long without a repack.
>
> Do you have dir_index (the hashed btree) feature enabled by any chance?
No, and I know I probably should, since it would hopefully help git usage.
But my problem actually happens even with moderately sized directories:
they were just 40kB or so in size, and the problem isn't high system CPU
usage, but tons of extra IO. I ran things on a machine with 2GB of RAM,
and as far as I could tell, the working set _should_ have fit into memory,
but CPU utilization was consistently in the 1% range.
Now, it's possible that I'm just wrong, and it really didn't fit in
memory, but I _suspect_ that the issue is that ext3 directory handling
still uses the "buffer_head" thing rather than the page cache, and that we
simply don't LRU the memory appropriately so we don't let the memory
pressure expand the buffer cache.
Now, using buffer cache in this day and age is insane and horrible
(there's a reason I suspect the LRU doesn't work that well: the buffer
heads aren't supposed to be used as a cache, and people are supposed to
use the page cache for it these days), but Andrew tells me that the whole
JBD thing basically requires it. Whatever.
Now, repacking obviously hides it entirely (because then the load becomes
entirely a page-cache load, and the kernel does _that_ beautifully), but
I'm a bit bummed that I think I hit an ext3 braindamage.
So an unpacked git archive on ext3 (but not ext2, I believe: ext2 should
use the page cache for directories) ends up being very buffer-cache
intensive. And the buffer cache is basically deprecated..
Linus
PS. I'll see if I can figure out the problem, and maybe the good news is
that I'll be able to just fix a real kernel performance issue. Still,
there's a _reason_ we tried to get away from the buffer heads as a caching
entity..
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-23 16:05 ` Linus Torvalds
@ 2006-05-23 16:25 ` Linus Torvalds
From: Linus Torvalds @ 2006-05-23 16:25 UTC (permalink / raw)
To: Theodore Tso
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On Tue, 23 May 2006, Linus Torvalds wrote:
>
> So an unpacked git archive on ext3 (but not ext2, I believe: ext2 should
> use the page cache for directories) ends up being very buffer-cache
> intensive. And the buffer cache is basically deprecated..
A few notes: I'm not 100% sure things really fit in the 2GB (*), so it
really may be IO limited, and the low CPU use is just because that machine
also happens to have a dang fast next-gen Intel CPU that hasn't even been
released yet.
Also, I do realize that hashed directories should actually decrease the
buffer cache pressure too, just because we wouldn't need to read all of
the directory for a lookup.
Linus
(*) cvsps itself grows to 1.6GB of the 2GB and while that memory should be
largely idle, the problem may simply be that we don't swap it out eagerly
enough. Allowing filesystem metadata to swap out processes is something
we've tuned against, because it tends to result in horrible interactive
behaviour after a nightly "updatedb" run. So it's entirely possible that
this is all normal..
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-23 2:28 ` Linus Torvalds
2006-05-23 3:15 ` Martin Langhoff (CatalystIT)
2006-05-23 15:36 ` Theodore Tso
@ 2006-05-26 0:42 ` Martin Langhoff
2006-05-26 5:20 ` Linus Torvalds
From: Martin Langhoff @ 2006-05-26 0:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On 5/23/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
> This stupid patch on top of yours seems to make git happier. It's
> disgusting, I know, but it just repacks things every kilo-commit.
Call me slow, but I am still running and rerunning that gentoo import. ;-)
The current import has reached ~200K commits, and .git is 450MB, while
the checked out tree is 230MB (680MB with .git). At this stage, git
repack -a -d is too memory hungry. I just had to kill it after it took
2GB and was still growing fast. So I have dropped the -a in my test
import for the time being.
Other than that, we are doing ~6K commits per hour on a 2GB 2GHz Opteron.
cheers,
martin
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-26 0:42 ` Martin Langhoff
@ 2006-05-26 5:20 ` Linus Torvalds
2006-05-26 5:29 ` Jakub Narebski
2006-05-26 6:02 ` Martin Langhoff
From: Linus Torvalds @ 2006-05-26 5:20 UTC (permalink / raw)
To: Martin Langhoff
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On Fri, 26 May 2006, Martin Langhoff wrote:
>
> Call me slow, but I am still running and rerunning that gentoo import. ;-)
I'm doing it too, just for fun.
Of course, since I'm doing this on a machine that basically has a laptop
disk, the "just for fun" part is a bit sad. It's waiting for disk about
25% of the time ;/
And it's slow as hell. I really wish we could do better on the CVS import
front.
> The current import has reached ~200K commits, and .git is 450MB, while
> the checked out tree is 230MB (680MB with .git). At this stage, git
> repack -a -d is too memory hungry.
I've got 2GB in that puppy, and "repack -a -d" is fine for me. I'm not
quite up to 200k commits yet (I'm at 160k), but the repacking is certainly
faster than the rest of the import.. Gaah.
It's "git-rev-list --objects" that is the memory sucker for me, the
packing itself doesn't seem to be too bad.
The biggest cost seems to be git-write-tree, which is about 0.225 seconds
for me on that tree on that machine. Which _should_ mean that we could do
4 commits a second, but that sure as hell ain't how it works out. It seems
to do about 1.71 commits a second for me on that tree, which is pretty
damn pitiful. Some cvs overhead, and probably some other git overhead too.
(That's a 2GHz Merom, so the fact that you get ~6k commits per hour on
your 2GHz Opteron is about the same speed - I suspect you're also at least
partly limited by disk, our numbers seem to match pretty well).
200k commits at 6k commits per hour is about a day and a half (plus the
occasional packing load). Taking that long to import a CVS archive is
horrible. But I guess it _is_ several years of work, and I guess you
really have to do it only once, but still.
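The arithmetic in the last two paragraphs can be checked directly: at the observed 1.71 commits/second the throughput is indeed roughly 6k commits per hour, and 200k commits come out to about a day and a half:

```python
write_tree_seconds = 0.225                 # measured cost of one git-write-tree
theoretical_rate = 1 / write_tree_seconds  # ~4.4 commits/sec if that were the only cost
observed_rate = 1.71                       # commits/sec actually measured

commits = 200_000
per_hour = observed_rate * 3600            # ~6.2k commits/hour
total_hours = commits / per_hour           # ~32.5 hours
total_days = total_hours / 24              # ~1.35 days, i.e. "about a day and a half"
```

The gap between ~4.4 theoretical and 1.71 observed commits/second is the "some cvs overhead, and probably some other git overhead too" mentioned above.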
Linus
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-26 5:20 ` Linus Torvalds
@ 2006-05-26 5:29 ` Jakub Narebski
2006-05-26 6:02 ` Martin Langhoff
From: Jakub Narebski @ 2006-05-26 5:29 UTC (permalink / raw)
To: git
Linus Torvalds wrote:
> 200k commits at 6k commits per hour is about a day and a half (plus the
> occasional packing load). Taking that long to import a CVS archive is
> horrible. But I guess it _is_ several years of work, and I guess you
> really have to do it only once, but still.
And how does parsecvs (which, as far as I remember, didn't have an incremental
mode) compare speed-wise to git-cvsimport? It is supposed to be faster...
--
Jakub Narebski
Warsaw, Poland
* Re: [PATCH] cvsimport: introduce -L<imit> option to workaround memory leaks
2006-05-26 5:20 ` Linus Torvalds
2006-05-26 5:29 ` Jakub Narebski
@ 2006-05-26 6:02 ` Martin Langhoff
From: Martin Langhoff @ 2006-05-26 6:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Martin Langhoff, Git Mailing List, Junio C Hamano,
Johannes.Schindelin, spyderous, smurf
On 5/26/06, Linus Torvalds <torvalds@osdl.org> wrote:
> I'm doing it too, just for fun.
Well, it's good to not be so alone in our definition of fun ;-)
> Of course, since I'm doing this on a machine that basically has a laptop
> disk, the "just for fun" part is a bit sad. It's waiting for disk about
> 25% of the time ;/
Ouch.
> And it's slow as hell. I really wish we could do better on the CVS import
> front.
Me too. However, I don't think the perl part is so costly anymore.
It's down to waiting on IO. git-write-tree is also prominently there.
It takes a lot of memory in some writes -- I had thought it'd be
cheaper as it takes one tree object at a time...
I also have a trivial patch that I haven't posted yet, that runs cvsps
to a tempfile, and then reads the file. Serialising the tasks means
that we don't carry around cvsps' memory footprint during the import
itself.
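The shape of that serialisation is simple: run the producer to completion, let it exit (releasing its memory), then stream its captured output. A rough Python sketch under those assumptions (the `cvsps` command line is a placeholder; this is not the unposted patch itself):

```python
import subprocess
import tempfile

def run_to_tempfile(cmd):
    """Run `cmd`, capturing stdout in a temp file, and return the file
    rewound for reading.  By the time the consumer starts reading, the
    producer has exited and its memory footprint is gone."""
    out = tempfile.TemporaryFile(mode="w+b")
    subprocess.run(cmd, stdout=out, check=True)
    out.seek(0)
    return out

# Hypothetical usage:
#   patches = run_to_tempfile(["cvsps", "--cvs-direct", "module"])
#   for line in patches: ...
```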
...
> It's "git-rev-list --objects" that is the memory sucker for me, the
> packing itself doesn't seem to be too bad.
No, you're right, it's git-rev-list that gets called during the
repack. But it was pushing everything it could to swap. Once it didn't
fit in memory, it hit a brick wall :(
> The biggest cost seems to be git-write-tree, which is about 0.225 seconds
> for me on that tree on that machine. Which _should_ mean that we could do
> 4 commits a second, but that sure as hell ain't how it works out. It seems
> to do about 1.71 commits a second for me on that tree, which is pretty
> damn pitiful. Some cvs overhead, and probably some other git overhead too.
Well, we _have_ to fetch the file. I guess you are thinking of
extracting it from the RCS ,v file directly? One thing that I found
that seemed to speed things up a bit was to declare TMPDIR to be a
directory in the same partition.
> (That's a 2GHz Merom, so the fact that you get ~6k commits per hour on
> your 2GHz Opteron is about the same speed - I suspect you're also at least
> partly limited by disk, our numbers seem to match pretty well).
Yup. This is _very_ disk-bound.
> 200k commits at 6k commits per hour is about a day and a half (plus the
> occasional packing load). Taking that long to import a CVS archive is
> horrible. But I guess it _is_ several years of work, and I guess you
> really have to do it only once, but still.
And it's a huge CVS archive too.
martin