* irc usage..
@ 2006-05-20 17:26 Linus Torvalds
  2006-05-20 17:50 ` Junio C Hamano
  2006-05-20 20:39 ` Yann Dirson
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-20 17:26 UTC (permalink / raw)
  To: Git Mailing List


I hate irc.

I'm reading the irc logs, and seeing that people have problems, but (a) it 
was while I was asleep and (b) irc use doesn't encourage people to 
actually explain what the problems _are_, so I have no clue.

So now I know that "spyderous" has problems importing some 1GB gentoo CVS 
archive, but that's pretty much it. Grr.

Are people afraid to post to git@vger.kernel.org, or what?

I saw that people tried to suggest posting to the git mailing list, but 
can any of you who are active on irc be a bit more forceful? And perhaps 
we don't make this mailing list address well enough known? 

As far as I'm aware, the git mailing list isn't closed, so people should 
be able to post here without even subscribing. I can well understand that 
you might not want to subscribe and prefer to look over the list through 
some archive setup (the way I look at the irc logs), and maybe we should 
just make the git mailing list address more obvious.

Right now, the "community" page at http://git.or.cz/community.html doesn't 
even mention the git mailing list address directly, it just tells you how 
you can subscribe and read the archives.

Can we perhaps fix that, and can the people who are active on irc please 
also make it clear to people that if they have some real problems that 
don't get an immediate answer, the git mailing list is where a lot of 
people can actually look more closely at them? And tell them what the 
address is.

			Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-20 17:26 irc usage Linus Torvalds
@ 2006-05-20 17:50 ` Junio C Hamano
  2006-05-20 18:52   ` Jakub Narebski
  2006-05-20 20:39 ` Yann Dirson
  1 sibling, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2006-05-20 17:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> I hate irc.
>...
> Can we perhaps fix that, and can the people who are active on irc please 
> also make it clear to people that if they have some real problems that 
> don't get an immediate answer, the git mailing list is where a lot of 
> people can actually look more closely at them? And tell them what the 
> address is.

I hate irc, too.  A number of times easily solvable usage problems
come up, and by the time I look at the log and realize that the
solutions suggested were waaaaay suboptimal, it is too late (with
loops being quite active recently things have improved a lot, but
we should not expect him to be there 24/7).

Maybe somebody can run a dumb 'bot that notices when somebody said
something that ends with a '?' and there has been no activity
for N minutes, and injects a recorded message that reminds people
of the mailing list address ;-).
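
For what it's worth, the check such a 'bot needs is small. A rough
sketch in C (the language of the patches in this thread); the name
should_remind() and its arguments are made up for illustration, and
nothing here talks to an IRC server:

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: returns 1 if the last line said on the channel
 * ends with '?' (ignoring trailing spaces and line endings) and nobody
 * has said anything for at least n_minutes, 0 otherwise.
 */
int should_remind(const char *last_line, time_t last_activity,
                  time_t now, int n_minutes)
{
    size_t len;

    assert(last_line != NULL);
    assert(n_minutes >= 0);

    /* Skip trailing whitespace so "foo? \n" still counts as a question. */
    len = strlen(last_line);
    while (len > 0 && (last_line[len - 1] == ' ' ||
                       last_line[len - 1] == '\n' ||
                       last_line[len - 1] == '\r'))
        len--;

    if (len == 0 || last_line[len - 1] != '?')
        return 0;

    /* difftime() returns the elapsed time in seconds as a double. */
    return difftime(now, last_activity) >= n_minutes * 60.0;
}
```

The caller would remember the last line and its timestamp per channel
and, when this returns 1, say the recorded message once.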


* Re: irc usage..
  2006-05-20 17:50 ` Junio C Hamano
@ 2006-05-20 18:52   ` Jakub Narebski
  0 siblings, 0 replies; 83+ messages in thread
From: Jakub Narebski @ 2006-05-20 18:52 UTC (permalink / raw)
  To: git

Junio C Hamano wrote:

> Maybe somebody can run a dumb 'bot that notices when somebody said
> something that ends with a '?' and there has been no activity
> for N minutes, and injects a recorded message that reminds people
> of the mailing list address ;-).

Or something like fsbot or the other bots on the #emacs channel:

   http://www.emacswiki.org/cgi-bin/wiki/EmacsChannel

-- 
Jakub Narebski
Warsaw, Poland


* Re: irc usage..
  2006-05-20 17:26 irc usage Linus Torvalds
  2006-05-20 17:50 ` Junio C Hamano
@ 2006-05-20 20:39 ` Yann Dirson
  2006-05-20 22:18   ` Donnie Berkholz
  2006-05-22  1:45   ` Linus Torvalds
  1 sibling, 2 replies; 83+ messages in thread
From: Yann Dirson @ 2006-05-20 20:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Sat, May 20, 2006 at 10:26:22AM -0700, Linus Torvalds wrote:
> I'm reading the irc logs, and seeing that people have problems, but (a) it 
> was while I was asleep and (b) irc use doesn't encourage people to 
> actually explain what the problems _are_, so I have no clue.
> 
> So now I know that "spyderous" has problems importing some 1GB gentoo CVS 
> archive, but that's pretty much it. Grr.

FWIW, I have mentioned a problem that may be the same, under
Message-ID <20060107090148.GB32585@nowhere.earth>, that was on January
7th.  Namely, when importing a repository with very large files over
pserver or ssh, timeouts can occur and prevent the import from
working.  But, as you said, it's not easy to get precise info from the
logs :)

Best regards,
-- 
Yann Dirson    <ydirson@altern.org> |
Debian-related: <dirson@debian.org> |   Support Debian GNU/Linux:
                                    |  Freedom, Power, Stability, Gratis
     http://ydirson.free.fr/        | Check <http://www.debian.org/>


* Re: irc usage..
  2006-05-20 20:39 ` Yann Dirson
@ 2006-05-20 22:18   ` Donnie Berkholz
  2006-05-20 22:45     ` Linus Torvalds
  2006-05-21  1:14     ` Donnie Berkholz
  2006-05-22  1:45   ` Linus Torvalds
  1 sibling, 2 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-20 22:18 UTC (permalink / raw)
  To: Yann Dirson; +Cc: Linus Torvalds, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

Yann Dirson wrote:
> On Sat, May 20, 2006 at 10:26:22AM -0700, Linus Torvalds wrote:
>> I'm reading the irc logs, and seeing that people have problems, but (a) it 
>> was while I was asleep and (b) irc use doesn't encourage people to 
>> actually explain what the problems _are_, so I have no clue.
>>
>> So now I know that "spyderous" has problems importing some 1GB gentoo CVS 
>> archive, but that's pretty much it. Grr.

Hi all,

I just subscribed and this post is the only one I've got from the
thread, so I'm responding to it instead of the original. Gentoo's an
IRC-based community, so I tend to try IRC first for any problems I have
and fall back to the list later if I can't get things figured out.

Here's a rough summary:

Our main repo is actually a bit over 2G (2103621223) now that I check,
but it's not very complex. There's actually just one branch, and I don't
think anyone would care if we lost the history from it because it's a
release branch from a few years ago.

Somebody else tried importing it with git-cvsimport, but he said he hit
some kind of problem and recalled that it was a cvsps segfault. Sounds
about right, since I've never gotten cvsps to run successfully on the
whole repo either.

I tried with parsecvs, but it runs into OOM even on a machine with 4G
RAM after reading in all the ,v files, presumably while it's building
some huge tree of changesets in memory. Keith Packard's suggested that
there are ways to reduce parsecvs's memory use, because it retains the
full tree in memory for each revision rather than just the files that
actually changed. But my C skills are pretty weak; I'm an OK reader but
not much of a writer yet.

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]


* Re: irc usage..
  2006-05-20 22:18   ` Donnie Berkholz
@ 2006-05-20 22:45     ` Linus Torvalds
  2006-05-20 23:12       ` Donnie Berkholz
  2006-05-21  9:46       ` Thomas Glanzmann
  2006-05-21  1:14     ` Donnie Berkholz
  1 sibling, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-20 22:45 UTC (permalink / raw)
  To: Donnie Berkholz; +Cc: Yann Dirson, Git Mailing List



On Sat, 20 May 2006, Donnie Berkholz wrote:
> 
> Our main repo is actually a bit over 2G (2103621223) now that I check,
> but it's not very complex. There's actually just one branch, and I don't
> think anyone would care if we lost the history from it because it's a
> release branch from a few years ago.

Can you point to it? I'm not a CVS user, but I've played with cvsps before 
(to get it to work), and I'm a humanitarian - rescuing people from CVS is 
to me not just a good idea, it's a moral imperative.

		Linus


* Re: irc usage..
  2006-05-20 22:45     ` Linus Torvalds
@ 2006-05-20 23:12       ` Donnie Berkholz
  2006-05-21 19:24         ` Linus Torvalds
  2006-05-21  9:46       ` Thomas Glanzmann
  1 sibling, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-20 23:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Yann Dirson, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 826 bytes --]

Linus Torvalds wrote:
> 
> On Sat, 20 May 2006, Donnie Berkholz wrote:
>> Our main repo is actually a bit over 2G (2103621223) now that I check,
>> but it's not very complex. There's actually just one branch, and I don't
>> think anyone would care if we lost the history from it because it's a
>> release branch from a few years ago.
> 
> Can you point to it? I'm not a CVS user, but I've played with cvsps before 
> (to get it to work), and I'm a humanitarian - rescuing people from CVS is 
> to me not just a good idea, it's a moral imperative.

I don't want to post the link publicly for a few reasons, including the
huge amount of bandwidth it would suck up for lots of people to download
it. I've sent it to you off-list, and if anyone else would also like it,
please drop me a note.

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]


* Re: irc usage..
  2006-05-20 22:18   ` Donnie Berkholz
  2006-05-20 22:45     ` Linus Torvalds
@ 2006-05-21  1:14     ` Donnie Berkholz
  1 sibling, 0 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-21  1:14 UTC (permalink / raw)
  To: Donnie Berkholz; +Cc: Yann Dirson, Linus Torvalds, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

Donnie Berkholz wrote:
> Somebody else tried importing it with git-cvsimport, but he said he hit
> some kind of problem and recalled that it was a cvsps segfault. Sounds
> about right, since I've never gotten cvsps to run successfully on the
> whole repo either.

Much to my surprise, a cvsps run I started earlier has just finished
without segfaulting. But attempts to actually run cvsps (e.g., cvsps -a
spyderous) spit thousands of warnings of "WARNING: revision 1.1.1.1 of
file $FILENAME on unnamed branch".

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]


* Re: irc usage..
  2006-05-20 22:45     ` Linus Torvalds
  2006-05-20 23:12       ` Donnie Berkholz
@ 2006-05-21  9:46       ` Thomas Glanzmann
  1 sibling, 0 replies; 83+ messages in thread
From: Thomas Glanzmann @ 2006-05-21  9:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Donnie Berkholz, Yann Dirson, Git Mailing List

Hello Linus,

> and I'm a humanitarian - rescuing people from CVS is 
> to me not just a good idea, it's a moral imperative.

you're a very brave man.

        Thomas


* Re: irc usage..
  2006-05-20 23:12       ` Donnie Berkholz
@ 2006-05-21 19:24         ` Linus Torvalds
  2006-05-22  3:59           ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-21 19:24 UTC (permalink / raw)
  To: Donnie Berkholz; +Cc: Yann Dirson, Git Mailing List



On Sat, 20 May 2006, Donnie Berkholz wrote:
> 
> I don't want to post the link publicly for a few reasons, including the
> huge amount of bandwidth it would suck up for lots of people to download
> it. I've sent it to you off-list, and if anyone else would also like it,
> please drop me a note.

Ok. It's still converting (that's a big archive), but it has passed the 
cvsps stage without errors for me, and the conversion so far seems ok. But 
it has only gotten to 

	Author: vapier <vapier>  2002-09-23 12:32:42
	Changed GPL to GPL-2 in LICENSE and updated SRC_URI to use mirror:

so it has converted only slightly more than the first two years of 
history in the roughly 30 minutes I've let it run. So it will take several 
hours.

The reason it works for me is likely simply the fact that I had a few 
patches to my cvsps already. I'm appending the stupid patches, I'm not 
guaranteeing that they are correct at all, although the three _committed_ 
patches are almost certainly correct (and the last uncommitted one is 
almost certainly totally broken). The patches are against clean cvsps 2.1.

Also, when I say "the conversion so far seems ok", I obviously don't 
actually know what the hell the archive is supposed to look like, so I can 
only say that the end result seems not totally insane.

To do a good conversion, you'll want to make sure that you have an author 
name conversion file. See the "-A" flag in "git help cvsimport" (if you 
have the man-pages installed).

		Linus

---
commit 534120d9a47062eecd7b53fd7ac0b70d97feb4fd
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:20:59 2006 -0800

    Increase log-length limit to 64kB
    
    Yeah, it should be dynamic. I'm lazy.
---
 cvsps_types.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cvsps_types.h b/cvsps_types.h
index b41e2a9..dba145d 100644
--- a/cvsps_types.h
+++ b/cvsps_types.h
@@ -8,7 +8,7 @@ #define CVSPS_TYPES_H
 
 #include <time.h>
 
-#define LOG_STR_MAX 32768
+#define LOG_STR_MAX 65536
 #define AUTH_STR_MAX 64
 #define REV_STR_MAX 64
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
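
[ Side note on "it should be dynamic": a growable buffer is one way to
  drop the fixed limit. The sketch below is only an illustration; the
  struct logbuf type and the logbuf_append()/logbuf_free() helpers are
  hypothetical and are not part of cvsps. ]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable log buffer; not part of cvsps. */
struct logbuf {
    char *data;   /* NUL-terminated once anything has been appended */
    size_t len;   /* bytes used, not counting the NUL */
    size_t cap;   /* bytes allocated */
};

/* Append one string; returns 0 on success, -1 if out of memory. */
int logbuf_append(struct logbuf *lb, const char *line)
{
    size_t add, need;

    assert(lb != NULL && line != NULL);

    add = strlen(line);
    need = lb->len + add + 1;
    if (need > lb->cap) {
        /* Double the capacity until the new text fits. */
        size_t ncap = lb->cap ? lb->cap : 256;
        char *p;

        while (ncap < need)
            ncap *= 2;
        p = realloc(lb->data, ncap);
        if (!p)
            return -1;
        lb->data = p;
        lb->cap = ncap;
    }
    /* Copy the text including its terminating NUL. */
    memcpy(lb->data + lb->len, line, add + 1);
    lb->len += add;
    return 0;
}

/* Release the buffer and reset it to the empty state. */
void logbuf_free(struct logbuf *lb)
{
    assert(lb != NULL);
    free(lb->data);
    lb->data = NULL;
    lb->len = lb->cap = 0;
}
```

[ A buffer starts out as { NULL, 0, 0 } and simply grows past 64kB
  when a log message needs it. ]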


commit 82fcf7e31bbeae3b01a8656549e9b8fd89d598eb
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:23:37 2006 -0800

    Improve handling of file collisions in the same patchset
    
    Take the file revision into account.
---
 cvsps.c |   27 +++++++++++++++++++++++++--
 1 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/cvsps.c b/cvsps.c
index 1e64e3c..c22147e 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2384,8 +2384,31 @@ void patch_set_add_member(PatchSet * ps,
     for (next = ps->members.next; next != &ps->members; next = next->next) 
     {
 	PatchSetMember * m = list_entry(next, PatchSetMember, link);
-	if (m->file == psm->file && ps->collision_link.next == NULL) 
-		list_add(&ps->collision_link, &collisions);
+	if (m->file == psm->file) {
+		int order = compare_rev_strings(psm->post_rev->rev, m->post_rev->rev);
+
+		/*
+		 * Same revision too? Add it to the collision list
+		 * if it isn't already.
+		 */
+		if (!order) {
+			if (ps->collision_link.next == NULL)
+				list_add(&ps->collision_link, &collisions);
+			return;
+		}
+
+		/*
+		 * If this is an older revision than the one we already have
+		 * in this patchset, just ignore it
+		 */
+		if (order < 0)
+			return;
+
+		/*
+		 * This is a newer one, remove the old one
+		 */
+		list_del(&m->link);
+	}
     }
 
     psm->ps = ps;
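
[ The three cases in the patch above (same revision, older, newer) can
  be tried out in isolation. The sketch below is an illustration only:
  rev_cmp() is a hypothetical stand-in, not the real
  compare_rev_strings() from cvsps, and it assumes the strcmp()-like
  sign convention that the comments in the patch imply. The enum and
  classify_member() are likewise made up. ]

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a revision comparison: compares dotted
 * revision strings such as "1.9" and "1.10" component by component
 * as numbers. Returns <0, 0 or >0 like strcmp().
 */
int rev_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    for (;;) {
        char *ea, *eb;
        long na = strtol(a, &ea, 10);
        long nb = strtol(b, &eb, 10);

        if (na != nb)
            return na < nb ? -1 : 1;
        /* Equal so far: the revision with fewer components sorts first. */
        if (*ea != '.' || *eb != '.')
            return (*ea == '.') - (*eb == '.');
        a = ea + 1;
        b = eb + 1;
    }
}

enum member_action { MEMBER_COLLISION, MEMBER_IGNORE, MEMBER_REPLACE };

/* Mirrors the three branches of the patch for one file in a patchset. */
enum member_action classify_member(const char *new_rev, const char *old_rev)
{
    int order = rev_cmp(new_rev, old_rev);

    if (order == 0)
        return MEMBER_COLLISION;   /* same revision: record a collision */
    if (order < 0)
        return MEMBER_IGNORE;      /* older than what we have: skip it */
    return MEMBER_REPLACE;         /* newer: drop the old member */
}
```

[ Comparing numerically matters: as plain strings "1.10" would sort
  before "1.9". ]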

commit 3d1ebcef6b4f9f6c9064efd64da4dd30d93c3c96
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 17:20:20 2006 -0800

    Fix branch ancestor calculation
    
    Not having any ancestor at all means that any valid ancestor (even of
    "depth 0") is fine.
    
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
 cvsps.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cvsps.c b/cvsps.c
index c22147e..2695a0f 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2599,7 +2599,7 @@ static void determine_branch_ancestor(Pa
 	 * note: rev is the pre-commit revision, not the post-commit
 	 */
 	if (!head_ps->ancestor_branch)
-	    d1 = 0;
+	    d1 = -1;
 	else if (strcmp(ps->branch, rev->branch) == 0)
 	    continue;
 	else if (strcmp(head_ps->ancestor_branch, "HEAD") == 0)


uncommitted diff
Author: Linus Torvalds <torvalds@g5.osdl.org>

    Probably totally broken dot counting
---
 cvsps.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/cvsps.c b/cvsps.c
index 2695a0f..2ad1595 100644
--- a/cvsps.c
+++ b/cvsps.c
@@ -2357,9 +2357,16 @@ static int revision_affects_branch(CvsFi
 static int count_dots(const char * p)
 {
     int dots = 0;
+    int len = strlen(p);
 
-    while (*p)
-	if (*p++ == '.')
+    while (len > 2) {
+	if (memcmp(p+len-2, ".1", 2))
+		break;
+	len -= 2;
+    }
+
+    while (len)
+	if (p[--len] == '.')
 	    dots++;
 
     return dots;
@@ -2613,7 +2620,7 @@ static void determine_branch_ancestor(Pa
 	/* HACK: we sometimes pretend to derive from the import branch.  
 	 * just don't do that.  this is the easiest way to prevent... 
 	 */
-	d2 = (strcmp(rev->rev, "1.1.1.1") == 0) ? 0 : count_dots(rev->rev);
+	d2 = count_dots(rev->rev);
 	
 	if (d2 > d1)
 	    head_ps->ancestor_branch = rev->branch;
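
[ To see what the uncommitted change does to count_dots(), the patched
  function can be lifted out and run on its own. The body below follows
  the diff above (minus "static", plus an assert and comments); the
  revision strings mentioned afterwards are made-up samples. ]

```c
#include <assert.h>
#include <string.h>

/* count_dots() as it looks with the uncommitted diff applied. */
int count_dots(const char *p)
{
    int dots = 0;
    int len;

    assert(p != NULL);
    len = (int)strlen(p);

    /* Drop trailing ".1" pairs for as long as more than 2 chars remain. */
    while (len > 2) {
        if (memcmp(p + len - 2, ".1", 2))
            break;
        len -= 2;
    }

    /* Count the dots in whatever prefix is left. */
    while (len)
        if (p[--len] == '.')
            dots++;

    return dots;
}
```

[ With this, "1.1.1.1" counts as 0 dots, which is why the explicit
  1.1.1.1 check could be dropped, "1.2.2.1" counts as 2, and "1.1"
  goes from 1 to 0. ]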


* Re: irc usage..
  2006-05-20 20:39 ` Yann Dirson
  2006-05-20 22:18   ` Donnie Berkholz
@ 2006-05-22  1:45   ` Linus Torvalds
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22  1:45 UTC (permalink / raw)
  To: Yann Dirson; +Cc: Git Mailing List



On Sat, 20 May 2006, Yann Dirson wrote:
> 
> FWIW, I have mentioned a problem that may be the same, under
> Message-ID <20060107090148.GB32585@nowhere.earth>, that was on January
> 7th.  Namely, when importing a repository with very large files over
> pserver or ssh, timeouts can occur and prevent the import from
> working.  But, as you said, it's not easy to get precise info from the
> logs :)

For big repositories, you really shouldn't use pserver or ssh anyway. You 
should try really really hard to just get a local copy, and do it that 
way. It's going to be tons faster, and will avoid a lot of the problems, 
including network timeouts etc.

		Linus


* Re: irc usage..
  2006-05-21 19:24         ` Linus Torvalds
@ 2006-05-22  3:59           ` Linus Torvalds
  2006-05-22  4:19             ` Donnie Berkholz
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22  3:59 UTC (permalink / raw)
  To: Donnie Berkholz; +Cc: Yann Dirson, Git Mailing List



On Sun, 21 May 2006, Linus Torvalds wrote:
> 
> Ok. It's still converting (that's a big archive), but it has passed the 
> cvsps stage without errors for me, and the conversion so far seems ok. But 
> it has only gotten to 
> 
> 	Author: vapier <vapier>  2002-09-23 12:32:42
> 	Changed GPL to GPL-2 in LICENSE and updated SRC_URI to use mirror:
> 
> so it has converted only slightly more than the first two years of 
> history in the roughly 30 minutes I've let it run. So it will take several 
> hours.

Btw, trying this import (which got interrupted by a thunderstorm and one 
of our first power failures in a long time - just a few seconds, but 
enough to power off everything but my laptops) it became very obvious that 
"git cvsimport" really _really_ should re-pack the archive every once in a 
while.

The old "repack every month or so" approach doesn't work that well when 
you try to import several years of history in a few hours.

Now, you can just repack after the whole thing is done (it will probably 
take no more than ~15 minutes or so), but it would probably be best if the 
import script itself decided to repack every once in a while just to avoid 
wasting a lot of diskspace _during_ the import itself.

So this isn't so much a correctness issue as an "avoid wasting time and 
space" issue, but still..

			Linus


* Re: irc usage..
  2006-05-22  3:59           ` Linus Torvalds
@ 2006-05-22  4:19             ` Donnie Berkholz
  2006-05-22  4:50               ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22  4:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Yann Dirson, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 2416 bytes --]

Linus Torvalds wrote:
> 
> On Sun, 21 May 2006, Linus Torvalds wrote:
>> Ok. It's still converting (that's a big archive), but it has passed the 
>> cvsps stage without errors for me, and the conversion so far seems ok. But 
>> it has only gotten to 
>>
>> 	Author: vapier <vapier>  2002-09-23 12:32:42
>> 	Changed GPL to GPL-2 in LICENSE and updated SRC_URI to use mirror:
>>
>> so it has converted only slightly more than the first two years of 
>> history in the roughly 30 minutes I've let it run. So it will take several 
>> hours.
> 
> Btw, trying this import (which got interrupted by a thunderstorm and one 
> of our first power failures in a long time - just a few seconds, but 
> enough to power off everything but my laptops) it became very obvious that 
> "git cvsimport" really _really_ should re-pack the archive every once in a 
> while.

Fortunately the storms haven't been that bad down in Corvallis. cvsps
also worked fine for me, but git-cvsimport broke in the middle. The
command I'm using is 'git-cvsimport -P ../gentoo.cvsps -k -d
/media/scm_comparison -A ~/dev/Authors -v gentoo-x86 | tee cvsimport.log'

Here's the last bits:

Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild   v 1.5
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild: 947 bytes
Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild   v 1.3
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild: 977 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild   v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild: 2704 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild   v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild: 3031 bytes
Tree ID 4d19a84efce2de9cfb42ac0397e0036bbed2ad65
Parent ID ecb78bbe30369a76e2599d0d17de8fe922dca211
Committed patch 14615 (origin 2002-07-16 20:13:15)
Commit ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Delete net-fs/openafs/openafs-1.2.2-r6.ebuild
Delete net-fs/openafs/files/digest-openafs-1.2.2-r6
Tree ID bfc7320883983655d7d2ea2c6d04f85b45365ce1
Parent ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Committed patch 14616 (origin 2002-07-16 20:15:15)
Commit ID 7a36de9c4c9b93337ed789ae2341cad3d0991c6d
Unknown: error  Cannot allocate memory
Fetching profiles/package.mask   v 1.992
cat: write error: Broken pipe

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]


* Re: irc usage..
  2006-05-22  4:19             ` Donnie Berkholz
@ 2006-05-22  4:50               ` Linus Torvalds
  2006-05-22  5:04                 ` Martin Langhoff
                                   ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22  4:50 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Yann Dirson, Git Mailing List, Matthias Urlichs, Martin Langhoff,
	Johannes Schindelin



On Sun, 21 May 2006, Donnie Berkholz wrote:
> 
> Fortunately the storms haven't been that bad down in Corvallis. cvsps
> also worked fine for me, but git-cvsimport broke in the middle.

Hmm. It's actually possible that it did that for me too - I had put the 
cvsimport in an xterm and forgotten about it, and just assumed that the 
power failure was what broke it. But maybe it had broken down before that 
happened - I just don't have any logs left ;)

> Here's the last bits:
> 
> [ snip snip ]
> Commit ID 7a36de9c4c9b93337ed789ae2341cad3d0991c6d
> Unknown: error  Cannot allocate memory
> Fetching profiles/package.mask   v 1.992
> cat: write error: Broken pipe

Hmm. I don't actually know perl, and my original "cvsimport" script was 
actually this funny C program that generated a shell script to do the 
import. That worked fine, and had no memory leaks, but it was a truly 
hacky thing of horrible beauty. Or rather, it _would_ have been that, if 
it had had any beauty to be horrible about. But at least I would have been 
able to debug it.

But the perl one I can't parse any more. That said, the whole "Unknown:" 
printout seems to come from the subroutine "_line()", which just reads a 
line from the cvs server.

Did you do a "top" at any time just before this all happened? It _sounds_ 
like it might actually be a memory leak on the CVS server side, and the 
problem may (or may not) be due to the optimization that keeps a single 
long-running CVS server instance for the whole process.

I wouldn't be in the least surprised if that ends up triggering a slow 
leak in CVS itself, and then CVS runs out of memory.

That would likely have been obvious in any "top" output just before the 
failure.

Smurf, Martin, Dscho.. Any ideas? My old script just ran RCS directly on 
the files, and had no issues like that. I'll happily admit that my old 
script generator thing was horrible, but it was a lot easier to debug than 
the smarter perl script that uses a CVS server connection..

		Linus


* Re: irc usage..
  2006-05-22  4:50               ` Linus Torvalds
@ 2006-05-22  5:04                 ` Martin Langhoff
  2006-05-22  5:21                 ` Donnie Berkholz
  2006-05-22  7:42                 ` Martin Langhoff
  2 siblings, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22  5:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Martin Langhoff, Johannes Schindelin

On 5/22/06, Linus Torvalds <torvalds@osdl.org> wrote:
> I wouldn't be in the least surprised if that ends up triggering a slow
> leak in CVS itself, and then CVS runs out of memory.

I'm dying to try this out myself after work. I wouldn't rule out that
cvsimport is stuffing data into an array that grows forever. In
any case you'll hear from me soon.



martin


* Re: irc usage..
  2006-05-22  4:50               ` Linus Torvalds
  2006-05-22  5:04                 ` Martin Langhoff
@ 2006-05-22  5:21                 ` Donnie Berkholz
  2006-05-22  7:42                 ` Martin Langhoff
  2 siblings, 0 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22  5:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yann Dirson, Git Mailing List, Matthias Urlichs, Martin Langhoff,
	Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 451 bytes --]

Linus Torvalds wrote:
> Did you do a "top" at any time just before this all happened? It _sounds_ 
> like it might actually be a memory leak on the CVS server side, and the 
> problem may (or may not) be due to the optimization that keeps a single 
> long-running CVS server instance for the whole process.

No. =\ I just started the thing running in a screen session and came
back a few hours later to find it like that.

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]


* Re: irc usage..
  2006-05-22  4:50               ` Linus Torvalds
  2006-05-22  5:04                 ` Martin Langhoff
  2006-05-22  5:21                 ` Donnie Berkholz
@ 2006-05-22  7:42                 ` Martin Langhoff
  2006-05-22  9:13                   ` Linus Torvalds
  2 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22  7:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/22/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Did you do a "top" at any time just before this all happened? It _sounds_
> like it might actually be a memory leak on the CVS server side, and the
> problem may (or may not) be due to the optimization that keeps a single
> long-running CVS server instance for the whole process.

Running a few tests right now. Looks like cvs (Debian/etch 1.12.9-13)
itself is not leaking any memory. The Perl (Debian/etch
5.8.7-something and now 5.8.8-4) process OTOH is visibly allocating
memory. Starts off at 4MB and gets up to ~17MB by the time it has done
6K commits.

I am trying to figure out whether the leak is in the script or in the
Perl implementation, using PadWalker, Devel::Leak and friends. If the
leak is here, I can't see it (yet).

> I wouldn't be in the least surprised if that ends up triggering a slow
> leak in CVS itself, and then CVS runs out of memory.

Or a slow leak in Perl? The 5.8.8 release notes do talk about some
leaks being fixed, but this 5.8.8 isn't making a difference.

Working on it.



martin


* Re: irc usage..
  2006-05-22  7:42                 ` Martin Langhoff
@ 2006-05-22  9:13                   ` Linus Torvalds
  2006-05-22 12:54                     ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22  9:13 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Mon, 22 May 2006, Martin Langhoff wrote:
> 
> Or a slow leak in Perl? The 5.8.8 release notes do talk about some
> leaks being fixed, but this 5.8.8 isn't making a difference.
> 
> Working on it.

Thanks. Looking at what I did convert, that horrid gentoo CVS tree is 
interesting. The resulting (partial) git history has 93413 commits and 
850,000+ objects total, all in a totally linear history.

And that's just up to April 2004, so the full tree is probably a million 
objects.

The good news is that git seems to handle that size repo no problem at 
all. The repack did indeed take a long while, but it packed it all down to 
a 189MB pack-file (and 20MB pack index).

Considering that the bzip2'd tar-file of the CVS history was 157MB, and 
the actual CVS footprint was about 1.6GB, if git stays at under a quarter 
gigabyte for the whole archive once converted (which sounds likely, 
counting indexing), git would basically cut down the disk usage for a live 
repo by a factor of 7 or so.

_And_ I can do a "git log origin > /dev/null" in about 2.4 seconds. Take 
that, CVS.

		Linus


* Re: irc usage..
  2006-05-22  9:13                   ` Linus Torvalds
@ 2006-05-22 12:54                     ` Martin Langhoff
  2006-05-22 17:27                       ` Linus Torvalds
  2006-05-22 19:09                       ` Donnie Berkholz
  0 siblings, 2 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 12:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/22/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Mon, 22 May 2006, Martin Langhoff wrote:
> >
> > Or a slow leak in Perl? The 5.8.8 release notes do talk about some
> > leaks being fixed, but this 5.8.8 isn't making a difference.
> >
> > Working on it.
>
> Thanks. Looking at what I did convert, that horrid gentoo CVS tree is
> interesting. The resulting (partial) git history has 93413 commits and
> 850,000+ objects total, all in a totally linear history.

Ok, so there's 3 patches posted that should help narrow down the
problem. There's a new -L <imit> so that Donnie can get his stuff done
by running it in a while(true) loop. Not proud of it, but hey.

And there are two patches that I suspect may fix the leak. After
applying them, the cvsimport process grows up to ~13MB and then tapers
off, at least as far as my patience has gotten me. It's late on this
side of the globe so I'll look at the results tomorrow morning.

(BTW, I typo-ed Linus' address in the git-send-email invocation. Will
resend to him separately)

I'll also prep a patch as Linus suggests to do auto-repacking while
the import runs so we don't eat up the harddisk.

> git would basically cut down the disk usage for a live
> repo by a factor of 7 or so.
>
> _And_ I can do a "git log origin > /dev/null" in about 2.4 seconds. Take
> that, CVS.

Heh. Faster Gitticat, Kill Kill Kill!




martin

* Re: irc usage..
  2006-05-22 12:54                     ` Martin Langhoff
@ 2006-05-22 17:27                       ` Linus Torvalds
  2006-05-22 17:51                         ` Jakub Narebski
  2006-05-22 19:46                         ` Martin Langhoff
  2006-05-22 19:09                       ` Donnie Berkholz
  1 sibling, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 17:27 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Tue, 23 May 2006, Martin Langhoff wrote:
> 
> And there are two patches that I suspect may fix the leak. After
> applying them, the cvsimport process grows up to ~13MB and then tapers
> off, at least as far as my patience has gotten me. It's late on this
> side of the globe so I'll look at the results tomorrow morning.

Ok, initial results are promising. git-cvsimport appears to be still 
slowly growing, but it's at 40M (ie pretty tiny, considering that cvsps 
grew to 800+MB on this archive) and growth seems to actually be slowing.

My conversion is only up to September 2002, but if it doesn't suddenly hit 
some huge growth spurt, I wouldn't expect it to run out of memory. The CVS 
server process itself is tiny, and doesn't seem to grow at all.

As to packing, doing something like

	while :
	do
		sleep 30

		#
		# repack roughly every 25600 objects
		#
		n=$(ls .git/objects/00 2> /dev/null | wc -l)
		if [ $n -gt 100 ]; then
			git repack -a
			#
			# Stupid sleep to make sure that nobody is still
			# using any unpacked objects after the pack got
			# generated
			#
			sleep 10
			git prune-packed
		fi
	done

or similar (the above is totally untested - I've just done it by hand a 
few times) should work. It's perfectly ok to repack the archive even while 
the cvsimport script is adding more data and changing it.
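
The "25600" in the comment follows from how loose objects are stored: they
are spread over 256 fan-out directories named after the first byte of the
hash, so sampling one directory and multiplying by 256 gives an estimate.
A minimal sketch of that estimate, where estimate_loose_objects is a
made-up helper name and not a git command:

```shell
# Estimate the number of loose objects by sampling one fan-out directory.
# The argument is the path to the objects directory (normally .git/objects).
estimate_loose_objects() {
    objdir=$1
    # Count entries in the "00" bucket; a missing directory counts as zero.
    n=$(ls "$objdir/00" 2>/dev/null | wc -l)
    # Object names are hashes, so each of the 256 buckets holds about n files.
    echo $((n * 256))
}
```

With the threshold of 100 entries used in the loop above, the repack
triggers at an estimated 100 * 256 = 25600 loose objects.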

		Linus

* Re: irc usage..
  2006-05-22 17:27                       ` Linus Torvalds
@ 2006-05-22 17:51                         ` Jakub Narebski
  2006-05-22 18:03                           ` Linus Torvalds
  2006-05-22 19:46                         ` Martin Langhoff
  1 sibling, 1 reply; 83+ messages in thread
From: Jakub Narebski @ 2006-05-22 17:51 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:

>                       git repack -a
>                       #
>                       # Stupid sleep to make sure that nobody is still
>                       # using any unpacked objects after the pack got
>                       # generated
>                       #
>                       sleep 10
>                       git prune-packed

Is it really necessary (on Linux at least)? Git boasts of its atomicity...

-- 
Jakub Narebski
Warsaw, Poland

* Re: irc usage..
  2006-05-22 17:51                         ` Jakub Narebski
@ 2006-05-22 18:03                           ` Linus Torvalds
  2006-05-22 19:03                             ` Matthias Lederhofer
  2006-05-23 20:19                             ` Jakub Narebski
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 18:03 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git



On Mon, 22 May 2006, Jakub Narebski wrote:
>
> Linus Torvalds wrote:
> 
> >                       git repack -a
> >                       #
> >                       # Stupid sleep to make sure that nobody is still
> >                       # using any unpacked objects after the pack got
> >                       # generated
> >                       #
> >                       sleep 10
> >                       git prune-packed
> 
> Is it really necessary (on Linux at least)? Git boasts of its atomicity...

I don't think it's necessary in practice.

But people _should_ realize that removing objects is very very special. 
Whether it's done by "git prune-packed" or "git prune", that's a very 
dangerous operation. "git prune" a lot more so than "git prune-packed", 
of course (in fact, you should _never_ run "git prune" on a repository 
that is active - you _will_ corrupt it).

Doing "git prune-packed" _should_ be mostly safe on UNIX, since the 
objects all exist in packs, and anybody who already opened an object will 
keep the fd open, and not even notice that the name is gone. However, 
there is at least one race:

	object lookup			"git repack -a -d"
	=============			==================

 - a process does its object
   database setup. No new pack-file
   yet.

					 - mv tmp-packfile active-packfile

					 - git prune-packed

 - the process looks up the object,
   and doesn't look in the pack-file
   because it didn't see the pack-file.

   So it tries to look up an object,
   fails, and errors out.

   It's not a fatal error (just re-try)
   but it could break something like a
   cvsimport

Now, in PRACTICE, I doubt you'd ever hit this. But the fact is, pruning 
your repository (whether prune-packed or a full prune) is _the_ special 
operation. It's something that removes a filesystem representation of an 
object that is otherwise immutable.
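
Since the failed lookup in that race is transient, a caller that cannot
afford to die on it could simply retry. A minimal sketch, assuming a
made-up helper called retry and an optional RETRY_DELAY variable (neither
is part of git):

```shell
# Run a command up to N times, pausing between attempts, and return the
# exit status of the last attempt.
retry() {
    attempts=$1
    shift
    i=1
    while :; do
        "$@" && return 0
        status=$?
        # Give up once the allowed number of attempts is used.
        [ "$i" -ge "$attempts" ] && return "$status"
        i=$((i + 1))
        sleep "${RETRY_DELAY:-1}"
    done
}
```

Something like "retry 3 git cat-file -t <object>" would then survive one
badly timed prune-packed, because the second attempt sets up the object
database again and sees the new pack-file.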

		Linus

* Re: irc usage..
  2006-05-22 18:03                           ` Linus Torvalds
@ 2006-05-22 19:03                             ` Matthias Lederhofer
  2006-05-22 19:09                               ` Junio C Hamano
  2006-05-23 20:19                             ` Jakub Narebski
  1 sibling, 1 reply; 83+ messages in thread
From: Matthias Lederhofer @ 2006-05-22 19:03 UTC (permalink / raw)
  To: git

> But people _should_ realize that removing objects is very very special. 

Just a similar question: is there any reason not to run git
repack/prune-packed as a cron job? I am thinking of something like this
every night:

- git prune-packed (remove objects packed last time)
- check how many objects git-count-objects counts; if there are not
  enough, abort
- git repack

git repack -a -d is probably a bad idea, I guess, because a program
could try to open the old packs after they were deleted.  Is there any
way to delete unnecessary packs (those which repack -a -d would delete)?
That would make it possible to do a git repack -a and delete those packs
the next night.
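
The three steps above could be sketched as a shell function for the cron
job to call. The names nightly_repack and loose_count and the MIN_OBJECTS
threshold are assumptions made for illustration; the only git behaviour
relied on is that "git count-objects" prints "<N> objects, <M> kilobytes".

```shell
# Extract the loose-object count from a line of "git count-objects" output.
loose_count() {
    echo "$1" | awk '{ print $1 }'
}

# Run from inside the repository, once per night.
nightly_repack() {
    min_objects=${MIN_OBJECTS:-1000}
    # Remove loose objects already covered by the pack made last night.
    git prune-packed || return 1
    n=$(loose_count "$(git count-objects)")
    # Too few new loose objects to be worth a new pack.
    [ "$n" -lt "$min_objects" ] && return 0
    # Incremental repack: packs the remaining loose objects and deletes
    # nothing, so a concurrent reader never loses an object.
    git repack
}
```

Deferring the prune-packed to the following night leaves a full day
between creating a pack and removing the loose copies it covers.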

* Re: irc usage..
  2006-05-22 19:03                             ` Matthias Lederhofer
@ 2006-05-22 19:09                               ` Junio C Hamano
  0 siblings, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2006-05-22 19:09 UTC (permalink / raw)
  To: Matthias Lederhofer; +Cc: git

Matthias Lederhofer <matled@gmx.net> writes:

> ...  Is there any way to
> delete unnecessary packs (those which would repack -a -d delete)?
> Making it possible to do a git repack -a and delete those packs the
> next night?

pack-redundant is supposed to figure it out, but I have never
used it myself so your mileage may vary.

* Re: irc usage..
  2006-05-22 12:54                     ` Martin Langhoff
  2006-05-22 17:27                       ` Linus Torvalds
@ 2006-05-22 19:09                       ` Donnie Berkholz
  2006-05-22 19:38                         ` Linus Torvalds
                                           ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22 19:09 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

Martin Langhoff wrote:
> On 5/22/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> On Mon, 22 May 2006, Martin Langhoff wrote:
>> >
>> > Or a slow leak in Perl? The 5.8.8 release notes do talk about some
>> > leaks being fixed, but this 5.8.8 isn't making a difference.
>> >
>> > Working on it.
>>
>> Thanks. Looking at what I did convert, that horrid gentoo CVS tree is
>> interesting. The resulting (partial) git history has 93413 commits and
>> 850,000+ objects total, all in a totally linear history.
> 
> Ok, so there's 3 patches posted that should help narrow down the
> problem. There's a new -L <imit> so that Donnie can get his stuff done
> by running it in a while(true) loop. Not proud of it, but hey.
> 
> And there are two patches that I suspect may fix the leak. After
> applying them, the cvsimport process grows up to ~13MB and then tapers
> off, at least as far as my patience has gotten me. It's late on this
> side of the globe so I'll look at the results tomorrow morning.

OK, I started a new run without -L, and I'm watching it in top right
now. The cvsimport seems to be doing alright, but the cvs server process
sucks about another megabyte of virtual every 4-5 seconds. This is a bit
concerning since I don't have any swap. Shortly after it hit 670M, I got
"Cannot allocate memory" again. I've got a gig of RAM, and around 300M
was resident in various processes at the time.

So it seems the problem is in cvs itself. I will try another run with -L
now.

Thanks,
Donnie


* Re: irc usage..
  2006-05-22 19:09                       ` Donnie Berkholz
@ 2006-05-22 19:38                         ` Linus Torvalds
  2006-05-22 19:49                           ` Donnie Berkholz
  2006-05-22 19:41                         ` Martin Langhoff
  2006-05-22 20:16                         ` irc usage Donnie Berkholz
  2 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 19:38 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Martin Langhoff, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Mon, 22 May 2006, Donnie Berkholz wrote:
> 
> OK, I started a new run without -L, and I'm watching it in top right
> now. The cvsimport seems to be doing alright, but the cvs server process
> sucks about another megabyte of virtual every 4-5 seconds. This is a bit
> concerning since I don't have any swap. Shortly after it hit 670M, I got
> "Cannot allocate memory" again. I've got a gig of RAM, and around 300M
> was resident in various processes at the time.

Hmm. My cvs server doesn't really grow at all. It's at 13M RSS.

What version of cvs are you running?

	[torvalds@g5 ~]$ cvs --version

	Concurrent Versions System (CVS) 1.11.21 (client/server)

maybe that matters.

(but my import is only up to Jun 22, 2003 so far).

		Linus

* Re: irc usage..
  2006-05-22 19:09                       ` Donnie Berkholz
  2006-05-22 19:38                         ` Linus Torvalds
@ 2006-05-22 19:41                         ` Martin Langhoff
  2006-05-22 20:11                           ` Linus Torvalds
  2006-05-22 20:16                         ` irc usage Donnie Berkholz
  2 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 19:41 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/23/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> So it seems the problem is in cvs itself. I will try another run with -L
> now.

What version of cvs are you using? Perhaps trying a different one?

The dev machine where I am running the import is a slug! It's still
working on it, only gotten to 7700 commits, with the cvsimport process
stable at 28MB RAM and cvs stable at 4MB.

cheers,


martin

* Re: irc usage..
  2006-05-22 17:27                       ` Linus Torvalds
  2006-05-22 17:51                         ` Jakub Narebski
@ 2006-05-22 19:46                         ` Martin Langhoff
  1 sibling, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/23/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Ok, initial results are promising. git-cvsimport appears to be still
> slowly growing, but it's at 40M (ie pretty tiny, considering that cvsps
> grew to 800+MB on this archive) and growth seems to actually be slowing.

That's great news. The cvs archive seems to have large commits every
once in a while, so I suspect the residual memory growth may be
related to those. Or to a smaller leak I haven't nailed.

My test box is bloody slow it seems. I'll try and get hold of a faster
machine to run this if I can.

> As to packing, doing something like

Given that we are running batch, it is safe and simple to stop the
import, repack, prune-packed, and keep going. Don't think we'll win
any races by running it in parallel ;-)
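
A sketch of that stop/repack/continue cycle, assuming the import is run in
bounded batches (for example with the new -L limit). import_in_batches is
a made-up helper; the three command strings are supplied by the caller and
are not prescribed here.

```shell
# Arguments are shell command strings:
#   $1  imports one bounded batch
#   $2  prints the current tip of the imported branch
#   $3  repacks the repository while no import is running
import_in_batches() {
    import_cmd=$1
    tip_cmd=$2
    repack_cmd=$3
    prev=""
    while :; do
        sh -c "$import_cmd" || return 1
        tip=$(sh -c "$tip_cmd")
        # An unchanged tip means the last batch imported nothing new.
        [ "$tip" = "$prev" ] && return 0
        prev=$tip
        sh -c "$repack_cmd" || return 1
    done
}
```

Because nothing else touches the repository between batches, the repack
step can safely be "git repack -a -d && git prune-packed" with no sleep.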

cheers,


martin

* Re: irc usage..
  2006-05-22 19:38                         ` Linus Torvalds
@ 2006-05-22 19:49                           ` Donnie Berkholz
  2006-05-22 20:20                             ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Langhoff, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

Linus Torvalds wrote:
> Hmm. My cvs server doesn't really grow at all. It's at 13M RSS.

Yeah, that's the thing. RSS stayed about the same (according to top),
but virtual just kept growing.
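
For anyone wanting to log both numbers rather than watch top, one option
on Linux is to read them from /proc. A minimal sketch; mem_kb is a made-up
helper name, and it assumes a Linux-style /proc/<pid>/status file:

```shell
# Print "<virtual> <resident>" in kilobytes for the given process ID.
mem_kb() {
    awk '/^VmSize:/ { vsz = $2 }
         /^VmRSS:/  { rss = $2 }
         END        { print vsz, rss }' "/proc/$1/status"
}
```

Calling it with the PID of the cvs server every few seconds shows whether
the first number keeps climbing while the second stays flat.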

> What version of cvs are you running?
> 
> 	[torvalds@g5 ~]$ cvs --version
> 
> 	Concurrent Versions System (CVS) 1.11.21 (client/server)

Concurrent Versions System (CVS) 1.12.12 (client/server)

Looks like there's a .13 out but the zlib interaction is badly broken
(-z >=1) so my system didn't get upgraded. I'll try it anyway after the
-L run finishes.

Thanks,
Donnie


* Re: irc usage..
  2006-05-22 19:41                         ` Martin Langhoff
@ 2006-05-22 20:11                           ` Linus Torvalds
  2006-05-22 20:33                             ` Linus Torvalds
  2006-05-22 21:41                             ` Matthias Urlichs
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 20:11 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Tue, 23 May 2006, Martin Langhoff wrote:
> 
> The dev machine where I am running the import is a slug! It's still
> working on it, only gotten to 7700 commits, with the cvsimport process
> stable at 28MB RAM and cvs stable at 4MB.

I have to say, that cvsimport script really does do horrible things. It's 
basically a fork/exec/exit benchmark, as far as I can tell. Running 
oprofile on the thing, the top offenders are (ignore the 45% idle thing: 
it's just because this was run on a dual-cpu system, so since it's almost 
completely single-threaded you get ~50% idle by default).

	3117654  45.8708  vmlinux                  vmlinux                  .power4_idle
	802313   11.8046  vmlinux                  vmlinux                  .unmap_vmas
	632913    9.3122  vmlinux                  vmlinux                  .copy_page_range
	150359    2.2123  vmlinux                  vmlinux                  .release_pages
	131330    1.9323  vmlinux                  vmlinux                  .vm_normal_page
	117836    1.7337  libperl.so               libperl.so               (no symbols)
	74098     1.0902  libgklayout.so           libgklayout.so           (no symbols)
	54680     0.8045  vmlinux                  vmlinux                  .free_pages_and_swap_cache
	54300     0.7989  libfb.so                 libfb.so                 (no symbols)
	49052     0.7217  vmlinux                  vmlinux                  .copy_4K_page
	46559     0.6850  libc-2.4.so              libc-2.4.so              getc
	42677     0.6279  vmlinux                  vmlinux                  .page_remove_rmap
	41133     0.6052  libc-2.4.so              libc-2.4.so              ferror
	..

those kernel functions are all about process create/exit, and COW faulting 
after the fork.

Now, this is on ppc, so process creation is likely slower (idiotic PPC VM 
page table hashes), but Linux is actually very good at doing this, and the 
fact that process create/exit is so high is a very big sign that the 
script just ends up executing a _ton_ of small simple processes that do 
almost nothing.

I wonder why those "git-update-index" calls seem to be (assuming I read 
the perl correctly) done only a few files at a time. We can do hundreds 
in one go, but it seems to want to do just ten files or something at the 
same time. Although since most commits should hopefully just modify a 
couple of files, that probably isn't a big deal.

That thing would probably be an order of magnitude faster if written to 
use the git library interfaces directly. Of course, the CVS part is 
probably a big overhead, so it might not help much (I would not be 
surprised at all if a number of the fork/exec/exit things are due to the 
CVS server starting RCS or something, not due to git-cvsimport itself)

		Linus

* Re: irc usage..
  2006-05-22 19:09                       ` Donnie Berkholz
  2006-05-22 19:38                         ` Linus Torvalds
  2006-05-22 19:41                         ` Martin Langhoff
@ 2006-05-22 20:16                         ` Donnie Berkholz
  2 siblings, 0 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22 20:16 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Martin Langhoff, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

Donnie Berkholz wrote:
> OK, I started a new run without -L, and I'm watching it in top right
> now.

Tried a run with -L 1024 and it broke in just a couple of minutes:

Fetching
sys-kernel/linux/files/2.4.0.8/linux-2.4.0-ac8-reiserfs-3.6.25-nfs.diff.gz
  v 1.1
New
sys-kernel/linux/files/2.4.0.8/linux-2.4.0-ac8-reiserfs-3.6.25-nfs.diff.gz:
6367 bytes
Tree ID 457f629df10e70a5ef430f431eca27ed02a83d46
Parent ID 0541d8b54a02df3be50d529497236556c6862a4c
Committed patch 1024 (origin 2001-01-13 00:29:39)
Commit ID ba9d995d12a37502a851e198b67e141623f79544
DONE; creating master branch
cat: write error: Broken pipe

Thanks,
Donnie


* Re: irc usage..
  2006-05-22 19:49                           ` Donnie Berkholz
@ 2006-05-22 20:20                             ` Linus Torvalds
  2006-05-22 21:48                               ` Donnie Berkholz
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 20:20 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Martin Langhoff, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Mon, 22 May 2006, Donnie Berkholz wrote:
>
> Linus Torvalds wrote:
> > Hmm. My cvs server doesn't really grow at all. It's at 13M RSS.
> 
> Yeah, that's the thing. RSS stayed about the same (according to top),
> but virtual just kept growing.

Not for me. The virtual size is certainly bigger than RSS, but not by a 
huge amount. So this might be a regression in CVS, since you seem to have 
a newer version than I do.

The latest stable CVS release is 1.11.21, I think: you seem to be running 
the "development" version (1.12.x).

			Linus

* Re: irc usage..
  2006-05-22 20:11                           ` Linus Torvalds
@ 2006-05-22 20:33                             ` Linus Torvalds
  2006-05-22 21:41                             ` Matthias Urlichs
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 20:33 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Mon, 22 May 2006, Linus Torvalds wrote:
> 
> Of course, the CVS part is probably a big overhead, so it might not help 
> much (I would not be surprised at all if a number of the fork/exec/exit 
> things are due to the CVS server starting RCS or something, not due to 
> git-cvsimport itself)

Ahh. stracing the CVS server seems to imply that it forks off a subprocess 
for every command. It doesn't actually execute any external program, but 
just does a fork + muck around in the ,v files + exit.

Maybe one of the changes in the 1.12.x versions is to not do that, which 
might explain why Donnie seems to see much better performance, but also 
sees all the memory leakage?

		Linus

* Re: irc usage..
  2006-05-22 20:11                           ` Linus Torvalds
  2006-05-22 20:33                             ` Linus Torvalds
@ 2006-05-22 21:41                             ` Matthias Urlichs
  2006-05-22 22:18                               ` Linus Torvalds
  2006-05-22 22:39                               ` Junio C Hamano
  1 sibling, 2 replies; 83+ messages in thread
From: Matthias Urlichs @ 2006-05-22 21:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Langhoff, Donnie Berkholz, Yann Dirson, Git Mailing List,
	Johannes Schindelin

Hi,

Linus Torvalds:
> I wonder why those "git-update-index" calls seem to be (assuming I read 
> the perl correctly) done only a few files at a time. We can do hundreds 
> in one go, but it seems to want to do just ten files or something at the 
> same time.

No, fifty.

I simply was too lazy to count the actual filenames' lengths. ;-)

> That thing would probably be an order of magnitude faster if written to 
> use the git library interfaces directly. Of course, the CVS part is 
> probably a big overhead, so it might not help much 

The beast *was* mainly written to do this remotely...

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
The worst form of inequality is to try to make unequal things equal.
					-- Aristotle

* Re: irc usage..
  2006-05-22 20:20                             ` Linus Torvalds
@ 2006-05-22 21:48                               ` Donnie Berkholz
  2006-05-29 21:54                                 ` Donnie Berkholz
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-22 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Langhoff, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

Linus Torvalds wrote:
> The latest stable CVS release is 1.11.21, I think: you seem to be running 
> the "development" version (1.12.x).

Backed down to the 1.11 series, things seem to be going fine so far.

Thanks,
Donnie


* Re: irc usage..
  2006-05-22 21:41                             ` Matthias Urlichs
@ 2006-05-22 22:18                               ` Linus Torvalds
  2006-05-22 23:23                                 ` Martin Langhoff
  2006-05-22 22:39                               ` Junio C Hamano
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 22:18 UTC (permalink / raw)
  To: Matthias Urlichs
  Cc: Martin Langhoff, Donnie Berkholz, Yann Dirson, Git Mailing List,
	Johannes Schindelin



On Mon, 22 May 2006, Matthias Urlichs wrote:
> 
> The beast *was* mainly written to do this remotely...

I don't think the remote usability is valid, except for some really small 
repositories. The fact that it takes hours even when the CVS server is 
local doesn't bode well for doing it remotely for any but the most trivial 
things.

I really think it would be better to have local use be the optimized case, 
with remote being the "it's _possible_" case.

		Linus

* Re: irc usage..
  2006-05-22 21:41                             ` Matthias Urlichs
  2006-05-22 22:18                               ` Linus Torvalds
@ 2006-05-22 22:39                               ` Junio C Hamano
  2006-05-22 23:15                                 ` Martin Langhoff
  1 sibling, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2006-05-22 22:39 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: git

Matthias Urlichs <smurf@smurf.noris.de> writes:

> Hi,
>
> Linus Torvalds:
>> I wonder why those "git-update-index" calls seem to be (assuming I read 
>> the perl correctly) done only a few files at a time. We can do hundreds 
>> in one go, but it seems to want to do just ten files or something at the 
>> same time.
>
> No, fifty.
>
> I simply was too lazy to count the actual filenames' lengths. ;-)

I think cvsimport predates that option, but these days that loop
can be optimized by feeding --index-info from standard input.
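
For reference, each record fed to "git update-index --index-info" has the
form "<mode> SP <sha1> TAB <path>", and the blob has to be in the object
database already (for example written with "git hash-object -w"). A
minimal sketch of building such records; index_info_line is a made-up
helper name and changes.txt is a hypothetical input file:

```shell
# Format one --index-info record from a mode, an object name and a path.
index_info_line() {
    printf '%s %s\t%s\n' "$1" "$2" "$3"
}

# Sketch of use: a single update-index process for any number of paths.
#   while read -r mode sha path; do
#       index_info_line "$mode" "$sha" "$path"
#   done < changes.txt | git update-index --index-info
```

This replaces one fork+exec per batch of fifty file names with one per
commit, and sidesteps the command-line length limit entirely.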

* Re: irc usage..
  2006-05-22 22:39                               ` Junio C Hamano
@ 2006-05-22 23:15                                 ` Martin Langhoff
  2006-05-23  6:52                                   ` Jeff King
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 23:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Matthias Urlichs, git

On 5/23/06, Junio C Hamano <junkio@cox.net> wrote:
> > I simply was too lazy to count the actual filenames' lengths. ;-)
>
> I think cvsimport predates that option, but these days that loop
> can be optimized by feeding --index-info from standard input.

Oh, yep, that'd be a good addition. I think we can also cut down on
the number of fork+exec calls (as Linus points out they are killing
us) by caching some data we should already have but keep requesting
from git-rev-parse.

Other TODOs from my reading of the code last night...

 - Switch from line-oriented reads to block reads when fetching files
from CVS. This gentoo repo has some large binary blobs in it and
we end up slurping them into memory.

 - Stop abusing globals in commit() -- pass the commit data as parameters.

 - Further profiling? Whatever we are doing, we aren't doing it fast :(
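
On the first item, the idea of a size-driven read can be shown in a few
lines. cvsimport itself is Perl, so this shell version is only an
illustration of the technique, and copy_bytes is a made-up helper name:

```shell
# Copy exactly N bytes from standard input to a file. Newlines are not
# interpreted, so a binary blob never has to be assembled line by line.
copy_bytes() {
    head -c "$1" > "$2"
}
```

When the sender announces the file size up front, the reader can ask for
that many bytes and stream them straight to disk instead of holding the
whole blob in memory.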

Will be trying to do those things in the next few days, don't mind if
someone jumps in as well.



martin

* Re: irc usage..
  2006-05-22 22:18                               ` Linus Torvalds
@ 2006-05-22 23:23                                 ` Martin Langhoff
  2006-05-22 23:29                                   ` Martin Langhoff
  2006-05-22 23:33                                   ` Linus Torvalds
  0 siblings, 2 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthias Urlichs, Donnie Berkholz, Yann Dirson, Git Mailing List,
	Johannes Schindelin

On 5/23/06, Linus Torvalds <torvalds@osdl.org> wrote:
> I don't think the remote usability is valid, except for some really small
> repositories. The fact that it takes hours even when the CVS server is
> local doesn't bode well for doing it remotely for any but the most trivial
> things.

I really don't think that using the local cvs binary is a problem at
all. In my experience, the thing is fairly fast and optimized when you
ask it to perform file-oriented questions and that's all we do,
really.

If you want to try it, you'll see that local checkouts of large trees
(like this gentoo one) are fairly fast. Not as fast as GIT itself, but
good enough. I think Donnie has hit a bug with a bad version of cvs,
but other than that, my experience with it is that it is fairly well
behaved -- even if the tool is bad, ubiquity has led to resiliency
over the years.

> I really think it would be better to have local use be the optimized case,
> with remote being the "it's _possible_" case.

Agreed, but I think we won't see much benefit in direct parsing. And
we'll have to take the hit of double-implementation.

In any case, we have it already -- parsecvs does it quite well (modulo
memory leaks!) and I've used it several times in conjunction with
cvsimport. Just perform the initial import with parsecvs and then
'track' the remote project with cvsimport.

The problem is that they lead to slightly different trees. So their
output is not consistent, and I don't think that'll be easy to fix.

cheers,


martin

* Re: irc usage..
  2006-05-22 23:23                                 ` Martin Langhoff
@ 2006-05-22 23:29                                   ` Martin Langhoff
  2006-05-22 23:33                                   ` Linus Torvalds
  1 sibling, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-22 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthias Urlichs, Donnie Berkholz, Yann Dirson, Git Mailing List,
	Johannes Schindelin

On 5/23/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> The problem is that they lead to slightly different trees.

Sorry! s/trees/histories/ there. The trees are (or should be!) the
same, and tree differences should be addressed as bugs. Differences in
how history is parsed are unavoidable right now.

martin

* Re: irc usage..
  2006-05-22 23:23                                 ` Martin Langhoff
  2006-05-22 23:29                                   ` Martin Langhoff
@ 2006-05-22 23:33                                   ` Linus Torvalds
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-22 23:33 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Matthias Urlichs, Donnie Berkholz, Yann Dirson, Git Mailing List,
	Johannes Schindelin



On Tue, 23 May 2006, Martin Langhoff wrote:
> 
> I really don't think that using the local cvs binary is a problem at
> all. In my experience, the thing is fairly fast and optimized when you
> ask it to perform file-oriented questions and that's all we do,
> really.

Fair enough. My worry was mainly that the cvs server was doing something 
stupid, but I suspect most of the fork/exec's are probably from the 
cvsimport perl script itself.

> In any case, we have it already -- parsecvs does it quite well (modulo
> memory leaks!) and I've used it several times in conjunction with
> cvsimport. Just perform the initial import with parsecvs and then
> 'track' the remote project with cvsimport.

I didn't get parsecvs working when I tried it a long time ago, and Donnie 
reported that it ran out of memory, so I didn't even really consider it. 
I'd love for it to work well, and it may be reasonable to do really big 
imports on multi-gigabyte 64-bit machines (after all, they aren't _hard_ 
to find any more, and you only need to do it once).

That said, it still seems pretty stupid to require that much memory just 
to import from CVS.

		Linus

* Re: irc usage..
  2006-05-22 23:15                                 ` Martin Langhoff
@ 2006-05-23  6:52                                   ` Jeff King
  2006-05-23  6:58                                     ` Jeff King
  2006-05-23  7:00                                     ` [PATCH 2/2] cvsimport: cleanup commit function Jeff King
  0 siblings, 2 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  6:52 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, Matthias Urlichs, git

On Tue, May 23, 2006 at 11:15:07AM +1200, Martin Langhoff wrote:

> >I think cvsimport predates that option, but these days that loop
> >can be optimized by feeding --index-info from standard input.
> Oh, yep, that'd be a good addition. I think we can also cut down on

This patch is relatively simple, and I'll post it in a moment.

I also made a few other cleanups to commit() which apply on top of that;
I'll post it also.

> - Stop abusing globals in commit() -- pass the commit data as parameters.

Some of the globals actually get modified in commit() (e.g., @old and
@new get cleared).  So we need to either pass them in as references or
remember to do that cleanup each time it is called (which is really only
twice, I think).

> Will be trying to do those things in the next few days, don't mind if
> someone jumps in as well.

I can look at the line/block CVS file slurping, but not tonight.

-Peff

* Re: irc usage..
  2006-05-23  6:52                                   ` Jeff King
@ 2006-05-23  6:58                                     ` Jeff King
  2006-05-23  7:01                                       ` [PATCH 1/2] cvsimport: use git-update-index --index-info Jeff King
  2006-05-23  7:00                                     ` [PATCH 2/2] cvsimport: cleanup commit function Jeff King
  1 sibling, 1 reply; 83+ messages in thread
From: Jeff King @ 2006-05-23  6:58 UTC (permalink / raw)
  To: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

>From nobody Mon Sep 17 00:00:00 2001
From: Jeff King <peff@peff.net>
Date: Tue, 23 May 2006 01:16:07 -0400
Subject: [PATCH 1/2] cvsimport: use git-update-index --index-info

This should reduce the number of git-update-index forks required per
commit. We now do adds/removes in one call, and we are no longer forced to
deal with argv limitations.

---

cb6452bbfda9c52ad8dbeaac6e3440ae61099a05
 git-cvsimport.perl |   36 +++++++++++++-----------------------
 1 files changed, 13 insertions(+), 23 deletions(-)

cb6452bbfda9c52ad8dbeaac6e3440ae61099a05
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index d257e66..4efb0a5 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -565,29 +565,19 @@ my($patchset,$date,$author_name,$author_
 my(@old,@new,@skipped);
 sub commit {
 	my $pid;
-	while(@old) {
-		my @o2;
-		if(@old > 55) {
-			@o2 = splice(@old,0,50);
-		} else {
-			@o2 = @old;
-			@old = ();
-		}
-		system("git-update-index","--force-remove","--",@o2);
-		die "Cannot remove files: $?\n" if $?;
-	}
-	while(@new) {
-		my @n2;
-		if(@new > 12) {
-			@n2 = splice(@new,0,10);
-		} else {
-			@n2 = @new;
-			@new = ();
-		}
-		system("git-update-index","--add",
-			(map { ('--cacheinfo', @$_) } @n2));
-		die "Cannot add files: $?\n" if $?;
-	}
+
+      	open(my $fh, '|-', qw(git-update-index --index-info))
+		or die "unable to open git-update-index: $!";
+	print $fh 
+		(map { "0 0000000000000000000000000000000000000000\t$_\n" }
+			@old),
+		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\n" }
+			@new)
+		or die "unable to write to git-update-index: $!";
+	close $fh
+		or die "unable to write to git-update-index: $!";
+	$? and die "git-update-index reported error: $?";
+	@old = @new = ();
 
 	$pid = open(C,"-|");
 	die "Cannot fork: $!" unless defined $pid;
-- 
1.3.3.gcb64-dirty

* [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  6:52                                   ` Jeff King
  2006-05-23  6:58                                     ` Jeff King
@ 2006-05-23  7:00                                     ` Jeff King
       [not found]                                       ` <7v4pzh6wtr.fsf@assigned-by-dhcp.cox.net>
                                                         ` (3 more replies)
  1 sibling, 4 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  7:00 UTC (permalink / raw)
  To: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
  - report get_headref errors on opening ref unless the error is ENOENT
  - use regex to check for sha1 instead of length
  - use lexically scoped filehandles which get cleaned up automagically
  - check for error on both 'print' and 'close' (since output is buffered)
  - avoid "fork, do some perl, then exec" in commit(). It's not necessary,
    and we probably end up COW'ing parts of the perl process. Plus the code
    is much smaller because we can use open2()
  - avoid calling strftime over and over (mainly a readability cleanup)

---

I know this patch is quite large. I can try to split it if you want, but
I suspect it's not worth the effort (either you like refactoring or you
don't :) ).
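
To illustrate the "check for error on both 'print' and 'close'" point
from the list above, here is a minimal sketch rather than code from the
patch. The helper name feed_command and the sample commands are made up
for the example; it assumes a Unix-like system with sh and cat available.

```perl
use strict;
use warnings;

# Hypothetical helper: write @lines to the stdin of a command and check
# every step. $cmd is an array ref holding the program and its arguments.
sub feed_command {
	my ($cmd, @lines) = @_;
	# List-form pipe open; $fh is lexically scoped and closed on scope exit.
	open(my $fh, '|-', @$cmd)
		or die "unable to open $cmd->[0]: $!";
	print {$fh} @lines
		or die "unable to write to $cmd->[0]: $!";
	# Output is buffered, so a write error may only show up at close.
	# close on a pipe also waits for the child and sets $?.
	close $fh
		or die $! ? "unable to close pipe to $cmd->[0]: $!"
		          : "$cmd->[0] exited with status " . ($? >> 8);
	return;
}

feed_command(['cat'], "hello\n");
eval { feed_command(['sh', '-c', 'cat >/dev/null; exit 3'], "ignored\n") };
print "caught: $@" if $@;
```

The second call shows why the close check matters: the print itself
succeeds, and the failure is only visible once the pipe is closed.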

9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
 git-cvsimport.perl |  150 ++++++++++++++++++++++------------------------------
 1 files changed, 64 insertions(+), 86 deletions(-)

9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 4efb0a5..f8feb52 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
 use Time::Local;
 use IO::Socket;
 use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
 use IPC::Open2;
 
 $SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
 	return $pwd;
 }
 
+sub is_sha1 {
+	my $s = shift;
+	return $s =~ /^[a-zA-Z0-9]{40}$/;
+}
 
-sub get_headref($$) {
+sub get_headref ($$) {
     my $name    = shift;
     my $git_dir = shift; 
-    my $sha;
     
-    if (open(C,"$git_dir/refs/heads/$name")) {
-	chomp($sha = <C>);
-	close(C);
-	length($sha) == 40
-	    or die "Cannot get head id for $name ($sha): $!\n";
+    my $f = "$git_dir/refs/heads/$name";
+    if(open(my $fh, $f)) {
+      	    chomp(my $r = <$fh>);
+	    is_sha1($r) or die "Cannot get head id for $name ($r): $!";
+	    return $r;
     }
-    return $sha;
+    die "unable to open $f: $!" unless $! == POSIX::ENOENT;
+    return undef;
 }
 
-
 -d $git_tree
 	or mkdir($git_tree,0777)
 	or die "Could not create $git_tree: $!";
@@ -561,90 +564,67 @@ #---------------------
 
 my $state = 0;
 
-my($patchset,$date,$author_name,$author_email,$branch,$ancestor,$tag,$logmsg);
-my(@old,@new,@skipped);
-sub commit {
-	my $pid;
-
+sub update_index (\@\@) {
+	my $old = shift;
+	my $new = shift;
       	open(my $fh, '|-', qw(git-update-index --index-info))
 		or die "unable to open git-update-index: $!";
 	print $fh 
 		(map { "0 0000000000000000000000000000000000000000\t$_\n" }
-			@old),
+			@$old),
 		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\n" }
-			@new)
+			@$new)
 		or die "unable to write to git-update-index: $!";
 	close $fh
 		or die "unable to write to git-update-index: $!";
 	$? and die "git-update-index reported error: $?";
-	@old = @new = ();
+}
 
-	$pid = open(C,"-|");
-	die "Cannot fork: $!" unless defined $pid;
-	unless($pid) {
-		exec("git-write-tree");
-		die "Cannot exec git-write-tree: $!\n";
-	}
-	chomp(my $tree = <C>);
-	length($tree) == 40
-		or die "Cannot get tree id ($tree): $!\n";
-	close(C)
+sub write_tree () {
+	open(my $fh, '-|', qw(git-write-tree))
+		or die "unable to open git-write-tree: $!";
+	chomp(my $tree = <$fh>);
+	is_sha1($tree)
+		or die "Cannot get tree id ($tree): $!";
+	close($fh)
 		or die "Error running git-write-tree: $?\n";
 	print "Tree ID $tree\n" if $opt_v;
+	return $tree;
+}
 
-	my $parent = "";
-	if(open(C,"$git_dir/refs/heads/$last_branch")) {
-		chomp($parent = <C>);
-		close(C);
-		length($parent) == 40
-			or die "Cannot get parent id ($parent): $!\n";
-		print "Parent ID $parent\n" if $opt_v;
-	}
-
-	my $pr = IO::Pipe->new() or die "Cannot open pipe: $!\n";
-	my $pw = IO::Pipe->new() or die "Cannot open pipe: $!\n";
-	$pid = fork();
-	die "Fork: $!\n" unless defined $pid;
-	unless($pid) {
-		$pr->writer();
-		$pw->reader();
-		open(OUT,">&STDOUT");
-		dup2($pw->fileno(),0);
-		dup2($pr->fileno(),1);
-		$pr->close();
-		$pw->close();
-
-		my @par = ();
-		@par = ("-p",$parent) if $parent;
-
-		# loose detection of merges
-		# based on the commit msg
-		foreach my $rx (@mergerx) {
-			if ($logmsg =~ $rx) {
-				my $mparent = $1;
-				if ($mparent eq 'HEAD') { $mparent = $opt_o };
-				if ( -e "$git_dir/refs/heads/$mparent") {
-					$mparent = get_headref($mparent, $git_dir);
-					push @par, '-p', $mparent;
-					print OUT "Merge parent branch: $mparent\n" if $opt_v;
-				}
-			}
+my($patchset,$date,$author_name,$author_email,$branch,$ancestor,$tag,$logmsg);
+my(@old,@new,@skipped);
+sub commit {
+	update_index(@old, @new);
+	@old = @new = ();
+	my $tree = write_tree();
+	my $parent = get_headref($last_branch, $git_dir);
+	print "Parent ID " . ($parent ? $parent : "(empty)") . "\n" if $opt_v;
+
+	my @commit_args;
+	push @commit_args, ("-p", $parent) if $parent;
+
+	# loose detection of merges
+	# based on the commit msg
+	foreach my $rx (@mergerx) {
+		next unless $logmsg =~ $rx && $1;
+		my $mparent = $1 eq 'HEAD' ? $opt_o : $1;
+		if(my $sha1 = get_headref($mparent, $git_dir)) {
+			push @commit_args, '-p', $mparent;
+			print "Merge parent branch: $mparent\n" if $opt_v;
 		}
-
-		exec("env",
-			"GIT_AUTHOR_NAME=$author_name",
-			"GIT_AUTHOR_EMAIL=$author_email",
-			"GIT_AUTHOR_DATE=".strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date)),
-			"GIT_COMMITTER_NAME=$author_name",
-			"GIT_COMMITTER_EMAIL=$author_email",
-			"GIT_COMMITTER_DATE=".strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date)),
-			"git-commit-tree", $tree,@par);
-		die "Cannot exec git-commit-tree: $!\n";
-
-		close OUT;
 	}
-	$pw->writer();
-	$pr->reader();
+
+	my $commit_date = strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date));
+	my $pid = open2(my $commit_read, my $commit_write,
+		'env',
+		"GIT_AUTHOR_NAME=$author_name",
+		"GIT_AUTHOR_EMAIL=$author_email",
+		"GIT_AUTHOR_DATE=$commit_date",
+		"GIT_COMMITTER_NAME=$author_name",
+		"GIT_COMMITTER_EMAIL=$author_email",
+		"GIT_COMMITTER_DATE=$commit_date",
+		'git-commit-tree', $tree, @commit_args);
 
 	# compatibility with git2cvs
 	substr($logmsg,32767) = "" if length($logmsg) > 32767;
@@ -656,16 +636,14 @@ sub commit {
 	    @skipped = ();
 	}
 
-	print $pw "$logmsg\n"
+	print($commit_write "$logmsg\n") && close($commit_write)
 		or die "Error writing to git-commit-tree: $!\n";
-	$pw->close();
 
-	print "Committed patch $patchset ($branch ".strftime("%Y-%m-%d %H:%M:%S",gmtime($date)).")\n" if $opt_v;
-	chomp(my $cid = <$pr>);
-	length($cid) == 40
-		or die "Cannot get commit id ($cid): $!\n";
+	print "Committed patch $patchset ($branch $commit_date)\n" if $opt_v;
+	chomp(my $cid = <$commit_read>);
+	is_sha1($cid) or die "Cannot get commit id ($cid): $!\n";
 	print "Commit ID $cid\n" if $opt_v;
-	$pr->close();
+	close($commit_read);
 
 	waitpid($pid,0);
 	die "Error running git-commit-tree: $?\n" if $?;
-- 
1.3.3.gcb64-dirty

* [PATCH 1/2] cvsimport: use git-update-index --index-info
  2006-05-23  6:58                                     ` Jeff King
@ 2006-05-23  7:01                                       ` Jeff King
  0 siblings, 0 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  7:01 UTC (permalink / raw)
  To: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

This should reduce the number of git-update-index forks required per
commit. We now do adds/removes in one call, and we are no longer forced to
deal with argv limitations.

---

Oops, apparently using a mail reader is too challenging for me. Here's a
repost with the headers correctly merged.

cb6452bbfda9c52ad8dbeaac6e3440ae61099a05
 git-cvsimport.perl |   36 +++++++++++++-----------------------
 1 files changed, 13 insertions(+), 23 deletions(-)

cb6452bbfda9c52ad8dbeaac6e3440ae61099a05
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index d257e66..4efb0a5 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -565,29 +565,19 @@ my($patchset,$date,$author_name,$author_
 my(@old,@new,@skipped);
 sub commit {
 	my $pid;
-	while(@old) {
-		my @o2;
-		if(@old > 55) {
-			@o2 = splice(@old,0,50);
-		} else {
-			@o2 = @old;
-			@old = ();
-		}
-		system("git-update-index","--force-remove","--",@o2);
-		die "Cannot remove files: $?\n" if $?;
-	}
-	while(@new) {
-		my @n2;
-		if(@new > 12) {
-			@n2 = splice(@new,0,10);
-		} else {
-			@n2 = @new;
-			@new = ();
-		}
-		system("git-update-index","--add",
-			(map { ('--cacheinfo', @$_) } @n2));
-		die "Cannot add files: $?\n" if $?;
-	}
+
+      	open(my $fh, '|-', qw(git-update-index --index-info))
+		or die "unable to open git-update-index: $!";
+	print $fh 
+		(map { "0 0000000000000000000000000000000000000000\t$_\n" }
+			@old),
+		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\n" }
+			@new)
+		or die "unable to write to git-update-index: $!";
+	close $fh
+		or die "unable to write to git-update-index: $!";
+	$? and die "git-update-index reported error: $?";
+	@old = @new = ();
 
 	$pid = open(C,"-|");
 	die "Cannot fork: $!" unless defined $pid;
-- 
1.3.3.gcb64-dirty

* Re: [PATCH 2/2] cvsimport: cleanup commit function
       [not found]                                       ` <7v4pzh6wtr.fsf@assigned-by-dhcp.cox.net>
@ 2006-05-23  7:13                                         ` Jeff King
  0 siblings, 0 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  7:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[cc'd to list to get reactions on open2]

On Tue, May 23, 2006 at 12:10:08AM -0700, Junio C Hamano wrote:

> > +	return $s =~ /^[a-zA-Z0-9]{40}$/;
> [0-9a-f] (We always do lowercase).

Er, yes, that was a complete think-o on my part.
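
For illustration, a small sketch of the check with the lowercase-hex
class Junio suggests. Anchoring with \A and \z (instead of ^ and $) and
the defined() guard are extra tightening assumed here, not something
taken from the patch.

```perl
use strict;
use warnings;

# Accept exactly 40 lowercase hex digits and nothing else.
sub is_sha1 {
	my $s = shift;
	return (defined $s && $s =~ /\A[0-9a-f]{40}\z/) ? 1 : 0;
}

my $forty_z = 'z' x 40;    # passes a length check and [a-zA-Z0-9], but is not hex
for my $candidate ('cb6452bbfda9c52ad8dbeaac6e3440ae61099a05',
                   'CB6452BBFDA9C52AD8DBEAAC6E3440AE61099A05',
                   $forty_z,
                   'cb6452b') {
	printf "%-40s %s\n", $candidate, is_sha1($candidate) ? 'ok' : 'rejected';
}
```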

> Hmm.  I personally do not have problems with open2, but folks on
> some other platforms might.  I'll see how the list audience
> sounds.

FWIW, it was already being used in git-cvsimport.

-Peff

* [PATCH 1/2] cvsimport: use git-update-index --index-info
  2006-05-23  7:00                                     ` [PATCH 2/2] cvsimport: cleanup commit function Jeff King
       [not found]                                       ` <7v4pzh6wtr.fsf@assigned-by-dhcp.cox.net>
@ 2006-05-23  7:27                                       ` Jeff King
  2006-05-23  8:13                                       ` [PATCH 2/2] cvsimport: cleanup commit function Martin Langhoff
  2006-05-23 17:47                                       ` Morten Welinder
  3 siblings, 0 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  7:27 UTC (permalink / raw)
  To: git; +Cc: martin, junkio

This should reduce the number of git-update-index forks required per
commit. We now do adds/removes in one call, and we are no longer forced to
deal with argv limitations.

---

This is a repost using -z/NUL instead of line feeds.
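
As a rough illustration of the records this version writes to
"git-update-index -z --index-info", here is a sketch that only builds and
prints them. The helper name index_info_records and the sample paths are
invented for the example; the record layout is copied from the patch
below.

```perl
use strict;
use warnings;

# Hypothetical helper: build the NUL-terminated records.
#   $old: array ref of paths to remove
#   $new: array ref of [permission bits, sha1, path] triples
sub index_info_records {
	my ($old, $new) = @_;
	my $zero = '0' x 40;
	return (
		# mode 0 and an all-zero sha1 remove the path from the index
		(map { "0 $zero\t$_\0" } @$old),
		# '100' marks a regular file; the permission bits follow in octal
		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\0" } @$new),
	);
}

my @records = index_info_records(
	['gone.c'],
	[[0644, 'a' x 40, 'dir/new.c']],
);

# Make the NUL terminators visible when printing.
(my $shown = join('', @records)) =~ s/\0/\\0\n/g;
print $shown;
```

With NUL as the terminator, a path containing a line feed no longer ends
a record early.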

d82d215430ae5e79210f73a31f5f8a053f36c27f
 git-cvsimport.perl |   36 +++++++++++++-----------------------
 1 files changed, 13 insertions(+), 23 deletions(-)

d82d215430ae5e79210f73a31f5f8a053f36c27f
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index d257e66..a65bea6 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -565,29 +565,19 @@ my($patchset,$date,$author_name,$author_
 my(@old,@new,@skipped);
 sub commit {
 	my $pid;
-	while(@old) {
-		my @o2;
-		if(@old > 55) {
-			@o2 = splice(@old,0,50);
-		} else {
-			@o2 = @old;
-			@old = ();
-		}
-		system("git-update-index","--force-remove","--",@o2);
-		die "Cannot remove files: $?\n" if $?;
-	}
-	while(@new) {
-		my @n2;
-		if(@new > 12) {
-			@n2 = splice(@new,0,10);
-		} else {
-			@n2 = @new;
-			@new = ();
-		}
-		system("git-update-index","--add",
-			(map { ('--cacheinfo', @$_) } @n2));
-		die "Cannot add files: $?\n" if $?;
-	}
+
+	open(my $fh, '|-', qw(git-update-index -z --index-info))
+		or die "unable to open git-update-index: $!";
+	print $fh 
+		(map { "0 0000000000000000000000000000000000000000\t$_\0" }
+			@old),
+		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\0" }
+			@new)
+		or die "unable to write to git-update-index: $!";
+	close $fh
+		or die "unable to write to git-update-index: $!";
+	$? and die "git-update-index reported error: $?";
+	@old = @new = ();
 
 	$pid = open(C,"-|");
 	die "Cannot fork: $!" unless defined $pid;
-- 
1.3.3.g3408

* [PATCH 2/2] cvsimport: cleanup commit function
       [not found] <1148369266352-git-send-email-1>
@ 2006-05-23  7:27 ` Jeff King
  0 siblings, 0 replies; 83+ messages in thread
From: Jeff King @ 2006-05-23  7:27 UTC (permalink / raw)
  To: git; +Cc: martin, junkio

This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
  - report get_headref errors on opening ref unless the error is ENOENT
  - use regex to check for sha1 instead of length
  - use lexically scoped filehandles which get cleaned up automagically
  - check for error on both 'print' and 'close' (since output is buffered)
  - avoid "fork, do some perl, then exec" in commit(). It's not necessary,
    and we probably end up COW'ing parts of the perl process. Plus the code
    is much smaller because we can use open2()
  - avoid calling strftime over and over (mainly a readability cleanup)

---

This is a repost with some minor fixups from Junio (and based off of the
fixed 1/2 patch).

3408c8d8364f816a7c4a34a03045f466bf028540
 git-cvsimport.perl |  150 ++++++++++++++++++++++------------------------------
 1 files changed, 64 insertions(+), 86 deletions(-)

3408c8d8364f816a7c4a34a03045f466bf028540
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index a65bea6..219f6dc 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
 use Time::Local;
 use IO::Socket;
 use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
 use IPC::Open2;
 
 $SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
 	return $pwd;
 }
 
+sub is_sha1 {
+	my $s = shift;
+	return $s =~ /^[a-f0-9]{40}$/;
+}
 
-sub get_headref($$) {
+sub get_headref ($$) {
     my $name    = shift;
     my $git_dir = shift; 
-    my $sha;
     
-    if (open(C,"$git_dir/refs/heads/$name")) {
-	chomp($sha = <C>);
-	close(C);
-	length($sha) == 40
-	    or die "Cannot get head id for $name ($sha): $!\n";
+    my $f = "$git_dir/refs/heads/$name";
+    if(open(my $fh, $f)) {
+      	    chomp(my $r = <$fh>);
+	    is_sha1($r) or die "Cannot get head id for $name ($r): $!";
+	    return $r;
     }
-    return $sha;
+    die "unable to open $f: $!" unless $! == POSIX::ENOENT;
+    return undef;
 }
 
-
 -d $git_tree
 	or mkdir($git_tree,0777)
 	or die "Could not create $git_tree: $!";
@@ -561,90 +564,67 @@ #---------------------
 
 my $state = 0;
 
-my($patchset,$date,$author_name,$author_email,$branch,$ancestor,$tag,$logmsg);
-my(@old,@new,@skipped);
-sub commit {
-	my $pid;
-
+sub update_index (\@\@) {
+	my $old = shift;
+	my $new = shift;
 	open(my $fh, '|-', qw(git-update-index -z --index-info))
 		or die "unable to open git-update-index: $!";
 	print $fh 
 		(map { "0 0000000000000000000000000000000000000000\t$_\0" }
-			@old),
+			@$old),
 		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\0" }
-			@new)
+			@$new)
 		or die "unable to write to git-update-index: $!";
 	close $fh
 		or die "unable to write to git-update-index: $!";
 	$? and die "git-update-index reported error: $?";
-	@old = @new = ();
+}
 
-	$pid = open(C,"-|");
-	die "Cannot fork: $!" unless defined $pid;
-	unless($pid) {
-		exec("git-write-tree");
-		die "Cannot exec git-write-tree: $!\n";
-	}
-	chomp(my $tree = <C>);
-	length($tree) == 40
-		or die "Cannot get tree id ($tree): $!\n";
-	close(C)
+sub write_tree () {
+	open(my $fh, '-|', qw(git-write-tree))
+		or die "unable to open git-write-tree: $!";
+	chomp(my $tree = <$fh>);
+	is_sha1($tree)
+		or die "Cannot get tree id ($tree): $!";
+	close($fh)
 		or die "Error running git-write-tree: $?\n";
 	print "Tree ID $tree\n" if $opt_v;
+	return $tree;
+}
 
-	my $parent = "";
-	if(open(C,"$git_dir/refs/heads/$last_branch")) {
-		chomp($parent = <C>);
-		close(C);
-		length($parent) == 40
-			or die "Cannot get parent id ($parent): $!\n";
-		print "Parent ID $parent\n" if $opt_v;
-	}
-
-	my $pr = IO::Pipe->new() or die "Cannot open pipe: $!\n";
-	my $pw = IO::Pipe->new() or die "Cannot open pipe: $!\n";
-	$pid = fork();
-	die "Fork: $!\n" unless defined $pid;
-	unless($pid) {
-		$pr->writer();
-		$pw->reader();
-		open(OUT,">&STDOUT");
-		dup2($pw->fileno(),0);
-		dup2($pr->fileno(),1);
-		$pr->close();
-		$pw->close();
-
-		my @par = ();
-		@par = ("-p",$parent) if $parent;
-
-		# loose detection of merges
-		# based on the commit msg
-		foreach my $rx (@mergerx) {
-			if ($logmsg =~ $rx) {
-				my $mparent = $1;
-				if ($mparent eq 'HEAD') { $mparent = $opt_o };
-				if ( -e "$git_dir/refs/heads/$mparent") {
-					$mparent = get_headref($mparent, $git_dir);
-					push @par, '-p', $mparent;
-					print OUT "Merge parent branch: $mparent\n" if $opt_v;
-				}
-			}
+my($patchset,$date,$author_name,$author_email,$branch,$ancestor,$tag,$logmsg);
+my(@old,@new,@skipped);
+sub commit {
+	update_index(@old, @new);
+	@old = @new = ();
+	my $tree = write_tree();
+	my $parent = get_headref($last_branch, $git_dir);
+	print "Parent ID " . ($parent ? $parent : "(empty)") . "\n" if $opt_v;
+
+	my @commit_args;
+	push @commit_args, ("-p", $parent) if $parent;
+
+	# loose detection of merges
+	# based on the commit msg
+	foreach my $rx (@mergerx) {
+		next unless $logmsg =~ $rx && $1;
+		my $mparent = $1 eq 'HEAD' ? $opt_o : $1;
+		if(my $sha1 = get_headref($mparent, $git_dir)) {
+			push @commit_args, '-p', $mparent;
+			print "Merge parent branch: $mparent\n" if $opt_v;
 		}
-
-		exec("env",
-			"GIT_AUTHOR_NAME=$author_name",
-			"GIT_AUTHOR_EMAIL=$author_email",
-			"GIT_AUTHOR_DATE=".strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date)),
-			"GIT_COMMITTER_NAME=$author_name",
-			"GIT_COMMITTER_EMAIL=$author_email",
-			"GIT_COMMITTER_DATE=".strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date)),
-			"git-commit-tree", $tree,@par);
-		die "Cannot exec git-commit-tree: $!\n";
-
-		close OUT;
 	}
-	$pw->writer();
-	$pr->reader();
+
+	my $commit_date = strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date));
+	my $pid = open2(my $commit_read, my $commit_write,
+		'env',
+		"GIT_AUTHOR_NAME=$author_name",
+		"GIT_AUTHOR_EMAIL=$author_email",
+		"GIT_AUTHOR_DATE=$commit_date",
+		"GIT_COMMITTER_NAME=$author_name",
+		"GIT_COMMITTER_EMAIL=$author_email",
+		"GIT_COMMITTER_DATE=$commit_date",
+		'git-commit-tree', $tree, @commit_args);
 
 	# compatibility with git2cvs
 	substr($logmsg,32767) = "" if length($logmsg) > 32767;
@@ -656,16 +636,14 @@ sub commit {
 	    @skipped = ();
 	}
 
-	print $pw "$logmsg\n"
+	print($commit_write "$logmsg\n") && close($commit_write)
 		or die "Error writing to git-commit-tree: $!\n";
-	$pw->close();
 
-	print "Committed patch $patchset ($branch ".strftime("%Y-%m-%d %H:%M:%S",gmtime($date)).")\n" if $opt_v;
-	chomp(my $cid = <$pr>);
-	length($cid) == 40
-		or die "Cannot get commit id ($cid): $!\n";
+	print "Committed patch $patchset ($branch $commit_date)\n" if $opt_v;
+	chomp(my $cid = <$commit_read>);
+	is_sha1($cid) or die "Cannot get commit id ($cid): $!\n";
 	print "Commit ID $cid\n" if $opt_v;
-	$pr->close();
+	close($commit_read);
 
 	waitpid($pid,0);
 	die "Error running git-commit-tree: $?\n" if $?;
-- 
1.3.3.g3408

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  7:00                                     ` [PATCH 2/2] cvsimport: cleanup commit function Jeff King
       [not found]                                       ` <7v4pzh6wtr.fsf@assigned-by-dhcp.cox.net>
  2006-05-23  7:27                                       ` [PATCH 1/2] cvsimport: use git-update-index --index-info Jeff King
@ 2006-05-23  8:13                                       ` Martin Langhoff
  2006-05-23  8:24                                         ` Junio C Hamano
  2006-05-23 16:50                                         ` Linus Torvalds
  2006-05-23 17:47                                       ` Morten Welinder
  3 siblings, 2 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-23  8:13 UTC (permalink / raw)
  To: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

Jeff,

good stuff -- aiming at exactly the things that had been nagging me.
Some minor notes on top of what Junio's mentioned...

> +    die "unable to open $f: $!" unless $! == POSIX::ENOENT;
> +    return undef;

Heh. Is that the return of the living dead?

> +sub update_index (\@\@) {
> +       my $old = shift;
> +       my $new = shift;

Would it not make more sense to just pass them as plain parameters?

> +       print "Committed patch $patchset ($branch $commit_date)\n" if

Given that we have that -- should we remember it and avoid re-reading
the headref from disk? A %seenheads cache would save us 99.9% of the
hassle.
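
A minimal sketch of what such a cache might look like. Only the name
%seenheads comes from the suggestion above; every other name is
hypothetical, and read_headref_from_disk is a stand-in for the real
get_headref().

```perl
use strict;
use warnings;

my %seenheads;         # branch name => last known commit id
my $disk_reads = 0;    # counter kept only to show the effect

# Stand-in for get_headref(), which would read refs/heads/$name.
sub read_headref_from_disk {
	my $name = shift;
	$disk_reads++;
	return 'a' x 40;
}

# Consult the cache first and fall back to the ref file once.
sub cached_headref {
	my $name = shift;
	$seenheads{$name} = read_headref_from_disk($name)
		unless exists $seenheads{$name};
	return $seenheads{$name};
}

# After a commit the importer already knows the new head id.
sub remember_head {
	my ($name, $cid) = @_;
	$seenheads{$name} = $cid;
	return;
}

cached_headref('origin') for 1 .. 3;
remember_head('origin', 'b' x 40);
print "disk reads: $disk_reads\n";
print "head now:   ", cached_headref('origin'), "\n";
```

Such a cache is only valid as long as nothing else updates the refs
while the import runs.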

In related news, I've dealt with file reads from the socket being
memory-bound. Should merge ok.

cheers,


martin

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  8:13                                       ` [PATCH 2/2] cvsimport: cleanup commit function Martin Langhoff
@ 2006-05-23  8:24                                         ` Junio C Hamano
  2006-05-23 20:32                                           ` Martin Langhoff
  2006-05-23 16:50                                         ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2006-05-23  8:24 UTC (permalink / raw)
  To: git

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> Jeff,
>
> good stuff -- aiming at exactly the things that had been nagging me.
> Some minor notes on top of what junio's mentioned...
>
>> +    die "unable to open $f: $!" unless $! == POSIX::ENOENT;
>> +    return undef;
>
> Heh. Is that the return of the living dead?

Note the trailing "unless" there.
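
A small sketch of the idiom, for illustration. The function name
read_first_line is made up, but the shape mirrors get_headref in the
patch: the die runs only when the trailing condition holds, so a missing
file falls through to the return.

```perl
use strict;
use warnings;
use POSIX qw(:errno_h);

# Return the first line of $path, undef if the file does not exist,
# and die on any other open failure (permission denied, for example).
sub read_first_line {
	my $path = shift;
	if (open(my $fh, '<', $path)) {
		my $line = <$fh>;
		chomp $line if defined $line;
		return $line;
	}
	# Statement modifier: equivalent to
	#   if ($! != ENOENT) { die ... }
	die "unable to open $path: $!" unless $! == ENOENT;
	return undef;
}

my $missing = read_first_line("/nonexistent/demo-$$");
print defined($missing) ? "got: $missing\n" : "no such file, returned undef\n";
```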

>> +sub update_index (\@\@) {
>> +       my $old = shift;
>> +       my $new = shift;
>
> Would it not make more sense to just pass them as plain parameters?

Meaning...?  Perl5 can pass only one flat array, so the above is
a standard way to pass two arrays.
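
To make the difference concrete, a short sketch with made-up function
names:

```perl
use strict;
use warnings;

# Without a prototype both arrays are flattened into one @_ list.
sub count_flat { return scalar @_ }

# With the (\@\@) prototype Perl passes a reference to each array,
# so the callee can still tell them apart.
sub count_each (\@\@) {
	my ($old, $new) = @_;
	return (scalar @$old, scalar @$new);
}

my @old = qw(a b);
my @new = qw(c d e);

print "flat: ", count_flat(@old, @new), "\n";                  # flat: 5
print "each: ", join(' and ', count_each(@old, @new)), "\n";   # each: 2 and 3
```

The prototype only takes effect for calls compiled after the sub is
declared, and not when the sub is called as &count_each(...).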

>> +       print "Committed patch $patchset ($branch $commit_date)\n" if
>
> Given that we have that -- should we remember it and avoid re-reading
> the headref from disk? A %seenheads cache would save us 99.9% of the
> hassle.
>
> In related news, I've dealt with file reads from the socket being
> memory-bound. Should merge ok.

Merged OK, and I think your last suggestion makes sense.  I'll
go to bed after pushing out Jeff's two patches and yours.

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  8:13                                       ` [PATCH 2/2] cvsimport: cleanup commit function Martin Langhoff
  2006-05-23  8:24                                         ` Junio C Hamano
@ 2006-05-23 16:50                                         ` Linus Torvalds
  2006-05-23 19:36                                           ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-23 16:50 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, Matthias Urlichs, git



Hmm. Is it just me, or does the current "git cvsimport" have new problems:

	[torvalds@merom git]$ git cvsimport -d ~/CVS gentoo-x86

causes

	Committing initial tree 34bd3dcd4bfd79bad35ce3fb08b2e21108195db8
	Server has gone away while fetching BUGS-TODO 1.1, retrying...
	Retry failed at /home/torvalds/bin/git-cvsimport line 366, <GEN2656> line 9.

and that's it for the import.

I don't see what would have caused it in the changes, but it definitely 
worked earlier..

		Linus

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  7:00                                     ` [PATCH 2/2] cvsimport: cleanup commit function Jeff King
                                                         ` (2 preceding siblings ...)
  2006-05-23  8:13                                       ` [PATCH 2/2] cvsimport: cleanup commit function Martin Langhoff
@ 2006-05-23 17:47                                       ` Morten Welinder
  2006-05-23 20:59                                         ` Jeff King
  3 siblings, 1 reply; 83+ messages in thread
From: Morten Welinder @ 2006-05-23 17:47 UTC (permalink / raw)
  To: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

Why run "env" and not just muck with %ENV?

M.


> +       my $pid = open2(my $commit_read, my $commit_write,
> +               'env',
> +               "GIT_AUTHOR_NAME=$author_name",
> +               "GIT_AUTHOR_EMAIL=$author_email",
> +               "GIT_AUTHOR_DATE=$commit_date",
> +               "GIT_COMMITTER_NAME=$author_name",
> +               "GIT_COMMITTER_EMAIL=$author_email",
> +               "GIT_COMMITTER_DATE=$commit_date",
> +               'git-commit-tree', $tree, @commit_args);
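
A sketch of what the %ENV variant could look like, under the assumption
that the child simply inherits the parent's environment. The helper name
run_with_env and the sample command are invented for the example; this
is not code from the patch.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: run a command with extra environment variables
# without spawning "env". local() restores %ENV when the sub returns.
sub run_with_env {
	my ($env, @cmd) = @_;
	local @ENV{keys %$env} = values %$env;
	my $pid = open2(my $out, my $in, @cmd);
	close($in);                 # nothing to send in this example
	my @lines = <$out>;
	close($out);
	waitpid($pid, 0);
	die "$cmd[0] failed: $?\n" if $?;
	return @lines;
}

print run_with_env(
	{ GIT_AUTHOR_NAME => 'A U Thor' },
	'sh', '-c', 'echo "author: $GIT_AUTHOR_NAME"',
);
```

This saves one exec per commit, at the cost of having to scope the
change to %ENV carefully.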

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 16:50                                         ` Linus Torvalds
@ 2006-05-23 19:36                                           ` Linus Torvalds
  2006-05-23 20:25                                             ` Junio C Hamano
  2006-05-23 20:29                                             ` Martin Langhoff
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-23 19:36 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Junio C Hamano, Matthias Urlichs, git



On Tue, 23 May 2006, Linus Torvalds wrote:
> 
> Hmm. Is it just me, or does the current "git cvsimport" have new problems:
> 
> 	[torvalds@merom git]$ git cvsimport -d ~/CVS gentoo-x86
> 
> causes
> 
> 	Committing initial tree 34bd3dcd4bfd79bad35ce3fb08b2e21108195db8
> 	Server has gone away while fetching BUGS-TODO 1.1, retrying...
> 	Retry failed at /home/torvalds/bin/git-cvsimport line 366, <GEN2656> line 9.
> 
> and that's it for the import.
> 
> I don't see what would have caused it in the changes, but it definitely 
> worked earlier..

Martin, that problem seems to go away when I initialize $res to 0 in 
_fetchfile. 

I don't know perl, and maybe local variables are pre-initialized to empty. 
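
On the Perl question, a short sketch of the general behaviour. It says
nothing about what _fetchfile does with $res: a variable declared with
"my" and never assigned is undef, which acts as 0 in numeric context and
as the empty string in string context, normally with a warning under
"use warnings".

```perl
use strict;
use warnings;

my $res;    # declared but never assigned, so undef
print defined($res) ? "defined\n" : "undef\n";
{
	# Silence the "uninitialized value" warnings for the demonstration.
	no warnings 'uninitialized';
	print "numeric: ", $res + 0, "\n";      # undef counts as 0
	print "string:  '", $res . "", "'\n";   # and as the empty string
}
```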

It's entirely possible that the fact that it now seems to work for me is 
purely timing-related, since I also ended up using "-P cvsps-output" to 
avoid having a huge cvsps binary in memory at the same time.

		Linus "perl illiterate" Torvalds

* Re: irc usage..
  2006-05-22 18:03                           ` Linus Torvalds
  2006-05-22 19:03                             ` Matthias Lederhofer
@ 2006-05-23 20:19                             ` Jakub Narebski
  1 sibling, 0 replies; 83+ messages in thread
From: Jakub Narebski @ 2006-05-23 20:19 UTC (permalink / raw)
  To: git

Linus Torvalds wrote:
 
> [...] people _should_ realize that removing objects is very very special.
> Whether it's done by "git prune-packed" or "git prune", that's a very
> dangerous operation. "git prune" a lot more so than "git prune-packed",
> of course (in fact, you should _never_ run "git prune" on a repository
> that is active - you _will_ corrupt it).

Would it be possible to make the 'git prune' command safe against
repository corruption, even if some information (like the result of
'git add') might be lost? Or does _corruption_ here mean only that some
recoverable information is lost? One cannot always use a "one repository
per developer" workflow.


One solution would be to use a reader/writer lock (filesystem
semaphore), with each command that modifies the repository taking the
lock, and git-prune waiting on the lock until no one is accessing the
repository. Of course, the problem is with OSes and filesystems that do
not support locking, and with stale locks...
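
The locking idea above can be sketched in Perl with the core flock()
call. This is only an illustration under stated assumptions, not
something git does: the helper name with_repo_lock and the lock file
path are hypothetical. Ordinary commands would take LOCK_SH and a
prune-like command would take LOCK_EX, which blocks until every other
holder has released the lock.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: run $code while holding an advisory lock on
# $lockfile. Pass LOCK_SH for ordinary commands and LOCK_EX for a
# prune-like command that must have the repository to itself.
sub with_repo_lock {
    my ($lockfile, $mode, $code) = @_;
    # Open read/append so the file is created if missing and never truncated.
    open(my $fh, '+>>', $lockfile) or die "cannot open $lockfile: $!";
    flock($fh, $mode) or die "cannot lock $lockfile: $!";
    my @result = $code->();
    close($fh);                    # closing the handle releases the lock
    return @result;
}

my $lockfile = "/tmp/repo-lock-demo.$$";    # assumed location, demo only
my ($status) = with_repo_lock($lockfile, LOCK_EX, sub { return 'pruned' });
print "exclusive section returned: $status\n";
unlink $lockfile;
```

A flock() lock goes away when the holding process exits, which helps
with stale locks. It does nothing for filesystems where locking is
unavailable or unreliable, and it only works if every command
cooperates by taking the lock.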

A second solution would be to [optionally] wait until no process is
accessing the repository, copy the repository to some safe place,
[optionally] calculate a checksum, prune, [optionally] check whether the
repository was modified in the meantime and either abort or repeat, and
finally copy the pruned repository back.

-- 
Jakub Narebski
Warsaw, Poland

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 19:36                                           ` Linus Torvalds
@ 2006-05-23 20:25                                             ` Junio C Hamano
  2006-05-23 20:29                                             ` Martin Langhoff
  1 sibling, 0 replies; 83+ messages in thread
From: Junio C Hamano @ 2006-05-23 20:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

>> 	Committing initial tree 34bd3dcd4bfd79bad35ce3fb08b2e21108195db8
>> 	Server has gone away while fetching BUGS-TODO 1.1, retrying...
>...
> Martin, that problem seems to go away when I initialize $res to 0 in 
> _fetchfile. 
>
> I don't know perl, and maybe local variables are pre-initialized to empty. 

When a new file that is empty is created, sub _line would call
sub _fetchfile with $cnt == 0, and it can return $res which
is initialized to 'undef'.  That explains why sub file says
$self->_line() returned an undef and I think what you did is the
right fix.
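
A minimal sketch of that effect, assuming a simplified stand-in for
_fetchfile that only counts bytes (the real sub reads from the CVS
socket; the names fetch_uninitialised and fetch_initialised are
hypothetical). When $cnt is 0 the loop body never runs, so an
uninitialised $res comes back as undef, while the patched form returns
a defined 0.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for _fetchfile: they "consume" $cnt bytes in
# chunks and return the running total. No CVS socket is involved.
sub fetch_uninitialised {
    my ($cnt) = @_;
    my $res;                       # stays undef if the loop never runs
    my $bufsize = 1024;
    while ($cnt) {
        my $chunk = $bufsize > $cnt ? $cnt : $bufsize;
        $res += $chunk;
        $cnt -= $chunk;
    }
    return $res;
}

sub fetch_initialised {
    my ($cnt) = @_;
    my $res = 0;                   # defined even for a 0-length file
    my $bufsize = 1024;
    while ($cnt) {
        my $chunk = $bufsize > $cnt ? $cnt : $bufsize;
        $res += $chunk;
        $cnt -= $chunk;
    }
    return $res;
}

for my $size (0, 2500) {
    my $old = fetch_uninitialised($size);
    my $new = fetch_initialised($size);
    printf "size %d: uninitialised=%s initialised=%s\n",
        $size, defined $old ? $old : 'undef', $new;
}
```

A caller that treats undef as "the fetch failed" would then take a
0-length file for a dropped connection, which matches the "Server has
gone away" retry seen in the report.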

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 19:36                                           ` Linus Torvalds
  2006-05-23 20:25                                             ` Junio C Hamano
@ 2006-05-23 20:29                                             ` Martin Langhoff
  2006-05-23 21:10                                               ` Jeff King
  1 sibling, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-23 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Matthias Urlichs, git

On 5/24/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Martin, that problem seems to go away when I initialize $res to 0 in
> _fetchfile.
>
> I don't know perl, and maybe local variables are pre-initialized to empty.
>
> It's entirely possible that the fact that it now seems to work for me is
> purely timing-related, since I also ended up using "-P cvsps-output" to
> avoid having a huge cvsps binary in memory at the same time.

Strange! Cannot repro here with v5.8.8 (debian/etch 5.8.8-4) but
initialising it doesn't hurt, so let's do it:

diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index ace7087..abbfd0b 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -371,7 +371,7 @@ sub file {
 }
 sub _fetchfile {
        my ($self, $fh, $cnt) = @_;
-       my $res;
+       my $res = 0;
        my $bufsize = 1024 * 1024;
        while($cnt) {
            if ($bufsize > $cnt) {

cheers,


martin

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23  8:24                                         ` Junio C Hamano
@ 2006-05-23 20:32                                           ` Martin Langhoff
  0 siblings, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-23 20:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 5/23/06, Junio C Hamano <junkio@cox.net> wrote:
> "Martin Langhoff" <martin.langhoff@gmail.com> writes:
>
> > Jeff,
> >
> > good stuff -- aiming at exactly the things that had been nagging me.
> > Some minor notes on top of what junio's mentioned...
> >
> >> +    die "unable to open $f: $!" unless $! == POSIX::ENOENT;
> >> +    return undef;
> >
> > Heh. Is that the return of the living dead?
>
> Note the trailing "unless" there.

Of course. I had actually missed the closing quotes, and thought the
error msg wanted to talk about POSIX. 'twas late in the day, seems
like most of my comments in this email were rather stoopid.

> >> +sub update_index (\@\@) {
> >> +       my $old = shift;
> >> +       my $new = shift;
> >
> > Would it not make more sense to just pass them as plain parameters?
>
> Meaning...?  Perl5 can pass only one flat array, so the above is
> a standard way to pass two arrays.

Meaning I am stupid :(
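
For readers who have not met the (\@\@) prototype, a small sketch of
what it buys. The sub names count_both and count_flat are made up for
this example and are not part of cvsimport.

```perl
use strict;
use warnings;

# The (\@\@) prototype makes perl pass references to the two arrays
# named at the call site, so they arrive as two separate arguments.
sub count_both (\@\@) {
    my $old = shift;
    my $new = shift;
    return (scalar @$old, scalar @$new);
}

# Without a prototype the two arrays are flattened into a single @_,
# and the boundary between them is lost.
sub count_flat {
    return scalar @_;
}

my @old = ('a', 'b');
my @new = ('c', 'd', 'e');

my ($n_old, $n_new) = count_both(@old, @new);
print "$n_old old, $n_new new\n";
print count_flat(@old, @new), " flattened\n";
```

The prototype only takes effect for calls compiled after the sub is
declared, which is why the definition comes before the call here.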

> >> +       print "Committed patch $patchset ($branch $commit_date)\n" if
> >
> > Given that we have that -- should we remember it and avoid re-reading
> > the headref from disk? A %seenheads cache would save us 99.9% of the
> > hassle.
> >
> > In related news, I've dealt with file reads from the socket being
> > memorybound. Should merge ok.
>
> Merged OK, and I think your last suggestion makes sense.  I'll
> go to bed after pushing out Jeff's two patches and yours.

I'll look into caching headrefs tonight if no one beats me to it.




martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 17:47                                       ` Morten Welinder
@ 2006-05-23 20:59                                         ` Jeff King
  2006-05-23 23:41                                           ` Junio C Hamano
  0 siblings, 1 reply; 83+ messages in thread
From: Jeff King @ 2006-05-23 20:59 UTC (permalink / raw)
  To: Morten Welinder; +Cc: Martin Langhoff, Junio C Hamano, Matthias Urlichs, git

On Tue, May 23, 2006 at 01:47:01PM -0400, Morten Welinder wrote:

> Why run "env" and not just muck with %ENV?
> >+       my $pid = open2(my $commit_read, my $commit_write,
> >+               'env',
> >+               "GIT_AUTHOR_NAME=$author_name",
> >+               "GIT_AUTHOR_EMAIL=$author_email",
> >+               "GIT_AUTHOR_DATE=$commit_date",
> >+               "GIT_COMMITTER_NAME=$author_name",
> >+               "GIT_COMMITTER_EMAIL=$author_email",
> >+               "GIT_COMMITTER_DATE=$commit_date",
> >+               'git-commit-tree', $tree, @commit_args);

Oops, that's an obvious fork optimization that I should have caught.
Patch is below. Note that this will now affect the environment of all
sub-processes, but it shouldn't matter since we reset it right before
commit. However, if anyone is worried, we can stash the old %ENV in
another hash temporarily.
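
One way to do that without an explicit second hash is perl's local,
which restores the previous values when the enclosing block exits. The
following is a sketch only, not what the patch below does: the helper
names with_commit_env and child_sees are hypothetical, and a child
perl stands in for git-commit-tree.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with the six commit variables set,
# then let "local" put the previous values back on return.
sub with_commit_env {
    my ($name, $email, $date, $code) = @_;
    local $ENV{GIT_AUTHOR_NAME}     = $name;
    local $ENV{GIT_AUTHOR_EMAIL}    = $email;
    local $ENV{GIT_AUTHOR_DATE}     = $date;
    local $ENV{GIT_COMMITTER_NAME}  = $name;
    local $ENV{GIT_COMMITTER_EMAIL} = $email;
    local $ENV{GIT_COMMITTER_DATE}  = $date;
    return $code->();    # children started in here inherit the values
}

# Demo child: a second perl that prints one environment variable.
sub child_sees {
    my ($var) = @_;
    open(my $fh, '-|', $^X, '-e',
         'print defined $ENV{$ARGV[0]} ? $ENV{$ARGV[0]} : "unset"', $var)
        or die "cannot fork: $!";
    local $/;            # slurp the whole output
    my $out = <$fh>;
    close($fh);
    return $out;
}

my $seen = with_commit_env('A U Thor', 'author@example.com',
                           '+0000 2006-05-23 20:59:00',
                           sub { child_sees('GIT_AUTHOR_DATE') });
print "child saw GIT_AUTHOR_DATE=$seen\n";
```

This keeps the single fork while leaving the parent's environment as
it was for any later sub-process.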

-Peff

PS What is the preferred format for throwing patches into replies like
this? Putting the patch at the end (as here) or throwing the reply
comments in the ignored section near the diffstat?

---
cvsimport: set up commit environment in perl instead of using env

---

44c4a9f67322302ca49146a7c143c07ea67da366
 git-cvsimport.perl |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

44c4a9f67322302ca49146a7c143c07ea67da366
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 41ee9a6..83d7d3c 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -618,14 +618,13 @@ sub commit {
 	}
 
 	my $commit_date = strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date));
+	$ENV{GIT_AUTHOR_NAME} = $author_name;
+	$ENV{GIT_AUTHOR_EMAIL} = $author_email;
+	$ENV{GIT_AUTHOR_DATE} = $commit_date;
+	$ENV{GIT_COMMITTER_NAME} = $author_name;
+	$ENV{GIT_COMMITTER_EMAIL} = $author_email;
+	$ENV{GIT_COMMITTER_DATE} = $commit_date;
 	my $pid = open2(my $commit_read, my $commit_write,
-		'env',
-		"GIT_AUTHOR_NAME=$author_name",
-		"GIT_AUTHOR_EMAIL=$author_email",
-		"GIT_AUTHOR_DATE=$commit_date",
-		"GIT_COMMITTER_NAME=$author_name",
-		"GIT_COMMITTER_EMAIL=$author_email",
-		"GIT_COMMITTER_DATE=$commit_date",
 		'git-commit-tree', $tree, @commit_args);
 
 	# compatibility with git2cvs
-- 
1.3.3.g40505-dirty



^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 20:29                                             ` Martin Langhoff
@ 2006-05-23 21:10                                               ` Jeff King
  2006-05-23 21:13                                                 ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Jeff King @ 2006-05-23 21:10 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Linus Torvalds, Junio C Hamano, Matthias Urlichs, git

On Wed, May 24, 2006 at 08:29:07AM +1200, Martin Langhoff wrote:

> Strange! Cannot repro here with v5.8.8 (debian/etch 5.8.8-4) but
> initialising it doesn't hurt, so let's do it:

I can reproduce with debian perl 5.8.8-4. The bug is only triggered by
0-length files, so presumably your test repo doesn't have any.

-Peff

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 21:10                                               ` Jeff King
@ 2006-05-23 21:13                                                 ` Martin Langhoff
  0 siblings, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-23 21:13 UTC (permalink / raw)
  To: Martin Langhoff, Linus Torvalds, Junio C Hamano, Matthias Urlichs,
	git

On 5/24/06, Jeff King <peff@peff.net> wrote:
> On Wed, May 24, 2006 at 08:29:07AM +1200, Martin Langhoff wrote:
>
> > Strange! Cannot repro here with v5.8.8 (debian/etch 5.8.8-4) but
> > initialising it doesn't hurt, so let's do it:
>
> I can reproduce with debian perl 5.8.8-4. The bug is only triggered by
> 0-length files, so presumably your test repo doesn't have any.

Given that we are all working off the gentoo repo here, it means that
my machine is slower than Linus' unreleased Intel box. And that I am
too impatient...

In any case, the fix is correct as Junio points out.

cheers,


martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 20:59                                         ` Jeff King
@ 2006-05-23 23:41                                           ` Junio C Hamano
  2006-05-24  9:52                                             ` Jeff King
  0 siblings, 1 reply; 83+ messages in thread
From: Junio C Hamano @ 2006-05-23 23:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Morten Welinder, Martin Langhoff, Matthias Urlichs, git

Jeff King <peff@peff.net> writes:

> On Tue, May 23, 2006 at 01:47:01PM -0400, Morten Welinder wrote:
>
>> Why run "env" and not just muck with %ENV?
>> >+       my $pid = open2(my $commit_read, my $commit_write,
>> >+               'env',
>> >+               "GIT_AUTHOR_NAME=$author_name",
>> >+               "GIT_AUTHOR_EMAIL=$author_email",
>> >+               "GIT_AUTHOR_DATE=$commit_date",
>> >+               "GIT_COMMITTER_NAME=$author_name",
>> >+               "GIT_COMMITTER_EMAIL=$author_email",
>> >+               "GIT_COMMITTER_DATE=$commit_date",
>> >+               'git-commit-tree', $tree, @commit_args);
>
> Oops, that's an obvious fork optimization that I should have caught.

Are you two talking about running git-commit-tree via env being two
fork-execs instead of just one?  Does that make a measurable
difference?

Not that I have anything against the updated code, but I do not
particularly think it is such a big issue.

> PS What is the preferred format for throwing patches into replies like
> this? Putting the patch at the end (as here) or throwing the reply
> comments in the ignored section near the diffstat?

You could do it either way.  Although I personally find the
former easier to read (it meshes well with the "do not top post"
mantra), it appears many other people find that the cover letter
material should come after the first '---' separator.

If you append the patch to your message, btw, you would need to
realize that the receiving end needs to edit your message to
remove the top part before running "git am" to apply.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] cvsimport: cleanup commit function
  2006-05-23 23:41                                           ` Junio C Hamano
@ 2006-05-24  9:52                                             ` Jeff King
  0 siblings, 0 replies; 83+ messages in thread
From: Jeff King @ 2006-05-24  9:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Morten Welinder, Martin Langhoff, Matthias Urlichs, git

On Tue, May 23, 2006 at 04:41:33PM -0700, Junio C Hamano wrote:

> Are you two talking about running git-commit-tree via env being two
> fork-execs instead of just one?  Does that make a measurable
> difference?

Yes, that's what I was talking about. No, probably not a huge
difference. I did some performance measurements of all of the recent
cvsimport changes on a small-ish personal repo (I don't have the gentoo
repo). The results were not significant (<= 1% improvement for each
change).  I would expect some of the changes (index-info, fetchfile) to
have an impact on a repo with different characteristics (like the gentoo
one).

-Peff

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-22 21:48                               ` Donnie Berkholz
@ 2006-05-29 21:54                                 ` Donnie Berkholz
  2006-05-29 22:21                                   ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-29 21:54 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Martin Langhoff, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 406 bytes --]

Donnie Berkholz wrote:
> Linus Torvalds wrote:
>> The latest stable CVS release is 1.11.21, I think: you seem to be running 
>> the "development" version (1.12.x).
> 
> Backed down to the 1.11 series, things seem to be going fine so far.

Finally hit an OOM sometime in the past day (yep, a week later) =\. Not
sure whether it was cvsimport or cvs. Anyone else had more luck?

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-29 21:54                                 ` Donnie Berkholz
@ 2006-05-29 22:21                                   ` Martin Langhoff
  2006-05-29 22:32                                     ` Donnie Berkholz
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-29 22:21 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> Donnie Berkholz wrote:
> > Linus Torvalds wrote:
> >> The latest stable CVS release is 1.11.21, I think: you seem to be running
> >> the "development" version (1.12.x).
> >
> > Backed down to the 1.11 series, things seem to be going fine so far.
>
> Finally hit an OOM sometime in the past day (yep, a week later) =\. Not
> sure whether it was cvsimport or cvs. Anyone else had more luck?

It seemed like it had finished on the machine I was running it, and I
assumed it was alright in yours too. Looking closer it only made it
till April 2004 -- but it may have been killed by a sysadmin, the
captured log talks about 'signal 9', I have no idea what the OOM
sends.

It had done 285070 of 343822 patchsets.

Have you dropped the -a from the git-repack invocation? That should
help. Try also Linus' patch for git-rev-list. The other thing hurting
us is that the commits are _huge_. I wonder how you guys were managing
this with CVS. Now _this_ explains why cvsimport grows humongous.

I'll try to rework the commit loop so that we don't need to hold all
the filenames in memory. It seems to be choking with the commits after
April 2004. But that will have to wait till tonight.

cheers,



martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-29 22:21                                   ` Martin Langhoff
@ 2006-05-29 22:32                                     ` Donnie Berkholz
  2006-05-30  0:19                                       ` Martin Langhoff
                                                         ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-29 22:32 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 2375 bytes --]

Martin Langhoff wrote:
> On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
>> Finally hit an OOM sometime in the past day (yep, a week later) =\. Not
>> sure whether it was cvsimport or cvs. Anyone else had more luck?
> 
> It seemed like it had finished on the machine I was running it, and I
> assumed it was alright in yours too. Looking closer it only made it
> till April 2004 -- but it may have been killed by a sysadmin, the
> captured log talks about 'signal 9', I have no idea what the OOM
> sends.

Looking closer, I see that the memory suckers do appear to be git, from
dmesg:

Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
Out of memory: Killed process 17231 (git-rev-list).

Just ends like this:

Tree ID 2cc632e5e1d3a430a2cc891bf33c4a12f19a4d0e
Parent ID ad92d7073a52458e0581633bbd8ccbbec838d9e6
Committed patch 249100 (origin 2005-08-20 05:05:58)
Commit ID 28941f00d714f57ab49f1fd725d1c3ce8a5d0b93
Fetching sys-kernel/ck-sources/ChangeLog   v 1.113
Update sys-kernel/ck-sources/ChangeLog: 25425 bytes
Fetching sys-kernel/ck-sources/Manifest   v 1.164
Update sys-kernel/ck-sources/Manifest: 252 bytes
Delete sys-kernel/ck-sources/ck-sources-2.6.12_p5-r1.ebuild
Fetching sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild   v 1.1
New sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild: 1438 bytes
Delete sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p5-r1
Fetching sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6   v 1.1
New sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6: 279 bytes
Can't fork at /usr/bin/git-cvsimport line 592, <CVS> line 3810053.
cat: write error: Broken pipe

> It had done 285070 of 343822 patchsets.
> 
> Have you dropped the -a from the git-repack invocation? That should
> help. Try also Linus' patch for git-rev-list. The other thing hurting
> us is that the commits are _huge_. I wonder how you guys were managing
> this with CVS. Now _this_ explains why cvsimport grows humongous.

I wasn't running with a version that did repacks; I just suspended the
cvsimport a couple of times and ran a repack manually.

> I'll try to rework the commit loop so that we don't need to hold all
> the filenames in memory. It seems to be choking with the commits after
> April 2004. But that will have to wait till tonight.

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-29 22:32                                     ` Donnie Berkholz
@ 2006-05-30  0:19                                       ` Martin Langhoff
  2006-05-30  5:31                                         ` Donnie Berkholz
  2006-05-30  0:43                                       ` Linus Torvalds
  2006-05-30 22:31                                       ` Martin Langhoff
  2 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-30  0:19 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> Looking closer, I see that the memory suckers do appear to be git, from
> dmesg:
>
> Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
> Out of memory: Killed process 17231 (git-rev-list).

That would mean that you do have Linus' patch then. Grep cvsimport for
repack and remove the -a -- and consider using his recent patch to
rev-list.

My dmesg talks about an earlier cvs segfault. Nasty tree you have here
-- it's breaking all sorts of things... and teaching us a thing or two
about the import process.

> Committed patch 249100 (origin 2005-08-20 05:05:58)

Hmmm? How can you be at patch 249100 and still be a good year ahead of
me? Have you told cvsps to cut off old history?

Another thing I found is that this import uses a lot of $TMPDIR, so if
your TMPDIR is small, you'll hit all sorts of problems.

cheers,



martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-29 22:32                                     ` Donnie Berkholz
  2006-05-30  0:19                                       ` Martin Langhoff
@ 2006-05-30  0:43                                       ` Linus Torvalds
  2006-05-30 22:31                                       ` Martin Langhoff
  2 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2006-05-30  0:43 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Martin Langhoff, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Mon, 29 May 2006, Donnie Berkholz wrote:
> 
> Looking closer, I see that the memory suckers do appear to be git, from
> dmesg:
> 
> Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
> Out of memory: Killed process 17231 (git-rev-list).

Sounds like you had the "git repack -a -d" thing in your cvsimport.

The current git rev-list should use only about a third of the memory of 
the one you used, so hopefully you could just update your git version, and 
then continue with the "git cvsimport" without having to start all over.

		Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-30  0:19                                       ` Martin Langhoff
@ 2006-05-30  5:31                                         ` Donnie Berkholz
  2006-05-30  6:01                                           ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-30  5:31 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

Martin Langhoff wrote:
> On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
>> Looking closer, I see that the memory suckers do appear to be git, from
>> dmesg:
>>
>> Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
>> Out of memory: Killed process 17231 (git-rev-list).
> 
> That would mean that you do have Linus' patch then. Grep cvsimport for
> repack and remove the -a -- and consider using his recent patch to
> rev-list.

You certainly would think so, and I did as well, but available evidence
indicates otherwise. I'm not sure how the repack got in there.

donnie@supernova ~ $ type git-cvsimport
git-cvsimport is /usr/bin/git-cvsimport
donnie@supernova ~ $ grep repack /usr/bin/git-cvsimport
donnie@supernova ~ $

All I can think of is that I somehow OOM'd when I manually ran a repack
and didn't notice it. But that should've at least made me unable to
resume the cvsimport process, which happily kept chugging along later on.

> My dmesg talks about an earlier cvs segfault. Nasty tree you have here
> -- it's breaking all sorts of things... and teaching us a thing or two
> about the import process.
> 
>> Committed patch 249100 (origin 2005-08-20 05:05:58)
> 
> Hmmm? How can you be at patch 249100 and still be a good year ahead of
> me? Have you told cvsps to cut off old history?

Nope. I ran the exact cvsps flags you posted earlier to create it.

Thanks,
Donnie


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 252 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-30  5:31                                         ` Donnie Berkholz
@ 2006-05-30  6:01                                           ` Martin Langhoff
  0 siblings, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-05-30  6:01 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> All I can think of is that I somehow OOM'd when I manually ran a repack
> and didn't notice it. But that should've at least made me unable to
> resume the cvsimport process, which happily kept chugging along later on.

Sounds likely -- and cvsimport restarts gracefully, though you might want to do

   git checkout HEAD

to get a usable checkout if the very first import failed. However, the
default head is master, and what you want to look at is origin or
whatever you passed as your -o parameter. I use cvshead normally, so I
do

   git log cvshead

> > My dmesg talks about an earlier cvs segfault. Nasty tree you have here
> > -- it's breaking all sorts of things... and teaching us a thing or two
> > about the import process.
> >
> >> Committed patch 249100 (origin 2005-08-20 05:05:58)
> >
> > Hmmm? How can you be at patch 249100 and still be a good year ahead of
> > me? Have you told cvsps to cut off old history?
>
> Nope. I ran the exact cvsps flags you posted earlier to create it.

Oh, that was an earlier PEBKAC at my end: I did git log HEAD instead
of git log cvshead. My import is now at 293145 (cvshead +0000
2005-12-25 12:24:42), which looks promising.

cheers,


martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-29 22:32                                     ` Donnie Berkholz
  2006-05-30  0:19                                       ` Martin Langhoff
  2006-05-30  0:43                                       ` Linus Torvalds
@ 2006-05-30 22:31                                       ` Martin Langhoff
  2006-05-30 23:07                                         ` Linus Torvalds
  2 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-30 22:31 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> Martin Langhoff wrote:
> > On 5/30/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> >> Finally hit an OOM sometime in the past day (yep, a week later) =\. Not
> >> sure whether it was cvsimport or cvs. Anyone else had more luck?

With the latest cvsimport in Junio's repo, a lot of RAM and a bit of patience...

  gitview
  http://git.catalyst.net.nz/gitweb?p=gentoo.git;a=summary

  fetchable
  http://git.catalyst.net.nz/git/gentoo.git#cvshead

Still pushing it; it will be there in a minute or so. The packed repo
weighs about 660MB. Not too bad given the size of the project and the
number of commits.


martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-30 22:31                                       ` Martin Langhoff
@ 2006-05-30 23:07                                         ` Linus Torvalds
  2006-05-31  1:04                                           ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2006-05-30 23:07 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin



On Wed, 31 May 2006, Martin Langhoff wrote:
> 
>  gitview
>  http://git.catalyst.net.nz/gitweb?p=gentoo.git;a=summary

Heh. I think you should enable caching in your apache config. 

And maybe we should make that part of the gitweb docs. Without a caching 
web-server, gitweb is pretty slow, but it caches _beautifully_.

That gentoo repo has a lot of "duplicate" commits that cvsps will mark as 
two separate commits because there's one commit for the files, and one 
commit for whatever the "Manifest" file is. I wonder if those commits 
should generally be merged or something. 

That said, things like that are most easily fixed as a git->git update 
(along with adding name translation), which can avoid re-writing the 
trees.

			Linus

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-30 23:07                                         ` Linus Torvalds
@ 2006-05-31  1:04                                           ` Martin Langhoff
  2006-05-31  2:49                                             ` Donnie Berkholz
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-31  1:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Donnie Berkholz, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin

On 5/31/06, Linus Torvalds <torvalds@osdl.org> wrote:
> On Wed, 31 May 2006, Martin Langhoff wrote:
> >
> >  gitview
> >  http://git.catalyst.net.nz/gitweb?p=gentoo.git;a=summary
>
> Heh. I think you should enable caching in your apache config.

I know I should -- but I'm hoping to find the time to rework gitweb a
bit to actually work fast instead. It bothers me that it is so slow on
a basically idle machine, and where I can perform the corresponding
git operations in the commandline in a blink.

And caching is great for really busy sites (aka kernel.org) but
git.catalyst.net.nz only serves a handful of small repos for a small
group of people, and is 99% idle. Should blaze through this stuff.

> That gentoo repo has a lot of "duplicate" commits that cvsps will mark as
> two separate commits because there's one commit for the files, and one
> commit for whatever the "Manifest" file is. I wonder if those commits
> should generally be merged or something.
>
> That said, things like that are most easily fixed as a git->git update
> (along with adding name translation), which can avoid re-writing the
> trees.

Yep, large projects often have good reasons to run custom imports,
merging certain commits, rewriting log messages (like the X.org guys
were doing). It can be done at the cvsimport stage or later -- I think
Pasky has a rewritehistory tool hidden somewhere in Cogito, but I
haven't used it.

cheers,


martin

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: irc usage..
  2006-05-31  1:04                                           ` Martin Langhoff
@ 2006-05-31  2:49                                             ` Donnie Berkholz
  2006-05-31  6:05                                               ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Donnie Berkholz @ 2006-05-31  2:49 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin, Alec Warner

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Martin Langhoff wrote:
> On 5/31/06, Linus Torvalds <torvalds@osdl.org> wrote:
>> That gentoo repo has a lot of "duplicate" commits that cvsps will mark as
>> two separate commits because there's one commit for the files, and one
>> commit for whatever the "Manifest" file is. I wonder if those commits
>> should generally be merged or something.
>>
>> That said, things like that are most easily fixed as a git->git update
>> (along with adding name translation), which can avoid re-writing the
>> trees.
> 
> Yep, large projects often have good reasons to run custom imports,
> merging certain commits, rewriting log messages (like the X.org guys
> were doing). It can be done at the cvsimport stage or later -- I think
> Pasky has a rewritehistory tool hidden somewhere in Cogito, but I
> haven't used it.

We've got a guy who got a Summer of Code project to work on CVS
migration, so this could be something along his lines.

Thanks,
Donnie


* Re: irc usage..
  2006-05-31  2:49                                             ` Donnie Berkholz
@ 2006-05-31  6:05                                               ` Martin Langhoff
  2006-05-31 13:54                                                 ` Alec Warner
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-31  6:05 UTC (permalink / raw)
  To: Donnie Berkholz
  Cc: Linus Torvalds, Yann Dirson, Git Mailing List, Matthias Urlichs,
	Johannes Schindelin, Alec Warner

On 5/31/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
> We've got a guy who got a Summer of Code project to work on CVS
> migration, so this could be something along his lines.

He'll want a fast box to wrangle with this repo ;-)


martin

* Re: irc usage..
  2006-05-31  6:05                                               ` Martin Langhoff
@ 2006-05-31 13:54                                                 ` Alec Warner
  2006-05-31 22:03                                                   ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Alec Warner @ 2006-05-31 13:54 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

Martin Langhoff wrote:
> On 5/31/06, Donnie Berkholz <spyderous@gentoo.org> wrote:
>> We've got a guy who got a Summer of Code project to work on CVS
>> migration, so this could be something along his lines.
> 
> He'll want a fast box to wrangle with this repo ;-)
> 
> 
> martin

I have a dual opteron with 4gb of ram "on loan" from work :)

It still dies though, using git cvsimport or parsecvs.

I talked to Keith Packard about adding support to parsecvs for recording 
the actual changed changesets, but I haven't yet started on implementing 
that since he isn't using cvsps in parsecvs.

I also haven't had a chance to look at the git-cvsimport sources yet, 
was hoping to get to that later this week.

* Re: irc usage..
  2006-05-31 13:54                                                 ` Alec Warner
@ 2006-05-31 22:03                                                   ` Martin Langhoff
  2006-06-01  1:42                                                     ` Alec Warner
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-05-31 22:03 UTC (permalink / raw)
  To: antarus
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

On 6/1/06, Alec Warner <antarus@gentoo.org> wrote:
> I have a dual opteron with 4gb of ram "on loan" from work :)
>
> It still dies though, using git cvsimport or parsecvs.

The machine I am running this on is more constrained than that, and it
doesn't die. It just takes maybe 30 hours. Make sure it's not a bad cvs
binary you got there (the latest from gentoo seems to leak memory).

And if it's still dying... give us some more details ;-)

cheers,


martin

* Re: irc usage..
  2006-05-31 22:03                                                   ` Martin Langhoff
@ 2006-06-01  1:42                                                     ` Alec Warner
  2006-06-01  7:47                                                       ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Alec Warner @ 2006-06-01  1:42 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

Martin Langhoff wrote:
> On 6/1/06, Alec Warner <antarus@gentoo.org> wrote:
> 
>> I have a dual opteron with 4gb of ram "on loan" from work :)
>>
>> It still dies though, using git cvsimport or parsecvs.
> 
> 
> The machine I am running this is more constrained than that, and it
> doesn't die. It just takes maybe 30hs. Make sure it's not a bad cvs
> binary you got there (latest from gentoo seems to leak memory).
> 
> And if it's still dying... give us some more details ;-)
> 
> cheers,
> 
> 
> martin

After reading the whole thread on this, I'm using a git checkout of 
git, cvsps-2.1 and cvs-1.11.12, running overnight in verbose mode with 
screen.  Hopefully will have a repo in the morning ;)

* Re: irc usage..
  2006-06-01  1:42                                                     ` Alec Warner
@ 2006-06-01  7:47                                                       ` Martin Langhoff
  2006-06-05  0:33                                                         ` Alec Warner
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-06-01  7:47 UTC (permalink / raw)
  To: antarus
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

On 6/1/06, Alec Warner <antarus@gentoo.org> wrote:
> After reading the whole thread on this, I've using a git checkout of
> git, cvsps-2.1 and cvs-1.11.12, running overnight in verbose mode with
> screen.  Hopefully will have a repo in the morning ;)

Good stuff. I am rerunning it to prove (and bench) a complete and
uninterrupted import. So far it's done 4hs 30m, footprint grown to
207MB, 49750 commits. So I think it will be done in approx 30hs on
this single-cpu opteron.

Most commits are small, but there is a handful that are downright
massive -- and we hold all the file list in memory, which I think
explains (most of) the memory growth. I've looked into avoiding
holding the whole filelist in memory, but it involves rewriting the
cvsps output parsing loop, which is better left for a rainy day, with
a test case that doesn't take 30hs to resolve.
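
To make the "avoid holding the whole file list in memory" idea concrete, here is a minimal sketch of a streaming parser. git-cvsimport itself is a Perl script and this is not its code; `iter_members` is a hypothetical helper, and it assumes a simplified cvsps layout (a "PatchSet N" line, a "Members:" line, then one "path:old->new" entry per line, ended by a blank line). A log message that happens to contain such lines would confuse it.

```python
def iter_members(lines):
    """Yield (patchset_id, path, old_rev, new_rev) one member at a time.

    `lines` is any iterable of text lines, for example an open pipe from
    cvsps. Nothing is accumulated per patchset, so a commit touching
    thousands of files costs no more memory than a small one.
    """
    patchset_id = None
    in_members = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("PatchSet "):
            patchset_id = int(line.split()[1])
            in_members = False
        elif line.startswith("Members:"):
            in_members = True
        elif in_members:
            entry = line.strip()
            if not entry:
                # A blank line closes the member list.
                in_members = False
                continue
            path, revs = entry.rsplit(":", 1)
            old_rev, new_rev = revs.split("->")
            yield patchset_id, path, old_rev, new_rev
```

A consumer would loop over `iter_members(pipe)` and hand each entry straight to the index update, instead of building the full list first.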

cheers,



martin

* Re: irc usage..
  2006-06-01  7:47                                                       ` Martin Langhoff
@ 2006-06-05  0:33                                                         ` Alec Warner
  2006-06-05  2:06                                                           ` Martin Langhoff
  0 siblings, 1 reply; 83+ messages in thread
From: Alec Warner @ 2006-06-05  0:33 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

Martin Langhoff wrote:
> On 6/1/06, Alec Warner <antarus@gentoo.org> wrote:
> 
>> After reading the whole thread on this, I've using a git checkout of
>> git, cvsps-2.1 and cvs-1.11.12, running overnight in verbose mode with
>> screen.  Hopefully will have a repo in the morning ;)
> 
> 
> Good stuff. I am rerunning it to prove (and bench) a complete an
> uninterrupted import. So far it's done 4hs 30m, footprint grown to
> 207MB, 49750 commits. So I think it will be done in approx 30hs on
> this single-cpu opteron.
> 
> Most commits are small, but there is a handful that are downright
> massive -- and we hold all the file list in memory, which I think
> explains (most of) the memory growth. I've looked into avoiding
> holding the whole filelist in memory, but it involves rewriting the
> cvsps output parsing loop, which is better left for a rainy day, with
> a test case that doesn't take 30hs to resolve.

Ok the box this was running on had issues, so I switched to using 
pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of 
ram and plenty of disk.  The "problem" now is just conversion time...30 
hours and I'm into 2004-09-17...but it's been in 2004 all day, seems 
like most of the commits are in the last three years.  Are there 
architectural issues with doing this in parallel?

Since the repository commits are all in cvs, it should be possible to do 
the work in parallel, since you know what all the commits touch.  The 
concern would be ordering of nodes in the tree; you'd end up building a 
bunch of subtrees and patching them together?

-Alec Warner

* Re: irc usage..
  2006-06-05  0:33                                                         ` Alec Warner
@ 2006-06-05  2:06                                                           ` Martin Langhoff
  2006-06-05  2:36                                                             ` Alec Warner
  0 siblings, 1 reply; 83+ messages in thread
From: Martin Langhoff @ 2006-06-05  2:06 UTC (permalink / raw)
  To: antarus
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

On 6/5/06, Alec Warner <antarus@gentoo.org> wrote:
> Ok the box this was running on had issues, so I switched to using
> pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of
> ram and plenty of disk.  The "problem" now is just converstion time...30
> hours and I'm into 2004-09-17...but it's been in 2004 all day, seems
> like most of the commits are in the last three years.  Are there
> architectural issues with doing this in parallel?

I don't think you can do this in parallel. What I would do is remove
the -a from the git-repack invocation. It does hurt import times quite
a bit -- just do a git-repack -a -d when it's done.

And... having said that, there is still a memory leak somehow,
somewhere. It's been evading me for 2 weeks now, so I feel like an
idiot. Not too bad in general, but it shows clearly in the gentoo and
mozilla imports.

> Since the repository commits are all in cvs, it should be possible to do
> the work in parallel, since you know what all the commits touch.  The
> concern would be ordering of nodes in the tree; you'd end up building a
> bunch of subtrees and patching them together?

Well... parsecvs does a bit of this but in sequential fashion... it
imports all the files first, and then runs through the history
building the tree+commits in order, committing them. It saves a lot of
time in the file imports by parsing the RCS file directly. The
downside is that it must keep a filename+version=>sha1 mapping --
which I think is why parsecvs won't fit in memory until it's changed
to store it on disk somehow ;-)
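
One way "store it on disk somehow" could look is sketched below. This is only an illustration under assumptions: parsecvs is written in C and nothing here reflects its code; the `BlobIndex` class, its table layout and the choice of SQLite are all invented for the example.

```python
import sqlite3


class BlobIndex:
    """Maps (filename, revision) -> sha1, kept in SQLite instead of RAM."""

    def __init__(self, db_path):
        # db_path is a file on disk; ":memory:" also works for experiments.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS blobs ("
            " filename TEXT NOT NULL,"
            " revision TEXT NOT NULL,"
            " sha1 TEXT NOT NULL,"
            " PRIMARY KEY (filename, revision))"
        )

    def put(self, filename, revision, sha1):
        """Record (or overwrite) the blob id for one file revision."""
        self.db.execute(
            "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
            (filename, revision, sha1),
        )

    def get(self, filename, revision):
        """Return the sha1 for a file revision, or None if unknown."""
        row = self.db.execute(
            "SELECT sha1 FROM blobs WHERE filename = ? AND revision = ?",
            (filename, revision),
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self.db.commit()
        self.db.close()
```

The trade-off is the usual one: lookups become disk-backed queries rather than hash lookups, in exchange for a memory footprint that no longer grows with the number of file revisions.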

You are forced to do it in a sequence because cvsps only tells you
about the files added/removed/changed in a commit -- you need the
ancestor to have a view of what the whole tree looked like. The only
room for parallelism I see is to fork off new processes to work on
branches in parallel.
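
The dependency on the ancestor can be shown with a small sketch. It assumes each patchset has been reduced to a list of (path, new_rev) pairs, with None marking a removed file; `replay` is a hypothetical name and not part of git-cvsimport.

```python
def replay(patchsets):
    """Yield the full {path: revision} tree after each patchset, in order.

    A patchset only lists what changed, so the tree for commit N can only
    be built by starting from the tree for commit N-1.
    """
    tree = {}
    for changes in patchsets:
        for path, new_rev in changes:
            if new_rev is None:
                tree.pop(path, None)  # file removed in this patchset
            else:
                tree[path] = new_rev  # file added or changed
        yield dict(tree)  # snapshot of the whole tree at this commit
```

Because every snapshot is derived from the previous one, two patchsets on the same branch cannot be processed independently; separate branches only share state up to their branch point, which is where forking off workers could help.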



martin

* Re: irc usage..
  2006-06-05  2:06                                                           ` Martin Langhoff
@ 2006-06-05  2:36                                                             ` Alec Warner
  2006-06-05  3:49                                                               ` Martin Langhoff
       [not found]                                                               ` <20060605120743.566fb85f.seanlkml@sympatico.ca>
  0 siblings, 2 replies; 83+ messages in thread
From: Alec Warner @ 2006-06-05  2:36 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

Martin Langhoff wrote:
> On 6/5/06, Alec Warner <antarus@gentoo.org> wrote:
> 
>> Ok the box this was running on had issues, so I switched to using
>> pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of
>> ram and plenty of disk.  The "problem" now is just converstion time...30
>> hours and I'm into 2004-09-17...but it's been in 2004 all day, seems
>> like most of the commits are in the last three years.  Are there
>> architectural issues with doing this in parallel?
> 
> 
> I don't think you can do this in parallel. What I would do is remove
> the -a from the git-repack invocation. It does hurt import times quite
> a bit -- just do a git-repack -a -d when it's done.

Only repack at the end then? disk space isn't an issue here so I'll give 
that a shot.

> 
> And... having said that, there is still a memory leak somehow,
> somewhere. It's been evading me for 2 weeks now, so I feel an idiot
> now. Not too bad in general, but it shows clearly in the gentoo and
> mozilla imports.

30565 antarus   17   0  470m 456m 1640 S   14 11.6 234:23.38 git-cvsimport
30566 antarus   16   0 6753m 147m  752 S    7  3.7 120:27.06 cvs

I'm on cvs-1.11.12 and the git version of git

> You are forced to do it in a sequence because cvsps only tells you
> about the files added/removed/changed in a commit -- you need the
> ancestor to have a view of what the whole tree looked like. The only
> room for parallelism I see is to fork off new processes to work on
> branches in parallel.

Not helpful in the Gentoo case, since we only have one branch; minus an 
accident when a dev branched gentoo-x86 a while back ;)

I'll keep chugging on this one; it won't be the final import as I 
haven't used the complete Authors file, so I will try the repacking 
optimization next time I do an import.

-Alec Warner

* Re: irc usage..
  2006-06-05  2:36                                                             ` Alec Warner
@ 2006-06-05  3:49                                                               ` Martin Langhoff
       [not found]                                                               ` <20060605120743.566fb85f.seanlkml@sympatico.ca>
  1 sibling, 0 replies; 83+ messages in thread
From: Martin Langhoff @ 2006-06-05  3:49 UTC (permalink / raw)
  To: antarus
  Cc: Donnie Berkholz, Linus Torvalds, Yann Dirson, Git Mailing List,
	Matthias Urlichs, Johannes Schindelin

On 6/5/06, Alec Warner <antarus@gentoo.org> wrote:
> > I don't think you can do this in parallel. What I would do is remove
> > the -a from the git-repack invocation. It does hurt import times quite
> > a bit -- just do a git-repack -a -d when it's done.
>
> Only repack at the end then? disk space isn't an issue here so I'll give
> that a shot.

Not exactly -- by removing the -a from the git-repack invocation what
you get is cheap "partial" packing rather than a full repack. This is
somewhat inefficient disk-wise, perhaps by 10% or so. But full repacks
get more and more expensive as the repo grows.

So you don't need to run git-repack -a -d at the end, but it will be a
good measure to see how compact the packing gets.
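
A toy cost model shows why full repacks get more expensive as the repo grows. It only counts how many objects are written into packs over a whole import and ignores delta search and I/O details; the function name, the fixed repack interval and the one-object-per-commit default are assumptions for illustration, not measurements.

```python
def objects_rewritten(total_commits, repack_every, objects_per_commit=1, full=True):
    """Count objects written into packs during an import (toy model).

    full=True  models `git-repack -a -d`: every repack rewrites everything
               imported so far.
    full=False models `git-repack -d` without -a: only objects added since
               the previous repack are packed.
    """
    written = 0
    for done in range(repack_every, total_commits + 1, repack_every):
        if full:
            written += done * objects_per_commit
        else:
            written += repack_every * objects_per_commit
    return written


# 100000 commits, repacking every 1000 commits:
print(objects_rewritten(100_000, 1_000, full=True))   # 5050000
print(objects_rewritten(100_000, 1_000, full=False))  # 100000
```

Under this model the full-repack work grows quadratically with the length of the import while the partial variant stays linear, which matches the advice to drop -a during the run and do one full repack at the end if desired.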

> > And... having said that, there is still a memory leak somehow,
> > somewhere. It's been evading me for 2 weeks now, so I feel an idiot
> > now. Not too bad in general, but it shows clearly in the gentoo and
> > mozilla imports.
>
> 30565 antarus   17   0  470m 456m 1640 S   14 11.6 234:23.38
> git-cvsimport
> 30566 antarus   16   0 6753m 147m  752 S    7  3.7 120:27.06 cvs
>
> I'm on cvs-1.11.12 and the git version of git

Yep, I see roughly the same. It grows slowly and I don't know why :(

> I'll keep chugging on this one; it won't be the final import as I
> haven't used the complete Authors file, so I will try the repacking
> optimization next time I do an import.

Cool. If it dies for any reason, just do

  git-update-ref refs/heads/master refs/heads/origin
  git-update-ref HEAD origin
  git-checkout

You only need to do this the first time -- after that, the core heads
are set. Rerun the script and it will pick up where it left off. If it
dies again, just do git-checkout to see the latest files.

(Above, replace origin with your -o option if you are using it. I
normally use -o cvshead.)



martin

* Re: irc usage..
       [not found]                                                               ` <20060605120743.566fb85f.seanlkml@sympatico.ca>
@ 2006-06-05 16:07                                                                 ` Sean
  0 siblings, 0 replies; 83+ messages in thread
From: Sean @ 2006-06-05 16:07 UTC (permalink / raw)
  To: antarus
  Cc: martin.langhoff, spyderous, torvalds, ydirson, git, smurf,
	Johannes.Schindelin

On Sun, 04 Jun 2006 22:36:44 -0400
Alec Warner <antarus@gentoo.org> wrote:

> I'll keep chugging on this one; it won't be the final import as I 
> haven't used the complete Authors file, so I will try the repacking 
> optimization next time I do an import.

Hi Alec,

You may want to go back and do another import for other reasons, but if
the only reason is to fix up the author information it would be _much_
faster to simply rewrite the git commit history.  Cogito has something
called "cg-admin-rewritehist" which should do what you need and there
are other scripts floating around specifically for rewriting just the
author information.
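
The name-translation step of such a rewrite might look like the sketch below. It assumes an authors file with one "login=Full Name <email>" entry per line; `parse_authors` and `translate` are hypothetical helpers, not part of Cogito or git, and the fallback for unknown logins is an assumption of this example.

```python
def parse_authors(lines):
    """Parse 'login=Full Name <email>' lines into {login: (name, email)}."""
    authors = {}
    for raw in lines:
        line = raw.strip()
        if "=" not in line or "<" not in line:
            continue  # skip blank or malformed lines
        login, _, rest = line.partition("=")
        name, _, email = rest.rpartition("<")
        authors[login.strip()] = (name.strip(), email.rstrip(">").strip())
    return authors


def translate(login, authors):
    """Return (name, email) for a CVS login; unknown logins map to themselves."""
    return authors.get(login, (login, login))
```

A history rewrite would call `translate` on each commit's CVS login and write the commit back out with the mapped author; the trees stay untouched, which is why this is so much cheaper than a fresh import.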

HTH,
Sean
