Git development
* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 18:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <Pine.LNX.4.64.0606120958230.5498@g5.osdl.org>

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> Having that many files in a single directory (or two) is a total disaster.
> That said, it works well enough if you don't create new files very often
> (and _preferably_ don't look them up either, although that is effectively
> helped by indexing). I _suspect_ that

I posted to the svn list; they said that 220K files is normal. They told
me to turn on the ext2 dir_index option. Checking my system, I see that
none of the partitions have it turned on, so it must not be the default for
FC5.

I have to unmount the drive to convert existing directories. I can
try doing the file move trick while the process is running, since
new directories will use it.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: [PATCH] gitweb: Supporting caches (was: Adding a `blame' interface.)
From: Florian Forster @ 2006-06-12 17:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Martin Langhoff, git
In-Reply-To: <Pine.LNX.4.64.0606120754460.5498@g5.osdl.org>

On Mon, Jun 12, 2006 at 07:59:39AM -0700, Linus Torvalds wrote:
> The apache setup at least on kernel.org is already set up to do
> caching, as long as the generated headers for the page allow it in the
> first place.

I've actually looked into improving native HTTP caching (mostly for
small sites without reverse proxying) by providing a `Last-Modified'
header where possible and sending a `304 Not Modified' whenever
appropriate.

While it doesn't sound hard, it's next to impossible: a commit's
timestamp doesn't change when a head points to it (or no longer
points to it). Also, displaying the timestamps as `Modified xy
{seconds, minutes, hours, ...} ago' poses a big problem.

(I guess the webserver could use the `If-Modified-Since' header to check
if the displayed time needs to be updated, but if you ask me it's not
worth the effort.)

In short, the `blob', `blob_plain', and `blobdiff' pages could profit
from that because they don't display the head(s) pointing to the current
commit. On the other hand, this is a little inconsistent and could be
considered a bug. So I'll give up on that unless someone has a great
idea how to handle this.
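
For reference, the `Last-Modified' / `304 Not Modified' exchange described
above can be sketched roughly as follows. This is a minimal Python sketch,
not gitweb code: the function name conditional_response and its arguments
are made up for illustration, and it assumes the caller already knows a
trustworthy last-change time in UTC, which is exactly the part that is hard
for pages that show heads or relative timestamps.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware datetime in UTC.
    `if_modified_since` is the raw request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: fall through to a full response
        # Only compare when the client sent a date with an explicit zone.
        if since is not None and since.tzinfo is not None:
            if last_modified <= since:
                return 304, headers
    return 200, headers


if __name__ == "__main__":
    changed = datetime(2006, 6, 12, 17, 57, tzinfo=timezone.utc)
    print(conditional_response(changed))
    print(conditional_response(changed, "Mon, 12 Jun 2006 17:57:00 GMT"))
```

A `blob' page keyed by an object name could use the time it was first
generated as last_modified, since its content never changes.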

Regards,
-octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/

* Re: svn to git, N-squared?
From: Linus Torvalds @ 2006-06-12 17:08 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120944p4deb170ejc2863846685917f6@mail.gmail.com>



On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
> The svn repository was built by cvs2svn, none of the git tools were involved.

Ok, so that part is purely a SVN issue.

Having that many files in a single directory (or two) is a total disaster. 
That said, it works well enough if you don't create new files very often 
(and _preferably_ don't look them up either, although that is effectively 
helped by indexing). I _suspect_ that 

 - the "cvs->svn" import process was probably optimized so that it did one 
   file at a time (your "eight stages" description certainly sounds as if 
   it could do it), and in that case it's entirely possible that that can 
   be done efficiently (ie you still do file creates and lookups in an 
   increasingly big directory, but you do it only _once_ per file, rather 
   than look up old files all the time). So your lookup ratio would be 1:1 
   with the files.

   Doing a git-cvsimport would then do basically random lookups in that 
   _huge_ directory, and instead of reading the files one at a time (and 
   fully) and never again, I assume it opens them, reads one revision, 
   closes it, and then goes on to the next revision, so it will have a 
   much higher lookup ratio (you'd look up every file several times).

 - I suspect the SVN people must be hurting for performance themselves. I 
   guess they don't expect to be able to do 5-10 commits per second, the 
   way git was designed to do. So they optimized the cvs import part, but 
   their actual regular live usage is probably hitting this same directory 
   inefficiency.
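
The difference between the two access patterns can be put into rough
numbers with a toy model. The Python sketch below assumes a directory with
no index, where finding or creating a name means scanning the existing
entries one by one; the function names are made up, and a real filesystem
scans directory blocks rather than individual entries, so only the growth
rate is meaningful here.

```python
def creation_scans(n_files):
    """Entries scanned while creating n_files in one unindexed directory.

    Creating file number i first scans the i entries already present to
    check that the name is free, so the total is 0 + 1 + ... + (n - 1).
    """
    return sum(range(n_files))


def lookup_scans(n_files, lookups_per_file):
    """Entries scanned by later lookups in the full directory.

    Assumes each lookup scans about half of the directory on average.
    """
    return n_files * lookups_per_file * (n_files // 2)


if __name__ == "__main__":
    n = 411_000  # the file count from this thread, treated as one directory
    print("one-time creation:", creation_scans(n))
    for ratio in (1, 5, 20):
        print(f"lookups at {ratio}:1 ->", lookup_scans(n, ratio))
```

Both costs grow with the square of the file count; a higher lookup ratio
multiplies the second one on top of that.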

Of course, the old SVN Berkeley DB usage was probably even worse (not in 
system time, but I'd expect the access patterns within the BDB file to be 
pretty nasty, and probably a lot of user time spent seeking around it). 
But in this particular case, it might even have been better.

Maybe we could teach the SVN people about pack-files? ;)

			Linus

* Re: svn to git, N-squared?
From: Linus Torvalds @ 2006-06-12 16:57 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120932k5b6f7acfra3f3a26168454f47@mail.gmail.com>



On Mon, 12 Jun 2006, Jon Smirl wrote:
> > 
> > 64 files in tmp.
> > But the SVN repository itself has 411,000 files in it. Split between
> > two directories.
> 
> I'm doing all of this on ext3. I have plenty of free disk space so I
> can make another partition and switch to a new file system after I
> install the new RAM. What would be the best one to try? Doing that
> would provide a data point to determine if this is a problem with file
> system performance or the misuse of file systems.

I'm sure there are better filesystems to try for this kind of insane 
scenario, but at the same time, I really cannot imagine that the 411,000 
files is a "normal" thing. There _must_ be some way to have SVN not do 
that in the first place (or git-svnimport).

Is this what happened when the SVN people started using fsfs? 

			Linus

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 16:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <Pine.LNX.4.64.0606120938490.5498@g5.osdl.org>

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> > Is there some pack equivalent for svn that I haven't found yet?
>
> Is this literally what SVN does normally? That's just insane. I mean, even
> git tried to at least hash out the files (and yeah, admittedly even that
> worked less well than I was hoping for, but I at least fixed it within
> just a few weeks through the pack mechanism).
>
> Or is that 411,000 files a result of how git-svnimport does things, rather
> than some basic SVN approach to life: does it perhaps end up checking out
> each file under an individual temporary name?

The svn repository was built by cvs2svn, none of the git tools were involved.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
From: Linus Torvalds @ 2006-06-12 16:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120922g181a5aaal623fd3f29b839f4c@mail.gmail.com>



On Mon, 12 Jun 2006, Jon Smirl wrote:
>
> 64 files in tmp.
> But the SVN repository itself has 411,000 files in it. Split between
> two directories.

Ouch. That sounds like it. 

> Is there some pack equivalent for svn that I haven't found yet?

Is this literally what SVN does normally? That's just insane. I mean, even 
git tried to at least hash out the files (and yeah, admittedly even that 
worked less well than I was hoping for, but I at least fixed it within 
just a few weeks through the pack mechanism).
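
For readers unfamiliar with "hashing out the files": loose objects are
spread over subdirectories named after the first two hex digits of the
object name, so no single directory has to hold every object. A minimal
Python sketch, assuming the layout and blob header used by current git; the
helper name loose_object_path is made up for illustration.

```python
import hashlib


def loose_object_path(content: bytes, obj_type: str = "blob") -> str:
    """Return the path of a loose object relative to the git directory.

    The object name is the SHA-1 of a "<type> <size>\\0" header followed by
    the content; the first two hex digits select one of 256 subdirectories.
    """
    header = f"{obj_type} {len(content)}\0".encode()
    sha = hashlib.sha1(header + content).hexdigest()
    return f"objects/{sha[:2]}/{sha[2:]}"


if __name__ == "__main__":
    print(loose_object_path(b""))
```

With 411,000 loose objects this still leaves roughly 1,600 files per
subdirectory on average, which is part of why packs are the real fix.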

Or is that 411,000 files a result of how git-svnimport does things, rather 
than some basic SVN approach to life: does it perhaps end up checking out 
each file under an individual temporary name?

			Linus

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 16:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120922g181a5aaal623fd3f29b839f4c@mail.gmail.com>

On 6/12/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
> >
> >
> > On Mon, 12 Jun 2006, Jon Smirl wrote:
> > >
> > >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> > > 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
> >
> > Hard to tell, it's obviously got short-lived processes there too that it's
> > not showing, but equally obviously that svnimport script itself is
> > spending an alarming amount of CPU time. I don't think it should do that
> > much processing, but since it's written in perl, I can't read it.
> >
> > Are there any other directories that seem to be growing (eg some temp-file
> > directory where the old files aren't cleaned away?). I can't imagine what
> > else it could be doing in kernel space than simply some silly filesystem
> > operation, but dang it all, Linux filesystems are usually very efficient
> > indeed, unless we're talking huge directories (and if it's not the git
> > object directory any more, it must be something else).
>
> 64 files in tmp.
> But the SVN repository itself has 411,000 files in it. Split between
> two directories.

I'm doing all of this on ext3. I have plenty of free disk space so I
can make another partition and switch to a new file system after I
install the new RAM. What would be the best one to try? Doing that
would provide a data point to determine if this is a problem with file
system performance or the misuse of file systems.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
From: Randal L. Schwartz @ 2006-06-12 16:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git
In-Reply-To: <86irn6wdob.fsf@blue.stonehenge.com>

>>>>> "Randal" == Randal L Schwartz <merlyn@stonehenge.com> writes:

>>>>> "Linus" == Linus Torvalds <torvalds@osdl.org> writes:
Linus> This sounds like _exactly_ what happens if you don't repack
Linus> occasionally.  Especially if you are using a filesystem without hashed
Linus> filename lookup, but it's true to some degree even with that - the
Linus> filesystem tends to end up spending tons of time in kernel space,
Linus> trying to find a place to put new objects.

Randal> I'm using git-svn to do a similar thing with an 11K-commit history.  It's now 4
Randal> days running, and yes, I'm repacking and deleting empty dirs every 200-300
Randal> commits, but I'm only up to commit 4000 or so.  At this rate, I *may* finish
Randal> by sometime next week. :(

Randal> However, I notice one thing that can't be good: .git/git-svn/revs has one file
Randal> per revision.  Yes, I'll end up with 11000 files in a single directory.  Ugh.

Another contributing factor is that there are 2500 files in the repo (at
revision 3931).  I was recording 20 commits a minute in the early part of the
cycle, and now I'm down to 1 commit every two minutes.  Doing a bit of
back-of-the-scribbled-on-envelope calcs, I won't be finished for
another two weeks or so. :(
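
The envelope arithmetic can be reproduced as below. This Python sketch
takes "11K" as exactly 11,000 commits and treats revision 3931 as the
number of commits already imported, both of which are approximations; the
function name is made up.

```python
def remaining_days(total_commits, done_commits, minutes_per_commit):
    """Days left if every remaining commit takes minutes_per_commit."""
    remaining = total_commits - done_commits
    return remaining * minutes_per_commit / (60 * 24)


if __name__ == "__main__":
    # Early rate: 20 commits a minute, i.e. 0.05 minutes per commit.
    print("at the early rate:  ", remaining_days(11_000, 3_931, 0.05), "days")
    # Current rate: 1 commit every two minutes.
    print("at the current rate:", remaining_days(11_000, 3_931, 2.0), "days")
```

At the current rate this comes to just under ten days; if the rate keeps
dropping as the history grows, "two weeks or so" is a plausible estimate.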

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 16:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <Pine.LNX.4.64.0606120906210.5498@g5.osdl.org>

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> > 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
>
> Hard to tell, it's obviously got short-lived processes there too that it's
> not showing, but equally obviously that svnimport script itself is
> spending an alarming amount of CPU time. I don't think it should do that
> much processing, but since it's written in perl, I can't read it.
>
> Are there any other directories that seem to be growing (eg some temp-file
> directory where the old files aren't cleaned away?). I can't imagine what
> else it could be doing in kernel space than simply some silly filesystem
> operation, but dang it all, Linux filesystems are usually very efficient
> indeed, unless we're talking huge directories (and if it's not the git
> object directory any more, it must be something else).

64 files in tmp.
But the SVN repository itself has 411,000 files in it. Split between
two directories.

Is there some pack equivalent for svn that I haven't found yet?

> At least with the cvs importer I have _some_ clue what it's doing, since I
> wrote an earlier version myself (very different, but at least I know what
> the operations are). SVN has always just confused me, and I have no idea
> what svnimport does, so I think I'll have to defer to somebody who
> actually knows the code.
>
> Smurf, have you looked at any larger repositories?
>
>                 Linus
>


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
From: Randal L. Schwartz @ 2006-06-12 16:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jon Smirl, git
In-Reply-To: <Pine.LNX.4.64.0606112028010.5498@g5.osdl.org>

>>>>> "Linus" == Linus Torvalds <torvalds@osdl.org> writes:

Linus> This sounds like _exactly_ what happens if you don't repack
Linus> occasionally.  Especially if you are using a filesystem without hashed
Linus> filename lookup, but it's true to some degree even with that - the
Linus> filesystem tends to end up spending tons of time in kernel space,
Linus> trying to find a place to put new objects.

I'm using git-svn to do a similar thing with an 11K-commit history.  It's now 4
days running, and yes, I'm repacking and deleting empty dirs every 200-300
commits, but I'm only up to commit 4000 or so.  At this rate, I *may* finish
by sometime next week. :(

However, I notice one thing that can't be good: .git/git-svn/revs has one file
per revision.  Yes, I'll end up with 11000 files in a single directory.  Ugh.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 16:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <Pine.LNX.4.64.0606120843340.5498@g5.osdl.org>

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> > I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> > any way to tell what it is doing in the kernel for so much time?
>
> oprofile will tell you.

I don't have profiling turned on in the kernel. I've turned it on so
I'll pick it up next time I reboot.
I'll kill everything and restart when my new RAM arrives tomorrow.

Hopefully the SVN import will finish before then but it doesn't look likely.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
From: Linus Torvalds @ 2006-06-12 16:12 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120855p1cec9acfy62dadb89c11756b4@mail.gmail.com>



On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
> 20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport

Hard to tell, it's obviously got short-lived processes there too that it's 
not showing, but equally obviously that svnimport script itself is 
spending an alarming amount of CPU time. I don't think it should do that 
much processing, but since it's written in perl, I can't read it.

Are there any other directories that seem to be growing (eg some temp-file 
directory where the old files aren't cleaned away?). I can't imagine what 
else it could be doing in kernel space than simply some silly filesystem 
operation, but dang it all, Linux filesystems are usually very efficient 
indeed, unless we're talking huge directories (and if it's not the git 
object directory any more, it must be something else).

At least with the cvs importer I have _some_ clue what it's doing, since I 
wrote an earlier version myself (very different, but at least I know what 
the operations are). SVN has always just confused me, and I have no idea 
what svnimport does, so I think I'll have to defer to somebody who 
actually knows the code.

Smurf, have you looked at any larger repositories?

		Linus

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 15:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux@horizon.com, git
In-Reply-To: <Pine.LNX.4.64.0606120843340.5498@g5.osdl.org>

On 6/12/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 12 Jun 2006, Jon Smirl wrote:
> >
> > I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> > any way to tell what it is doing in the kernel for so much time?
>
> oprofile will tell you.
>
> I don't see why it would spend a lot of time in the kernel, unless it's
> the SVN part that does a ton of reads or something. git should have almost
> no kernel footprint apart from the individual objects creation/reading, so
> once it's repacked, I generally see very little system time.
>
> What does top say? (Ie can you see _which_ process spends time in the
> kernel?)

top - 11:54:32 up 4 days,  1:27,  5 users,  load average: 1.85, 1.74, 1.55
Tasks: 135 total,   2 running, 133 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.7% us, 35.3% sy,  0.0% ni, 49.3% id,  0.0% wa,  0.2% hi,  0.5% si,  0
Mem:   1035740k total,  1020836k used,    14904k free,    18368k buffers
Swap: 118222276k total,   645124k used, 117577152k free,   183172k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14525 jonsmirl  16   0  604m 391m 1904 S   24 38.7 916:53.39 git-svnimport
20947 jonsmirl  17   0     0    0    0 R    1  0.0   0:00.03 git-svnimport
20864 jonsmirl  16   0  2120 1024  788 R    1  0.1   0:00.08 top
 2436 root      15   0 71184  28m 6100 S    0  2.8 119:13.55 Xorg
    1 root      16   0  1992  340  312 S    0  0.0   0:00.79 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:01.42 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 watchdog/0


-- 
Jon Smirl
jonsmirl@gmail.com

* Re: svn to git, N-squared?
From: Linus Torvalds @ 2006-06-12 15:45 UTC (permalink / raw)
  To: Jon Smirl; +Cc: linux@horizon.com, git
In-Reply-To: <9e4733910606120832xaf74e77pad7f70df864541fc@mail.gmail.com>



On Mon, 12 Jun 2006, Jon Smirl wrote:
> 
> I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
> any way to tell what it is doing in the kernel for so much time?

oprofile will tell you.

I don't see why it would spend a lot of time in the kernel, unless it's 
the SVN part that does a ton of reads or something. git should have almost 
no kernel footprint apart from the individual objects creation/reading, so 
once it's repacked, I generally see very little system time.

What does top say? (Ie can you see _which_ process spends time in the 
kernel?)

		Linus

* Re: svn to git, N-squared?
From: Jon Smirl @ 2006-06-12 15:32 UTC (permalink / raw)
  To: linux@horizon.com; +Cc: git, torvalds
In-Reply-To: <20060612043949.20992.qmail@science.horizon.com>

On 12 Jun 2006 00:39:49 -0400, linux@horizon.com <linux@horizon.com> wrote:
> Insanity is copying the data rather than just the file name.  Git is
> good about not reading unnecessary files, and anything necessary should
> be cached, so on-disk fragmentation is not a concern.

I've run a pack and I moved the objects to new directories. Directory
is 746M with 64K files now.

I've stabilized like this. 1GB RAM with 2.8GHz P4 hyperthread. Is there
any way to tell what it is doing in the kernel for so much time?

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 1  0 599128  24712  38196 247008    0    0     0     0  451   382 12 39 48  0  0
 1  0 599128  24836  38196 246872    0    0     0     4  413   354 15 38 48  0  0
 1  0 599128  24960  38212 246856    0    0     0    64  453   390 15 37 48  0  0
 1  0 599128  24960  38212 246856    0    0     0     0  414   367 12 40 49  0  0
 1  0 599128  23504  38212 248216    0    0     0     0  448   365 13 39 48  0  0
 1  0 599128  24156  38212 247604    0    0     0     0  407   355 13 39 49  0  0
 1  0 599128  25240  38212 246652    0    0     0     0  446   390 13 39 48  0  0
 1  0 599128  25240  38224 246572    0    0     4    48  415   418 12 40 47  0  0
 1  0 599128  25116  38232 246496    0    0     0    12  452   432 12 40 48  0  0

Still doesn't seem to be making much forward progress.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: bisect and gitk happy together
From: Linus Torvalds @ 2006-06-12 15:10 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: git
In-Reply-To: <46a038f90606120441p74dd4872y441fe04470f4acd5@mail.gmail.com>



On Mon, 12 Jun 2006, Martin Langhoff wrote:
> 
> - git-bisect visualise wasn't as useful as just a plain gitk. (This
> may be because I was working with ~60 commits in a medium-sized
> project).

Definitely. Try just firing up gitk when you're bisecting a kernel archive 
with thousands of commits, and complex history..

That's when "git bisect visualize" really helps: when git bisect has 
already narrowed down the list of commits from "5 years" to "1 week", but 
you still have maybe a hundred-odd commits to go.
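
To put "a hundred-odd commits to go" in perspective: halving the candidate
range each time needs only about seven more tests. The Python sketch below
assumes a purely linear history, which real git bisect does not require
(it works on the commit graph), and the function name is made up.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    `commits` is ordered oldest to newest, the last entry is known to be
    bad, and everything before commits[0] is known to be good.
    Returns (first_bad_commit, number_of_tests_run).
    """
    lo, hi = 0, len(commits) - 1  # the first bad commit is in commits[lo..hi]
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid      # first bad commit is at mid or earlier
        else:
            lo = mid + 1  # first bad commit is after mid
    return commits[lo], tests


if __name__ == "__main__":
    history = list(range(100))  # stand-in for a hundred commits
    print(bisect_first_bad(history, lambda c: c >= 73))
```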

I agree that just plain "gitk" is actually nicer if you want to see the 
whole context. It's just that often the context is pretty damn confusing ;)

> - gitk didn't show the bad commit tagged specially, even if
> git-bisect had just identified it. Of course I could find it, but I
> had all the other good/bad commits well labelled. And not the one I
> was looking for. Odd.

It should be the head of the "bisect" branch, and naturally tagged that 
way.

			Linus

* Re: [PATCH] gitweb: Adding a `blame' interface.
From: Linus Torvalds @ 2006-06-12 14:59 UTC (permalink / raw)
  To: Florian Forster; +Cc: Martin Langhoff, git
In-Reply-To: <20060612082448.GA11857@verplant.org>



On Mon, 12 Jun 2006, Florian Forster wrote:
> 
> Would it help to cache `git-annotate's output, e.g. using one of the
> `Cache::Cache' modules? Or is browsing of blobs too sparse for this to
> result in a performance gain? I'm sure the modules could be integrated
> as a weak precondition.

The apache setup at least on kernel.org is already set up to do caching, 
as long as the generated headers for the page allow it in the first place.

So caching inside gitweb is generally pointless, at least when it's at the 
level of one result page. At a higher level, if the internal caching might 
improve performance of _other_ pages because it caches the result of some 
intermediate important thing, it might be a different issue.

		Linus

* [PATCH] cvsimport: keep one index per branch during import
From: Martin Langhoff @ 2006-06-12 11:50 UTC (permalink / raw)
  To: junkio, git; +Cc: Martin Langhoff

With this patch we have a speedup and much lower IO when
importing trees with many branches. Instead of forcing
index re-population for each branch switch, we keep
many index files around, one per branch.

Signed-off-by: Martin Langhoff <martin@catalyst.net.nz>

---

This patch should get some review. It is trivial, but not fully tested.
I am testing it on the moz repo (which will take a while) to check that I get
the same result with and without it. 

Performance-wise, it seems to be doing ~15K commits per hour, with
the mozilla repo, up from ~6Kcph on the same hardware. Of course, 
this is only noticeable in projects with lots of concurrent branches.
Linear projects don't get much from this patch.

With this change, we are now truly waiting on cvs to hand over the
files pronto! Running locally, it is apparent that it isn't IO wait
but the latency of the chatty cvs protocol that is making this slow.

Probably forking 2 or 3 processes to prefetch filerevs from cvs
and put them in a queue directory for the main process to pick
up would work wonders. Actually, they could call git-hash-object
and just put some file metadata in the queue directory. 
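
The bookkeeping the patch introduces, independent of the Perl details, is
roughly the following. This is a Python sketch for illustration only: the
class name BranchIndexes and its methods are made up, and the caller is
expected to run git-read-tree itself when switch_to() reports a new index.

```python
import os
import tempfile


class BranchIndexes:
    """Keep one temporary index file per branch, created on first use."""

    def __init__(self, tmpdir=None):
        self._tmpdir = tmpdir
        self._paths = {}  # branch name -> index file path

    def switch_to(self, branch, env=os.environ):
        """Point GIT_INDEX_FILE at the index for `branch`.

        Returns True when the index was just created and still has to be
        populated (e.g. with git-read-tree), False when it is reused.
        """
        is_new = branch not in self._paths
        if is_new:
            fd, path = tempfile.mkstemp(prefix="git", suffix=".idx",
                                        dir=self._tmpdir)
            os.close(fd)
            self._paths[branch] = path
        env["GIT_INDEX_FILE"] = self._paths[branch]
        return is_new

    def cleanup(self):
        """Remove every index file created so far."""
        for path in self._paths.values():
            os.unlink(path)
        self._paths.clear()
```

Switching back to a branch seen earlier then costs one environment update
instead of a full index re-population.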
---
 git-cvsimport.perl |   37 ++++++++++++++++++++++++++++++-------
 1 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/git-cvsimport.perl b/git-cvsimport.perl
old mode 100755
new mode 100644
index 76f6246..9c4588f
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -465,10 +465,15 @@ my $git_dir = $ENV{"GIT_DIR"} || ".git";
 $ENV{"GIT_DIR"} = $git_dir;
 my $orig_git_index;
 $orig_git_index = $ENV{GIT_INDEX_FILE} if exists $ENV{GIT_INDEX_FILE};
-my ($git_ih, $git_index) = tempfile('gitXXXXXX', SUFFIX => '.idx',
-				    DIR => File::Spec->tmpdir());
-close ($git_ih);
-$ENV{GIT_INDEX_FILE} = $git_index;
+
+my %index; # holds filenames of one index per branch
+{   # init with an index for origin
+    my ($fh, $fn) = tempfile('gitXXXXXX', SUFFIX => '.idx',
+			     DIR => File::Spec->tmpdir());
+    close ($fh);
+    $index{$opt_o} = $fn;
+}
+$ENV{GIT_INDEX_FILE} = $index{$opt_o};
 unless(-d $git_dir) {
 	system("git-init-db");
 	die "Cannot init the GIT db at $git_tree: $?\n" if $?;
@@ -496,6 +501,13 @@ unless(-d $git_dir) {
 	$tip_at_start = `git-rev-parse --verify HEAD`;
 
 	# populate index
+	unless ($index{$last_branch}) {
+	    my ($fh, $fn) = tempfile('gitXXXXXX', SUFFIX => '.idx',
+				     DIR => File::Spec->tmpdir());
+	    close ($fh);
+	    $index{$last_branch} = $fn;
+	}
+	$ENV{GIT_INDEX_FILE} = $index{$last_branch};
 	system('git-read-tree', $last_branch);
 	die "read-tree failed: $?\n" if $?;
 
@@ -776,8 +788,17 @@ while(<CVS>) {
 		}
 		if(($ancestor || $branch) ne $last_branch) {
 			print "Switching from $last_branch to $branch\n" if $opt_v;
-			system("git-read-tree", $branch);
-			die "read-tree failed: $?\n" if $?;
+			unless ($index{$branch}) {
+			    my ($fh, $fn) = tempfile('gitXXXXXX', SUFFIX => '.idx',
+						     DIR => File::Spec->tmpdir());
+			    close ($fh);
+			    $index{$branch} = $fn;
+			    $ENV{GIT_INDEX_FILE} = $index{$branch};
+			    system("git-read-tree", $branch);
+			    die "read-tree failed: $?\n" if $?;
+			} else {
+			    $ENV{GIT_INDEX_FILE} = $index{$branch};
+		        }
 		}
 		$last_branch = $branch if $branch ne $last_branch;
 		$state = 9;
@@ -841,7 +862,9 @@ #	VERSION:1.96->1.96.2.1
 }
 commit() if $branch and $state != 11;
 
-unlink($git_index);
+foreach my $git_index (values %index) {
+    unlink($git_index);
+}
 
 if (defined $orig_git_index) {
 	$ENV{GIT_INDEX_FILE} = $orig_git_index;
-- 
1.4.0.g5fba

* bisect and gitk happy together
From: Martin Langhoff @ 2006-06-12 11:41 UTC (permalink / raw)
  To: git

I was using git-bisect earlier today, and at the exact point where it
told me about the bad commit, I opened gitk, which was showing all the
bad and good commits. It is great!

Two "user" notes, however:

 - git-bisect visualise wasn't as useful as just a plain gitk. (This
may be because I was working with ~60 commits in a medium-sized
project).

 - gitk didn't show the bad commit tagged specially, even if
git-bisect had just identified it. Of course I could find it, but I
had all the other good/bad commits well labelled. And not the one I
was looking for. Odd.

In any case, the bisect + gitk combo saved the day. I'm too ashamed to
tell what the bug actually was, though ;-)


martin

* Re: Collecting cvsps patches
From: Anand Kumria @ 2006-06-12 11:27 UTC (permalink / raw)
  To: git
In-Reply-To: <20060611224205.GF1297@nowhere.earth>

On Mon, 12 Jun 2006 00:42:05 +0200, Yann Dirson wrote:

> http://ydirson.free.fr/soft/git/cvsps.git

I think you need to chmod +x hooks/post-update

and then run 'git-update-server-info'.

Cheers,
Anand

* Re: [PATCH] gitweb: Adding a `blame' interface.
From: Shawn Pearce @ 2006-06-12  9:19 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Florian Forster, git
In-Reply-To: <Pine.LNX.4.63.0606121107520.21813@wbgn013.biozentrum.uni-wuerzburg.de>

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
> 
> On Mon, 12 Jun 2006, Shawn Pearce wrote:
> 
> >   [gitweb]
> >     description=<div class=\"description\">\n\
> > This is a chunk of text which describes this repository.  Some\n\
> > of this text might be rather long, and might need many lines to\n\
> > really be able to describe the repository in a nice editor such as\n\
> > vi running in an 80 character wide xterm.\n\
> > </div>
> 
> AFAIK the trailing "\" will not work.

Actually it does.  I figured out that it works (and why it works)
when I implemented the GIT repository parser in Java for my pure
Java version of GIT...

For example:

  [spearce@spearce-pb15 bob]$ cat .git/config 
  [core]
          repositoryformatversion = 0
          filemode = true
  [gitweb]
          description = This is a very\nlong line to put into GIT\n\
  repo config.\n\
  I hope it works.
          on = true
  [spearce@spearce-pb15 bob]$ git repo-config gitweb.description
  This is a very
  long line to put into GIT
  repo config.
  I hope it works.
  [spearce@spearce-pb15 bob]$ git repo-config gitweb.on
  true

The use of a trailing \ makes sense; the collapsing of multiple
spaces into one space unless quoted inside of "" doesn't.
But whatever...
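
The behaviour shown in the session above can be modelled with a few lines
of Python. This is a simplified sketch and not git's actual config parser:
it only handles backslash-newline continuation and the \n, \t, \" and \\
escapes, and it ignores quoting and whitespace collapsing entirely. The
function name parse_value is made up.

```python
def parse_value(raw_lines):
    """Join backslash-continued lines and expand a few escapes."""
    text = "\n".join(raw_lines)
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "\n":
                pass  # backslash-newline: continuation, emits nothing
            elif nxt == "n":
                out.append("\n")
            elif nxt == "t":
                out.append("\t")
            elif nxt in '"\\':
                out.append(nxt)
            else:
                raise ValueError(f"unknown escape \\{nxt}")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    lines = [
        "This is a very\\nlong line to put into GIT\\n\\",
        "repo config.\\n\\",
        "I hope it works.",
    ]
    print(parse_value(lines))
```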

-- 
Shawn.

* Re: [PATCH] gitweb: Adding a `blame' interface.
From: Johannes Schindelin @ 2006-06-12  9:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Martin Langhoff, Florian Forster, git
In-Reply-To: <20060612084056.GA29220@spearce.org>

Hi,

On Mon, 12 Jun 2006, Shawn Pearce wrote:

>   [gitweb]
>     description=<div class=\"description\">\n\
> This is a chunk of text which describes this repository.  Some\n\
> of this text might be rather long, and might need many lines to\n\
> really be able to describe the repository in a nice editor such as\n\
> vi running in an 80 character wide xterm.\n\
> </div>

AFAIK the trailing "\" will not work.

Ciao,
Dscho

* Re: [PATCH] gitweb: Adding a `blame' interface.
From: Shawn Pearce @ 2006-06-12  8:40 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Florian Forster, git
In-Reply-To: <46a038f90606120134n21c269bbj3e8c7e31d4d93a23@mail.gmail.com>

Martin Langhoff <martin.langhoff@gmail.com> wrote:
> >- If `GIT_DIR/description' is only used by gitweb it may be more
> >  consistent to use the git-repo-config option `gitweb.description' in
> >  the future.
> 
> Not sure how git-repo configurations deal with long entries. Right now
> the description may contain html for instance.

It has to be escaped, which could be ugly with HTML.  For example:

  [gitweb]
    description=<div class=\"description\">\n\
This is a chunk of text which describes this repository.  Some\n\
of this text might be rather long, and might need many lines to\n\
really be able to describe the repository in a nice editor such as\n\
vi running in an 80 character wide xterm.\n\
</div>

Forget a \ in front of a double quote (") or an LF and the entry is
corrupt.  So as nice as it sounds it might not be the best way to
obtain a description for gitweb.

-- 
Shawn.

* Re: [PATCH] gitweb: Adding a `blame' interface.
From: Martin Langhoff @ 2006-06-12  8:34 UTC (permalink / raw)
  To: Florian Forster; +Cc: git
In-Reply-To: <20060612082448.GA11857@verplant.org>

On 6/12/06, Florian Forster <octo@verplant.org> wrote:
> On Mon, Jun 12, 2006 at 10:02:05AM +1200, Martin Langhoff wrote:
> > good! git-blame/git-annotate are quite expensive to run. Do you think
> > it would make sense making it conditional on a git-repo-config option
> > (gitweb.blame=1)?
>
> sure, that it's a big change and if it helps the kernel.org folks ;)
> I'll follow-up with a patch for this in a second..

That'd be great. I am looking into integrating other feature patches
too (like tarball downloads) that are useful but costly, making them
conditional too...

> Would it help to cache `git-annotate's output, e.g. using one of the

I think we can rely on proxies doing good caching -- a busy host like
kernel.org will have big reverse proxies in front. A git-blame for a
given file+commitsha doesn't change, so we can give it a long cache
time, like... forever ;-)

> I have two more points regarding gitweb's configuration:
> - IMHO it would make sense to move the general gitweb-configuration
>   (where are the repositories, where are the binaries, etc) out of the
>   script.  As far as I know the Debian maintainer of the `gitweb'
>   package has asked for this before but was refused for some reason..

Sounds like a reasonable request. I would make it rely on env vars,
$ENV{GITWEB_CONFIG} can generally point to /etc/gitweb.conf, and that
would override the config values we have.

This is trivial, and it means we buy a lot of flexibility from
apache's httpd.conf being able to point to different config files
depending on arbitrary conditions.

BTW, I haven't seen the Debian maintainer's request, was that on the list?
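
The override scheme could look something like the Python sketch below.
Everything concrete in it is an assumption made for illustration: the
default keys and values, the simple key=value file format, and the function
name load_config are all hypothetical; only the GITWEB_CONFIG variable name
comes from the proposal above.

```python
import os

# Hypothetical built-in defaults; the real script has its own settings.
DEFAULTS = {"projectroot": "/pub/git", "gitbin": "/usr/bin"}


def load_config(environ=os.environ, defaults=DEFAULTS):
    """Return the defaults, overridden by the file named in GITWEB_CONFIG."""
    config = dict(defaults)
    path = environ.get("GITWEB_CONFIG")
    if path and os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                key, sep, value = line.partition("=")
                if sep:
                    config[key.strip()] = value.strip()
    return config
```

Since the web server sets the environment per virtual host or location,
one installed script can then serve several differently configured sites.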

> - If `GIT_DIR/description' is only used by gitweb it may be more
>   consistent to use the git-repo-config option `gitweb.description' in
>   the future.

Not sure how git-repo configurations deal with long entries. Right now
the description may contain html for instance.



martin

