git.vger.kernel.org archive mirror
From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Shawn Pearce" <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: git-fast-import
Date: Sun, 6 Aug 2006 00:09:22 -0400
Message-ID: <9e4733910608052109v5d4d348ci6aa986cc04939116@mail.gmail.com>
In-Reply-To: <20060806034009.GE20565@spearce.org>

On 8/5/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > git-fast-import works great. I parsed and built my pack file in
> > 1:45hr. That's way better than 24hr. I am still IO bound but that
> > seems to be an issue with not being able to read ahead 150K small
> > files. CPU utilization averages about 50%.
>
> Excellent.  Now if only the damn RCS files were in a more suitable
> format.  :-)
>
> > I didn't bother reading the sha ids back from fast-import, instead I
> > computed them in the python code. Python has a C library function for
> > sha1. That decouples the processes from each other. They would run in
> > parallel on SMP.
>
> At least you are IO bound and not CPU bound.  But it is silly for the
> importer in Python to be computing the SHA1 IDs and for fast-import
> to also be computing them.  Would it help if fast-import allowed
> you to feed in a tag string which it dumps to an output file listing
> SHA1 and the tag?  Then you can feed that data file back into your
> tree/commit processing for revision handling.

I am IO bound; there is plenty of CPU, and I am on a 2.8GHz single processor.
The sha1 is getting stored into an internal Python structure. The
structures then get sliced and diced a thousand ways to compute the
change sets.
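
For reference, the ID computed on the Python side for each file revision is the SHA-1 of a short "blob <size>\0" header followed by the content, which is how git names blob objects. A minimal sketch using Python's hashlib; the function name git_blob_sha1 is made up for illustration and is not taken from the importer discussed here:

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the hex SHA-1 that git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Same value as `printf 'hello world\n' | git hash-object --stdin`.
    print(git_blob_sha1(b"hello world\n"))
```

Because the ID depends only on the content, the importer can compute it without waiting for anything to come back from fast-import.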

The real goal of this is to use the cvs2svn code for change set
detection. Look at how much work these guys have put into making it
work on the various messed-up CVS repositories.
http://git.catalyst.net.nz/gitweb?p=cvs2svn.git;a=shortlog;h=a9167614a7acec27e122ccf948d1602ffe5a0c4b

cvs2svn is the only tool that read and built change sets for Moz CVS
on the first try.

> > My pack file is 980MB compared to 680MB from other attempts. I am
> > still missing entries for the trees and commits.
>
> The delta selection ain't the best.  It may be the case that prior
> attempts were combining files to get better delta chains vs. staying

My suspicion is that prior attempts weren't capturing all of the
revisions. I know cvsps (the 680MB repo) was throwing away branches
that it didn't understand. I don't think anyone got parsecvs to run to
completion. MozCVS has 1,500 branches.

> all in one file.  It may be the case that the branches are causing
> the delta chains to not be ideal.  I guess I expected slightly
> better but not that much; earlier attempts were around 700 MB so
> I thought maybe you'd be in the 800 MB ballpark.  Under 1 GB is
> still good though as it means it's feasible to fit the damn thing
> into memory on almost any system, which makes it pretty repackable
> with the standard packing code.

I am still missing all of the commits and trees. Don't know how much
they will add yet.

> It's possible that you are also seeing duplicates in the pack;
> I actually wouldn't be surprised if at least 100 MB of that was
> duplicates where the author(s) reverted a file revision to an exact
> prior revision, such that the SHA1 IDs were the same.  fast-import
> (as I have previously said) is stupid and will write the content
> out twice rather than "reuse" the existing entry.
>
> Tonight I'll try to improve fast-import.c to include index
> generation, and at the same time perform duplicate removal.
> That should get you over the GPF in index-pack.c, may reduce disk
> usage a little for the new pack, and save you from having to perform
> a third pass on the new pack.

Sounds like a good plan.
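
The duplicate removal described above amounts to remembering which SHA-1 IDs have already been written and skipping content that hashes to a known ID. A minimal Python sketch of that idea; the class BlobStore and its in-memory list are hypothetical stand-ins and do not reflect how fast-import.c is or will be implemented:

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hex SHA-1 of a git blob: header 'blob <size>\\0' plus content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


class BlobStore:
    """Hypothetical writer that stores each distinct blob only once."""

    def __init__(self):
        self.index = {}    # sha1 -> position of the stored blob
        self.written = []  # stand-in for the pack being built

    def add(self, data: bytes):
        """Store data unless an identical blob exists; return (sha1, was_new)."""
        sha = git_blob_sha1(data)
        if sha in self.index:
            # Reverted file revision: reuse the existing entry.
            return sha, False
        self.index[sha] = len(self.written)
        self.written.append(data)
        return sha, True


if __name__ == "__main__":
    store = BlobStore()
    for rev in (b"v1\n", b"v2\n", b"v1\n"):  # third revision reverts to the first
        print(store.add(rev))
    print(len(store.written), "blobs stored")
```

The same table of IDs is what an index would be built from, which is why index generation and duplicate removal fit naturally in one pass.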

-- 
Jon Smirl
jonsmirl@gmail.com
