From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Keith Packard" <keithp@keithp.com>
Cc: "Martin Langhoff" <martin.langhoff@gmail.com>, git <git@vger.kernel.org>
Subject: Re: packs and trees
Date: Tue, 20 Jun 2006 12:33:10 -0400
Message-ID: <9e4733910606200933p2e802954rdf50d5f0ac037677@mail.gmail.com>
In-Reply-To: <1150816728.5382.27.camel@neko.keithp.com>

On 6/20/06, Keith Packard <keithp@keithp.com> wrote:
> > Even after spending eight hours building the changeset info it is
> > still going to take it a couple of days to retrieve the versions one
> > at a time and write them to git. Reparsing 50MB delta files n^2/2
> > times is a major bottleneck for all three programs.
>
> The eight hours in question *were* writing out the deltas and packing
> the resulting trees. All that remained was to construct actual commit
> objects and write them out.
>
> The problem was that parsecvs's internals are structured so that this
> process would take a large amount of memory, so I'm reworking the code
> to free stuff as it goes along.

How about writing out all of the revisions from the cvs file using the
yacc code the first time the file is encountered and parsed? Then you
only have to track git IDs and not all of those cumbersome CVS rev
numbers. When I was profiling parsecvs, the hottest parts of the code
were extracting the revisions and comparing CVS rev numbers. Since the
git IDs are fixed size, they work well in arrays and with pointer
compares for sorting. With the right data structure you should be able
to eliminate the CVS rev numbers that are so slow to deal with.
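
To make the fixed-size idea concrete, here is a minimal sketch in C. It
is not parsecvs code: struct rev, rev_cmp and sort_revs are hypothetical
names, and the 8-byte id is an assumed handle for the git object (an
index or pointer-sized value), not a full object name. The point is only
that ordering needs nothing but integer compares.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size record, one per revision (16 bytes). */
struct rev {
    uint64_t id;        /* assumed: 8-byte handle for the git object */
    int64_t  timestamp; /* commit time, seconds since the epoch */
};

/* Order by timestamp, then by id, using only integer compares.
 * No CVS revision strings are parsed or compared here. */
static int rev_cmp(const void *a, const void *b)
{
    const struct rev *x = a;
    const struct rev *y = b;

    if (x->timestamp != y->timestamp)
        return x->timestamp < y->timestamp ? -1 : 1;
    if (x->id != y->id)
        return x->id < y->id ? -1 : 1;
    return 0;
}

/* Sort an array of n revisions in place. */
static void sort_revs(struct rev *revs, size_t n)
{
    assert(revs != NULL || n == 0);
    if (n > 1)
        qsort(revs, n, sizeof *revs, rev_cmp);
}
```

Calling sort_revs() on the array once every file has been parsed would
give a time-ordered list of revisions without touching a rev number.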

There are about 1M revisions in moz cvs. At eight bytes for an ID and
eight bytes for a timestamp, that is 16MB if ordering is achieved via
arrays. All of the symbols fit into 400K including pointers to their
revision. If the revs are written out as they are encountered there is
no need to save file names, but you do need one rev structure per
file. Throw in some more memory for relationship pointers. All of this
should fit into less than 100MB RAM.
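
The arithmetic behind that estimate can be written down as a small C
sketch. The function and constant names are made up for illustration;
the inputs are the rough figures above (about 1M revisions, 8 + 8 bytes
each, 400K of symbols, a 100MB budget), with MB and K read as decimal
units. The per-file structures and relationship pointers are not
counted, so the result is only the headroom left over for them.

```c
#include <assert.h>
#include <stdint.h>

enum {
    ID_BYTES        = 8,  /* per-revision ID size assumed above */
    TIMESTAMP_BYTES = 8   /* per-revision timestamp size */
};

/* Bytes needed for n_revs revisions stored as (id, timestamp) pairs. */
static uint64_t rev_table_bytes(uint64_t n_revs)
{
    return n_revs * (ID_BYTES + TIMESTAMP_BYTES);
}

/* Bytes left in the budget after the revision table and the symbols.
 * What remains has to cover the per-file rev structures and the
 * relationship pointers, which are not estimated here. */
static uint64_t headroom_bytes(uint64_t budget, uint64_t n_revs,
                               uint64_t symbol_bytes)
{
    uint64_t used = rev_table_bytes(n_revs) + symbol_bytes;

    assert(used <= budget);
    return budget - used;
}
```

With 1,000,000 revisions the table comes to 16,000,000 bytes, and
headroom_bytes(100000000, 1000000, 400000) leaves 83,600,000 bytes.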

>
> With a rewritten parsecvs, I'm hoping to be able to steal the algorithms
> from cvs2svn and stick those in place. Then work on truncating the
> history so it can deal with incremental updates to the repository, which
> I think will be straightforward if we stick a few breadcrumbs in the git
> repository to recover state from.


-- 
Jon Smirl
jonsmirl@gmail.com
