Re: fast-import and unique objects.

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Shawn Pearce" <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: fast-import and unique objects.
Date: Mon, 7 Aug 2006 10:37:30 -0400	[thread overview]
Message-ID: <9e4733910608070737k52aaea7clf871d716d16547c2@mail.gmail.com> (raw)
In-Reply-To: <20060807050422.GD20514@spearce.org>

On 8/7/06, Shawn Pearce <spearce@spearce.org> wrote:
> > I'm staring at the cvs2svn code now trying to figure out how to modify
> > it without rewriting everything. I may just leave it all alone and
> > build a table with cvs_file:rev to sha-1 mappings. It would be much
> > more efficient to carry sha-1 throughout the stages but that may
> > require significant rework.
>
> Does it matter?  How long does the cvs2svn processing take,
> excluding the GIT blob processing that's now known to take 2 hours?
> What's your target for an acceptable conversion time on the system
> you are working on?

As is, it takes the code about a week to import MozCVS into
Subversion. But I've already addressed the core of why that was taking
so long. The original code forks off a copy of cvs for each revision
to exact the text. Doing that 1M times takes about two days. The
version with fast-import takes two hours.

At the end of the process cvs2svn forks off svn 250K times to import
the change sets. That takes about four days to finish. Doing a
fast-import backend should fix that.

> Any thoughts yet on how you might want to feed trees and commits
> to a fast pack writer?  I was thinking about doing a stream into
> fast-import such as:

The data I have generates an output that indicates add/change/delete
for each file name. Add/change should have an associated sha-1 for the
new revision. cvs/svn have no concept of trees.

How about sending out a stream of add/change/delete operations
interspersed with commits? That would let fast-import track the tree
and only generate tree nodes when they change.

The protocol may need some thought. I need to be able to handle
branches and labels too.

>         <4 byte length of commit><commit><treeent>*<null>
>
> where <commit> is the raw commit minus the first "tree nnn\n" line, and
> <treeent> is:
>
>         <type><sp><sha1><sp><path><null>
>
> where <type> is one of 'B' (normal blob), 'L' (symlink), 'X'
> (executable blob), <sha1> is the 40 byte hex, <path> is the file from
> the root of the repository ("src/module/foo.c"), and <sp> and <null>
> are the obvious values.  You would feed all tree entries and the pack
> writer would split the stream up into the individual tree objects.
>
> fast-import would generate the tree(s) delta'ing them against the
> prior tree of the same path, prefix "tree nnn\n" to the commit
> blob you supplied, generate the commit, and print out its ID.
> By working from the first commit up to the most recent each tree
> deltas would be using the older tree as the base which may not be
> ideal if a large number of items get added to a tree but should be
> effective enough to generate a reasonably sized initial pack.
>
> It would however mean you need to monitor the output pipe from
> fast-import to get back the commit id so you can use it to prep
> the next commit's parent(s) as you can't produce that in Python.
>
> --
> Shawn.
>

-- 
Jon Smirl
jonsmirl@gmail.com

next prev parent reply	other threads:[~2006-08-07 14:37 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-06 12:32 fast-import and unique objects Jon Smirl
2006-08-06 15:53 ` Jon Smirl
2006-08-06 18:03   ` Shawn Pearce
2006-08-07  4:48     ` Jon Smirl
2006-08-07  5:04       ` Shawn Pearce
2006-08-07 14:37         ` Jon Smirl [this message]
2006-08-07 14:48           ` Jakub Narebski
2006-08-07 18:45             ` Jon Smirl
2006-08-08  3:12           ` Shawn Pearce
2006-08-08 12:11             ` Jon Smirl
2006-08-08 22:45               ` Shawn Pearce
2006-08-08 23:56                 ` Jon Smirl
2006-08-07  5:10       ` Martin Langhoff
2006-08-07  7:57     ` Ryan Anderson
2006-08-07 23:02       ` Shawn Pearce

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910608070737k52aaea7clf871d716d16547c2@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).